Site Reliability Engineer
NatWest
Job Description
Your role will also involve:
- Anchor and provide strategic direction on technologies and solutions in digital operations. Lead infrastructure and application builds and technical maintenance alongside the core engineering and delivery teams.
- Act as custodian of SRE SLOs, SLIs, and error budgets.
- Application scalability & optimization: Assist in designing and implementing scalable, highly available system architectures to handle increasing loads and user demands without compromising performance.
- Create and optimize CI/CD pipelines to automate testing and deployment, reducing the time from development to production and ensuring consistent quality control.
- Design, monitor, and respond to system alerts; monitor system performance, identify bottlenecks, and implement optimizations and permanent fixes.
- Manage incident response protocols, including on-call rotations.
- Conduct post-incident reviews to prevent recurrence and refine the system reliability framework.
- Provide primary operational support and engineering for multiple large-scale distributed software applications.
- Collaborate with development operations staff to create, monitor, and troubleshoot the system infrastructure.
- Increase system resilience and serve larger customer volumes through expert-level coding and bulletproof release and change management skills.
- Improve automation and increase the system’s self-healing capability.
- Collect operating system data and report performance metrics to stakeholders.
- Manage cloud and database system maintenance, debugging production issues as they arise.
- Ensure the effective and seamless integration of security policies and practices into DevOps workflows to reduce overall risk and deliver products and services on time.
- Implement end-to-end (E2E) automated vulnerability assessment and penetration testing (VAPT) for new and existing applications.
- Reduce planned deployment downtime by 50% by ensuring a robust CI/CD setup.
- Bring MTTR (mean time to recovery) below 2 hours for major issues and MTTD (mean time to detect) below 5 minutes with the help of automated tools and methods.
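
As a rough illustration of how the reliability targets in this list (error budgets, MTTR under 2 hours, MTTD under 5 minutes) might be tracked, here is a minimal Python sketch. The `Incident` record, the function names, and the sample figures are hypothetical and not taken from any NatWest tooling. It also assumes MTTR is measured from the start of the fault to resolution; some teams measure it from detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    started: datetime   # when the fault began
    detected: datetime  # when alerting flagged it
    resolved: datetime  # when service was restored


def mean_delta(deltas):
    """Average a non-empty iterable of timedeltas."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean time to detect: fault start to detection."""
    return mean_delta(i.detected - i.started for i in incidents)


def mttr(incidents):
    """Mean time to recovery, assumed here to run from fault start to resolution."""
    return mean_delta(i.resolved - i.started for i in incidents)


def error_budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget still unspent for a request-based SLO.

    With slo=0.999, up to 0.1% of requests may fail before the budget is gone.
    A negative result means the budget is overspent.
    """
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


def meets_targets(incidents,
                  mttr_target=timedelta(hours=2),
                  mttd_target=timedelta(minutes=5)):
    """Check the posting's targets: MTTR < 2 hours and MTTD < 5 minutes."""
    return mttr(incidents) < mttr_target and mttd(incidents) < mttd_target


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 9, 0)
    sample = [
        Incident(t0, t0 + timedelta(minutes=3), t0 + timedelta(minutes=90)),
        Incident(t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=110)),
    ]
    print("MTTD:", mttd(sample))
    print("MTTR:", mttr(sample))
    print("Targets met:", meets_targets(sample))
    print("Budget left:", error_budget_remaining(0.999, 1_000_000, 250))
```

In practice these figures would come from an incident management or monitoring system rather than hand-built records; the sketch only shows the arithmetic behind the targets.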