Principal Site Reliability Engineer

Lilly

Hyderabad · Posted 180d ago

Job Description

What You’ll Be Doing 

  • Lead the SRE team responsible for the reliability and performance of applications deployed on a cloud-native internal platform. 

  • Design, implement, and maintain automation frameworks, self-service tooling, and auto-healing systems to eliminate manual toil. 

  • Build and enhance end-to-end observability, monitoring, logging, and alerting systems for proactive issue detection and resolution. 

  • Ensure uptime by taking ultimate ownership of the production environment's stability: lead end-to-end incident management, from escalation to Root Cause Analysis (RCA), and manage patching, upgrades, and disaster recovery processes. 

  • Champion Infrastructure as Code (IaC) and CI/CD best practices to ensure consistent, repeatable, and secure deployments. 

  • Collaborate with development and product teams to embed reliability and scalability into application design and architecture. 

  • Continuously evaluate and introduce emerging tools and technologies to keep the SRE stack modern and efficient. 

  • Mentor and guide SRE engineers, fostering a culture of ownership, innovation, and continuous improvement. 

  • Implement AIOps frameworks to streamline operational tasks and enhance system self-healing capabilities. 

  • Participate in and optimise the on-call rotation, striving to minimise human intervention through automation. 

  • Drive capacity planning, disaster recovery, and business continuity initiatives. 

  • Support onboarding, documentation, and knowledge sharing for platform services and operational best practices. 

 
