Principal Site Reliability Engineer
Lilly
Job Description
What You’ll Be Doing
- Lead the SRE team responsible for the reliability and performance of applications deployed on a cloud-native internal platform.
- Design, implement, and maintain automation frameworks, self-service tooling, and auto-healing systems to eliminate manual toil.
- Build and enhance end-to-end observability, monitoring, logging, and alerting systems for proactive issue detection and resolution.
- Ensure uptime: Take ultimate ownership of our production environment's stability. Lead end-to-end incident management, from escalation to Root Cause Analysis (RCA). Manage patching, upgrades, and disaster recovery processes.
- Champion Infrastructure as Code (IaC) and CI/CD best practices to ensure consistent, repeatable, and secure deployments.
- Collaborate with development and product teams to embed reliability and scalability into application design and architecture.
- Continuously evaluate and introduce emerging tools and technologies to keep the SRE stack modern and efficient.
- Mentor and guide SRE engineers, fostering a culture of ownership, innovation, and continuous improvement.
- Implement AIOps frameworks to improve operational tasks and enhance system self-healing capabilities.
- Participate in and optimise the on-call rotation, striving to minimise human intervention through automation.
- Drive capacity planning, disaster recovery, and business continuity initiatives.
- Support onboarding, documentation, and knowledge sharing for platform services and operational best practices.