Site Reliability Engineering Manager
HPE
Job Description
What you’ll do:
- Lead and mentor a team of Site Reliability Engineers, supporting their growth, performance, and well-being.
- Own the reliability strategy for SASE cloud infrastructure systems, including incident management, SLIs/SLOs, and capacity planning.
- Partner with Engineering, Product, and Security teams to design and deliver highly available, scalable, and resilient cloud-native services.
- Guide the team in building automation, improving observability, and increasing the operational efficiency of our cloud infrastructure.
- Drive adoption of best practices in monitoring, alerting, on-call operations, and runbook development.
- Build and maintain a strong engineering culture based on ownership, collaboration, and continuous learning.
- Define and track key reliability metrics, and report on team performance and system health to leadership.
- Contribute to hiring, onboarding, and career development for SREs.
What you need to bring:
- 7–10 years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles.
- At least 2 years of experience managing or leading cloud operations teams.
- Deep understanding of cloud platforms (AWS, GCP, or Azure) and cloud-native architectures.
- Hands-on experience with Kubernetes, containers, infrastructure as code (e.g., Terraform), and configuration management tools.
- Strong foundation in observability (monitoring, logging, tracing), automation using Python, and incident response.
- Familiarity with modern CI/CD automation and tools.
- Excellent communication, stakeholder management, and team-building skills.
- Experience scaling SRE practices in high-growth or large-scale environments.
- Ability to balance long-term reliability initiatives with short-term delivery needs.