Sr Site Reliability Engineer


Hyderabad · 7 Years Exp · Posted 34d ago

Job Description

Key Responsibilities

Execution & CoE Alignment

· Implement SRE frameworks, best practices, and playbooks provided by the CoE.

· Act as a hands-on engineer, contributing to observability, reliability, and incident response initiatives.

· Partner with senior SREs and leadership to maintain consistency in monitoring and incident processes.

· Contribute to automation projects that improve reliability and reduce manual work.

 

Observability & Monitoring

· Build and maintain monitoring solutions with New Relic, Datadog, Prometheus, Grafana, CloudWatch, OpenTelemetry, and Graylog.

· Create and refine dashboards, metrics, and alerts for proactive anomaly detection.

· Extend observability coverage across infrastructure, applications, APIs, and databases.

 

Reliability Engineering & Automation

· Implement SLIs, SLOs, SLAs, and error budgets in partnership with product and platform teams.

· Contribute to reducing MTTD and MTTR through improved instrumentation and automation.

· Participate in capacity planning, resiliency testing, and scaling reviews.

· Support chaos engineering and reliability validation activities.

 

Incident & Problem Management

· Participate in incident response, including on-call rotations for 24x7 coverage.

· Assist with root cause analysis (RCA) and implement corrective actions.

· Ensure alignment with ITSM processes for incident, problem, and change management.

· Contribute to playbooks and runbooks to strengthen on-call readiness.

 

Collaboration & Knowledge Sharing

· Collaborate with Engineering, Product, Security, Cloud, and DevSecOps teams to embed reliability practices.

· Provide input on instrumentation, monitoring hooks, and operational readiness for services.

· Work with DBAs and platform teams on database observability and performance optimization.

· Share knowledge within the SRE team and adopt practices from Staff and Principal SREs.

 

Qualifications & Experience

Required

· 7+ years in SRE, Operations, or Infrastructure Engineering.

· Strong hands-on experience with monitoring and observability platforms.

· Experience with tools such as New Relic, Datadog, Prometheus, Grafana, CloudWatch, OpenTelemetry, and Graylog.

· Proven experience in incident response, troubleshooting production issues, and improving MTTR/MTTD.

· Good knowledge of SLIs, SLOs, SLAs, and error budgets.

· Hands-on experience with AWS services (EC2, ECS, EKS, networking, scaling groups).

· Proficiency in containers & Kubernetes (Docker, EKS).

· Scripting or programming in Python, Go, or shell.

· Understanding of networking, distributed systems, and high-availability architectures.

· Exposure to ITIL/ITSM processes.

 

Preferred

· Experience in SaaS or healthcare environments.

· Knowledge of databases (MongoDB, Elasticsearch, SQL Server, Oracle).

· Familiarity with chaos engineering and resiliency testing.

· Certifications: AWS Solutions Architect / DevOps Engineer, CKA.
