Sr Site Reliability Engineer
Job Description
Key Responsibilities
Execution & CoE Alignment
· Implement SRE frameworks, best practices, and playbooks provided by the CoE.
· Act as a hands-on engineer, contributing to observability, reliability, and incident response initiatives.
· Partner with senior SREs and leadership to maintain consistency in monitoring and incident processes.
· Contribute to automation projects that improve reliability and reduce manual work.
Observability & Monitoring
· Build and maintain monitoring solutions with New Relic, Datadog, Prometheus, Grafana, CloudWatch, OpenTelemetry, and Graylog.
· Create and refine dashboards, metrics, and alerts for proactive anomaly detection.
· Extend observability coverage across infrastructure, applications, APIs, and databases.
Reliability Engineering & Automation
· Implement SLIs, SLOs, SLAs, and error budgets in partnership with product and platform teams.
· Contribute to reducing MTTD and MTTR through improved instrumentation and automation.
· Participate in capacity planning, resiliency testing, and scaling reviews.
· Support chaos engineering and reliability validation activities.
Incident & Problem Management
· Participate in incident response, including on-call rotations for 24x7 coverage.
· Assist with root cause analysis (RCA) and implement corrective actions.
· Ensure alignment with ITSM processes for incident, problem, and change management.
· Contribute to playbooks and runbooks to strengthen on-call readiness.
Collaboration & Knowledge Sharing
· Collaborate with Engineering, Product, Security, Cloud, and DevSecOps teams to embed reliability practices.
· Provide input on instrumentation, monitoring hooks, and operational readiness for services.
· Work with DBAs and platform teams on database observability and performance optimization.
· Share knowledge within the SRE team and adopt practices from Staff and Principal SREs.
Qualifications & Experience
Required
· 7+ years in SRE, Operations, or Infrastructure Engineering.
· Strong hands-on experience with monitoring and observability platforms.
· Experience with tools such as New Relic, Datadog, Prometheus, Grafana, CloudWatch, OpenTelemetry, and Graylog.
· Proven experience in incident response, troubleshooting production issues, and improving MTTR/MTTD.
· Good knowledge of SLIs, SLOs, SLAs, and error budgets.
· Hands-on experience with AWS services (EC2, ECS, EKS, networking, Auto Scaling groups).
· Proficiency in containers & Kubernetes (Docker, EKS).
· Scripting or programming in Python, Go, or shell.
· Understanding of networking, distributed systems, and high-availability architectures.
· Exposure to ITIL/ITSM processes.
Preferred
· Experience in SaaS or healthcare environments.
· Knowledge of databases (MongoDB, Elasticsearch, SQL Server, Oracle).
· Familiarity with chaos engineering and resiliency testing.
· Certifications: AWS Solutions Architect / DevOps Engineer, CKA/CKAD.