Site Reliability Engineer
spglobal
Job Description
Key Responsibilities
1. Observability & Proactive System Health
- Design, build, and maintain a comprehensive observability platform using tools like Splunk and OpenTelemetry to provide deep insights into system health and performance.
- Leverage AIOps principles and platforms to enhance anomaly detection, automate event correlation, and enable predictive alerting, reducing mean time to detection (MTTD).
- Develop and manage robust alerting strategies and SLO-based dashboards to ensure critical issues are addressed before they impact customers.
- Foster a data-driven culture by giving engineering teams the visibility they need to understand the impact of their code in production.
2. Reliability & Resilience Engineering
- Design, implement, and conduct Chaos Engineering experiments to proactively identify and remediate system weaknesses, architectural flaws, and potential cascading failures.
- Partner with software engineering teams throughout the application lifecycle to architect for high availability, disaster recovery, and fault tolerance.
- Define, measure, and evangelize Service Level Indicators (SLIs) and Service Level Objectives (SLOs), and manage the associated error budgets to balance reliability with feature velocity.
- Analyze incidents and lead blameless post-mortems, ensuring that root causes are addressed and preventative measures are implemented to avoid recurrence.
3. Performance & Efficiency Optimization
- Analyze performance metrics and distributed traces to identify and resolve latency bottlenecks across our infrastructure and applications.
- Implement cost optimization (FinOps) strategies by identifying and eliminating resource waste, optimizing cloud service usage, and promoting efficient architecture patterns.
- Work with development teams to conduct performance testing and ensure new features do not introduce performance regressions.
4. Automation & Platform Engineering
- Identify and aggressively automate manual operational tasks (toil) by developing scripts, tools, and self-healing systems.
- Enhance and maintain our Infrastructure as Code (IaC) modules, promoting reusable patterns and best practices with Terraform.
- Improve and secure CI/CD pipelines (e.g., GitHub Actions, Azure DevOps) to enable safe, automated, and rapid deployment and rollback procedures.
Requirements and Qualifications
Core Technical Skills
- Experience: 4+ years in a Site Reliability, DevOps, or Cloud Engineering role, with demonstrable experience in a large-scale production environment.
- Cloud Proficiency: Deep experience with AWS services (EKS, ECS, EC2, S3, RDS, Lambda) and with managing production workloads in the cloud.
- Observability: Proficient in application observability, monitoring, and logging. Hands-on experience with tools like Splunk, OpenTelemetry, Prometheus, Grafana, or Datadog is essential.
- Infrastructure as Code (IaC): Strong experience with Terraform for provisioning and managing cloud infrastructure.
- Containerization: Solid understanding of containerization technology, particularly with managed services like EKS or ECS.
- CI/CD: Experience building and maintaining CI/CD pipelines using tools like GitHub Actions, Azure DevOps, or Jenkins.
- Scripting & Automation: Strong scripting skills in languages like Python, Bash, or PowerShell for automation and tooling. Familiarity with a higher-level language such as C# (.NET) is a plus.
- Modern Practices: Experience with, or a demonstrated understanding of, AIOps concepts and Chaos Engineering principles and tools (e.g., Gremlin, AWS Fault Injection Simulator).
Professional Attributes
- SRE Mindset: A true understanding of Site Reliability Engineering principles, including SLOs, error