Site Reliability Engineer

spglobal

Gurgaon | 4 Years Exp | Posted 228d ago

Job Description

Key Responsibilities

1. Observability & Proactive System Health 

  • Design, build, and maintain a comprehensive observability platform using tools like Splunk and OpenTelemetry to provide deep insights into system health and performance. 

  • Leverage AIOps principles and platforms to enhance anomaly detection, automate event correlation, and enable predictive alerting, reducing mean time to detection (MTTD). 

  • Develop and manage robust alerting strategies and SLO-based dashboards to ensure critical issues are addressed before they impact customers. 

  • Foster a data-driven culture by giving engineering teams the visibility they need to understand the impact of their code in production.

2. Reliability & Resilience Engineering 

  • Design, implement, and conduct Chaos Engineering experiments to proactively identify and remediate system weaknesses, architectural flaws, and potential cascading failures. 

  • Partner with software engineering teams throughout the application lifecycle to architect for high availability, disaster recovery, and fault tolerance. 

  • Define, measure, and evangelize Service Level Indicators (SLIs) and Service Level Objectives (SLOs), and manage the associated error budgets to balance reliability with feature velocity. 

  • Analyze incidents and lead blameless post-mortems, ensuring that root causes are addressed and preventative measures are implemented to avoid recurrence.

3. Performance & Efficiency Optimization 

  • Analyze performance metrics and distributed traces to identify and resolve latency bottlenecks across our infrastructure and applications. 

  • Implement cost optimization (FinOps) strategies by identifying and eliminating resource waste, optimizing cloud service usage, and promoting efficient architecture patterns. 

  • Work with development teams to conduct performance testing and ensure new features do not introduce performance regressions. 

4. Automation & Platform Engineering 

  • Identify and aggressively automate manual operational tasks (toil) by developing scripts, tools, and self-healing systems. 

  • Enhance and maintain our Infrastructure as Code (IaC) modules, promoting reusable patterns and best practices with Terraform. 

  • Improve and secure CI/CD pipelines (e.g., GitHub Actions, Azure DevOps) to enable safe, automated, and rapid deployment and rollback procedures.

Requirements and Qualifications 

Core Technical Skills 

  • Experience: 4+ years in a Site Reliability, DevOps, or Cloud Engineering role, with demonstrable experience in a large-scale production environment. 

  • Cloud Proficiency: Deep experience with AWS services (EKS, ECS, EC2, S3, RDS, Lambda) and managing production workloads in the cloud. 

  • Observability: Proficient in application observability, monitoring, and logging. Hands-on experience with tools like Splunk, OpenTelemetry, Prometheus, Grafana, or Datadog is essential. 

  • Infrastructure as Code (IaC): Strong experience with Terraform for provisioning and managing cloud infrastructure. 

  • Containerization: Solid understanding of containerization technology, particularly with managed services like EKS or ECS.

  • CI/CD: Experience building and maintaining CI/CD pipelines using tools like GitHub Actions, Azure DevOps, or Jenkins. 

  • Scripting & Automation: Strong scripting skills in languages like Python, Bash, or PowerShell for automation and tooling. Familiarity with a higher-level language such as C# (.NET) is a plus. 

  • Modern Practices: Experience with, or a demonstrated understanding of, AIOps concepts and Chaos Engineering principles and tools (e.g., Gremlin, AWS Fault Injection Simulator).

Professional Attributes 

  • SRE Mindset: A true understanding of Site Reliability Engineering principles, including SLOs, error
