Site Reliability Engineer

spglobal

Gurgaon • 4 Years Exp • Posted 280d ago

Job Description

Key Responsibilities

1. Observability & Proactive System Health

Design, build, and maintain a comprehensive observability platform using tools like Splunk and OpenTelemetry to provide deep insights into system health and performance.

Leverage AIOps principles and platforms to enhance anomaly detection, automate event correlation, and enable predictive alerting, reducing mean time to detection (MTTD).

Develop and manage robust alerting strategies and SLO-based dashboards to ensure critical issues are addressed before they impact customers.

Foster a data-driven culture by giving engineering teams the visibility they need to understand the impact of their code in production.

2. Reliability & Resilience Engineering

Design, implement, and conduct Chaos Engineering experiments to proactively identify and remediate system weaknesses, architectural flaws, and potential cascading failures.

Partner with software engineering teams throughout the application lifecycle to architect for high availability, disaster recovery, and fault tolerance.

Define, measure, and evangelize Service Level Indicators (SLIs) and Service Level Objectives (SLOs), and manage the associated error budgets to balance reliability with feature velocity.

Analyze incidents and lead blameless post-mortems, ensuring that root causes are addressed and preventative measures are implemented to avoid recurrence.

3. Performance & Efficiency Optimization

Analyze performance metrics and distributed traces to identify and resolve latency bottlenecks across our infrastructure and applications.

Implement cost optimization (FinOps) strategies by identifying and eliminating resource waste, optimizing cloud service usage, and promoting efficient architecture patterns.

Work with development teams to conduct performance testing and ensure new features do not introduce performance regressions.

4. Automation & Platform Engineering

Identify and aggressively automate manual operational tasks (toil) by developing scripts, tools, and self-healing systems.

Enhance and maintain our Infrastructure as Code (IaC) modules, promoting reusable patterns and best practices with Terraform.

Improve and secure CI/CD pipelines (e.g., GitHub Actions, Azure DevOps) to enable safe, automated, and rapid deployment and rollback procedures.

Requirements and Qualifications

Core Technical Skills

Experience: 4+ years in a Site Reliability, DevOps, or Cloud Engineering role, with demonstrable experience in a large-scale production environment.

Cloud Proficiency: Deep experience with AWS services (EKS, ECS, EC2, S3, RDS, Lambda) and managing production workloads in the cloud.

Observability: Proficient in application observability, monitoring, and logging. Hands-on experience with tools like Splunk, OpenTelemetry, Prometheus, Grafana, or Datadog is essential.

Infrastructure as Code (IaC): Strong experience with Terraform for provisioning and managing cloud infrastructure.

Containerization: Solid understanding of containerization technology, particularly with managed services like EKS or ECS.

CI/CD: Experience building and maintaining CI/CD pipelines using tools like GitHub Actions, Azure DevOps, or Jenkins.

Scripting & Automation: Strong scripting skills in languages like Python, Bash, or PowerShell for automation and tooling. Familiarity with a higher-level language such as C# (.NET) is a plus.

Modern Practices: Experience with, or a demonstrated understanding of, AIOps concepts and Chaos Engineering principles and tools (e.g., Gremlin, AWS Fault Injection Simulator).

Professional Attributes

SRE Mindset: A true understanding of Site Reliability Engineering principles, including SLOs, error
