Senior SRE Engineer I

DigitalOcean

Hyderabad | 3 Years Exp | Posted 284d ago

What You’ll Do:

Design, automate, and maintain scalable, reliable infrastructure.
Implement monitoring, alerting, and incident response processes.
Optimize system performance, capacity planning, and cost efficiency.
Automate deployments, CI/CD pipelines, and infrastructure as code.
Troubleshoot production issues, conduct root cause analysis, and improve system resilience.
Collaborate with developers to enhance reliability and performance.

Key Metrics:

Service Uptime (SLA/SLO adherence) – Ensuring high availability and minimal downtime.
MTTR (Mean Time to Recovery) – Reducing the time taken to resolve incidents.
MTTD (Mean Time to Detect) – Minimizing the time to identify issues.
Change Failure Rate – Measuring the percentage of failed deployments.
Incident Frequency & Severity – Tracking recurring issues and their impact.
Latency & Performance Metrics – Ensuring optimal response times.
Automation Coverage – Percentage of manual processes replaced by automation.
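
For candidates less familiar with these terms, several of the metrics above reduce to simple arithmetic. The following is a minimal sketch in Python, not DigitalOcean's tooling: the `Incident` record, its field names, and the function names are all hypothetical. It also assumes MTTR is measured from detection to resolution; some teams measure it from the start of the fault instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record shape, chosen only for this illustration.
    started: datetime   # when the fault began
    detected: datetime  # when alerting (or a person) noticed it
    resolved: datetime  # when service was restored


def mean_time_to_detect(incidents):
    """MTTD: average gap between fault start and detection."""
    total = sum((i.detected - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mean_time_to_recovery(incidents):
    """MTTR: average gap between detection and resolution (assumed convention)."""
    total = sum((i.resolved - i.detected for i in incidents), timedelta())
    return total / len(incidents)


def change_failure_rate(failed_deployments, total_deployments):
    """Percentage of deployments that failed."""
    return 100 * failed_deployments / total_deployments


def availability(window, downtime):
    """Percentage of the window during which the service was up."""
    return 100 * (window - downtime) / window


if __name__ == "__main__":
    day = datetime(2024, 1, 1)
    incidents = [
        Incident(day.replace(hour=10), day.replace(hour=10, minute=5),
                 day.replace(hour=10, minute=35)),
        Incident(day.replace(hour=14), day.replace(hour=14, minute=15),
                 day.replace(hour=15, minute=15)),
    ]
    print("MTTD:", mean_time_to_detect(incidents))
    print("MTTR:", mean_time_to_recovery(incidents))
    print("Change failure rate (%):", change_failure_rate(3, 60))
    print("Availability (%):", availability(timedelta(days=30), timedelta(minutes=45)))
```

In practice these figures would come from an incident tracker and a deployment log rather than hand-built records; the point is only that each metric is a ratio or an average over well-defined timestamps.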

Requirements:

Experience: 3+ years in SRE, DevOps, or related roles.
Cloud Expertise: Hands-on experience with AWS, GCP, or other cloud platforms.
Automation & Infrastructure as Code: Proficiency in Terraform, Ansible, or similar tools.
Monitoring & Observability: Familiarity with Prometheus, Grafana, Datadog, or similar tools.
Containerization & Orchestration: Experience with Kubernetes, Docker, or related technologies.
Programming Skills: Proficiency in Python, Go, or Bash scripting.
Incident Management: Strong problem-solving skills with a focus on root cause