Senior SRE Engineer I

DigitalOcean

Hyderabad | 3 Years Exp | Posted 232d ago

Job Description

What You’ll Do:

  • Design, automate, and maintain scalable, reliable infrastructure.
  • Implement monitoring, alerting, and incident response processes.
  • Optimize system performance, capacity planning, and cost efficiency.
  • Automate deployments, CI/CD pipelines, and infrastructure as code.
  • Troubleshoot production issues, conduct root cause analysis, and improve system resilience.
  • Collaborate with developers to enhance reliability and performance.

Key Metrics:

  • Service Uptime (SLA/SLO adherence) – Ensuring high availability and minimal downtime.
  • MTTR (Mean Time to Recovery) – Reducing the time taken to resolve incidents.
  • MTTD (Mean Time to Detect) – Minimizing the time to identify issues.
  • Change Failure Rate – Measuring the percentage of failed deployments.
  • Incident Frequency & Severity – Tracking recurring issues and their impact.
  • Latency & Performance Metrics – Ensuring optimal response times.
  • Automation Coverage – Percentage of manual processes replaced by automation.

What You'll Add to DigitalOcean:

  • Experience: 3+ years in SRE, DevOps, or related roles.
  • Cloud Expertise: Hands-on experience with AWS, GCP, or other cloud platforms.
  • Automation & Infrastructure as Code: Proficiency in Terraform, Ansible, or similar tools.
  • Monitoring & Observability: Familiarity with Prometheus, Grafana, Datadog, or similar tools.
  • Containerization & Orchestration: Experience with Kubernetes, Docker, or related technologies.
  • Programming Skills: Proficiency in Python, Go, or Bash scripting.
  • Incident Management: Strong problem-solving skills with a focus on root cause analysis.
