Senior SRE Engineer I
DigitalOcean
Job Description
What You’ll Do:
- Design, automate, and maintain scalable, reliable infrastructure.
- Implement monitoring, alerting, and incident response processes.
- Improve system performance, capacity planning, and cost efficiency.
- Automate deployments and CI/CD pipelines, and manage infrastructure as code.
- Troubleshoot production issues, conduct root cause analysis, and improve system resilience.
- Collaborate with developers to enhance reliability and performance.
Key Metrics:
- Service Uptime (SLA/SLO adherence) – Ensuring high availability and minimal downtime.
- MTTR (Mean Time to Recovery) – Reducing the time taken to resolve incidents.
- MTTD (Mean Time to Detect) – Minimizing the time to identify issues.
- Change Failure Rate – Measuring the percentage of failed deployments.
- Incident Frequency & Severity – Tracking recurring issues and their impact.
- Latency & Performance Metrics – Ensuring optimal response times.
- Automation Coverage – Percentage of manual processes replaced by automation.
What You'll Add to DigitalOcean:
- Experience: 3+ years in SRE, DevOps, or related roles.
- Cloud Expertise: Hands-on experience with AWS, GCP, or other cloud platforms.
- Automation & Infrastructure as Code: Proficiency in Terraform, Ansible, or similar tools.
- Monitoring & Observability: Familiarity with Prometheus, Grafana, Datadog, or similar tools.
- Containerization & Orchestration: Experience with Kubernetes, Docker, or related technologies.
- Programming Skills: Proficiency in Python, Go, or Bash scripting.
- Incident Management: Strong problem-solving skills with a focus on root cause