Site Reliability Engineer
UPS
Job Description
Key Responsibilities:
- Design, develop, and maintain reliable, scalable, and highly available systems on GCP.
- Build and manage CI/CD pipelines, infrastructure as code (IaC), and monitoring solutions.
- Proactively monitor and manage system performance, uptime, and capacity using observability tools.
- Troubleshoot and resolve infrastructure and application-level issues in real time.
- Implement and maintain disaster recovery, failover mechanisms, and backup strategies.
- Automate repetitive tasks and processes to improve efficiency and reduce toil.
- Participate in on-call rotations, incident management, and root cause analysis (RCA).
- Ensure compliance with security standards, privacy regulations, and governance policies.
- Collaborate with cross-functional teams to support DevOps and SRE best practices.
- Drive improvements in SLAs, SLOs, and error budgets through data-driven insights.
Required Qualifications:
- 5–8 years of relevant experience as an SRE, DevOps Engineer, or Cloud Infrastructure Engineer.
- Strong hands-on experience with Google Cloud Platform (GCP), including Compute Engine, GKE, Cloud Functions, Cloud Storage, IAM, and BigQuery.
- Proficiency in Infrastructure as Code tools like Terraform, Deployment Manager, or CloudFormation.
- Experience with Kubernetes, Docker, and container orchestration.
- Proficiency in scripting languages like Python, Shell, or Go.
- Deep understanding of monitoring and logging tools such as Prometheus, Grafana, Cloud Monitoring and Cloud Logging (formerly Stackdriver), or Datadog.
- Knowledge of CI/CD tools such as Jenkins, GitLab CI, or Cloud Build.
- Experience with incident response, postmortem analysis, and site reliability principles.
- Strong problem-solving and communication skills.