Site Reliability Engineer II
Job Description
- Collaborate closely with development, QA, and product teams during sprint planning to understand release plans, dependencies, and infrastructure requirements.
- Participate in the application release cycle, ensuring deployments are automated, consistent, and reliable.
- Manage and operate Kubernetes clusters in Google Kubernetes Engine (GKE) and Amazon Elastic Kubernetes Service (EKS).
- Develop and manage Terraform modules for provisioning and configuring cloud infrastructure across GCP and AWS.
- Standardize service deployments using Helm for templating and versioned releases.
- Build and enhance observability with Prometheus, Grafana, and Datadog to monitor application and platform performance.
- Design, implement, and maintain GitLab CI/CD pipelines for build, test, and deployment automation.
- Drive an automation-first culture by developing scripts and tooling in Python, Go, or shell to minimize manual effort and improve efficiency.
- Participate in a 24/7 on-call rotation, ensuring quick detection, mitigation, and resolution of incidents.
- Perform root cause analysis (RCA) and contribute to post-incident reviews to prevent recurrence.
- Proactively identify reliability or scalability gaps, raise early warnings, and partner with teams to address systemic risks.