Site Reliability Engineer II

greenhouse

Remote, India | 5 Years Exp | Posted 39d ago

Job Description

  • Collaborate closely with Developers, QA, and Product teams during sprint planning to understand release plans, dependencies, and infrastructure requirements.
  • Participate in the application release cycle, ensuring deployments are automated, consistent, and reliable.
  • Manage and operate Kubernetes clusters in Google Kubernetes Engine (GKE) and Amazon Elastic Kubernetes Service (EKS).
  • Develop and manage Terraform modules for provisioning and configuring cloud infrastructure across GCP and AWS.
  • Standardize service deployments using Helm for templating and versioned releases.
  • Build and enhance observability with Prometheus, Grafana, and Datadog to monitor application and platform performance.
  • Design, implement, and maintain GitLab CI/CD pipelines for build, test, and deployment automation.
  • Drive an automation-first culture by developing scripts and tooling in Python, Go, or Shell to minimize manual effort and improve efficiency.
  • Participate in a 24/7 on-call rotation, ensuring quick detection, mitigation, and resolution of incidents.
  • Perform root cause analysis (RCA) and contribute to post-incident reviews to prevent recurrence.
  • Proactively identify reliability or scalability gaps, raise early warnings, and partner with teams to address systemic risks.
