SRE

Virtusa

Pune 7 Years Exp Posted 409d ago

Job Description

GCP and private cloud operational support and administration activities such as provisioning, capacity management, reliability management, monitoring, and restoration.
Kubernetes cluster management, monitoring, and remediation. Knowledge of Docker is important.
Proficient in GCP, particularly GKE, IAM, Container Registry, Helm, routing services, Cloud Load Balancing, and Cloud SQL.
Automating deployments and scripting self-healing workflows based on telemetry.
Understand how to build observability into GCP-hosted applications to expose telemetry to observability tools via logs and metrics.
Define SLIs and configure SLOs, respond to threshold alerts, and optimize monitoring capability.
Work with code as well as configuration artifacts to debug and fix issues that may arise.
Knowledge of applying SRE practices to daily operations is key.
Capacity planning and forecasting.
Must be inclined to work on proof-of-concept solutions to optimize reliability, such as those incorporating AI models for event correlation and assisted triaging.
Ability to work in shifts in the office is mandatory; this is a 24x7 on-desk operation.
Ability to understand and manage clusters and ingress into GKE on GCP.
Knowledge of Terraform for IaC and automation is beneficial.
Disaster recovery, backups, and ensuring service continuity, especially during region failures.
Must be proficient with an ITSM tool like ServiceNow or Remedy.
