Senior Site Reliability Engineer

Veeam

Bengaluru 5 Years Exp Posted 253d ago

Job Description

Reliability Engineering & Resilience

  • Design and evolve infrastructure to be highly available, fault tolerant, and scalable across public clouds (initially Azure, with planned expansion to other providers).

  • Establish and maintain SLIs, SLOs, and error budgets that define and enforce reliability objectives.

  • Lead incident response, analysis, blameless postmortems, and sharing sessions to maximize learning across the engineering team and drive changes to the entire socio-technical engineering system.

Observability & Operational Excellence

  • Drive adoption of deep observability practices, ensuring telemetry, logs, metrics, and tracing are comprehensive and actionable.

  • Develop automation and self-healing tools to reduce toil and support Veeam’s fleet management strategy.

  • Participate in on-call rotations and lead operational excellence across the stack.

Engineering at Scale

  • Contribute to infrastructure as code (IaC), CI/CD systems, deployment automation, and scalable config management.

  • Integrate and extend monitoring and chaos engineering tools to validate reliability assumptions under load and failure conditions.

  • Implement testing strategies, canary deployments, and release validation pipelines to protect production environments and allow teams to safely deliver new features as quickly as possible.

Collaboration & Culture

  • Embed within product and platform teams to champion reliability from design through delivery.

  • Contribute to a learning culture focused on continuous improvement and proactive risk management.

  • Mentor engineers and advocate for DevOps/SRE best practices across global teams.

What we expect from you:

  • 5+ years of hands-on experience in a Software Engineering role, with at least 2 years in Site Reliability, Platform Engineering, or a similar discipline.

  • Deep experience building systems on public cloud providers (Azure preferred).

  • Strong programming skills in JavaScript/Node.js, TypeScript, Go, Java, C#, or similar.

  • Proven track record in delivering monitoring, alerting, and observability tooling (e.g., Prometheus, Grafana, OpenTelemetry).

  • Experience with IaC tools like Terraform/Pulumi, and container orchestration (e.g., Kubernetes).

  • Solid understanding of distributed systems, cloud networking, and cloud-native system design.

  • Excellent communication and collaboration skills across geographies and disciplines.

The following will be an added advantage:

  • Experience working on large-scale B2B SaaS platforms.

  • Background in chaos engineering, resilience testing, performance testing, load testing, or incident learning programs.

  • Familiarity with compliance frameworks (e.g., ISO, SOC 2, GDPR, FedRAMP/CMMC).
