Senior Site Reliability Engineer

Veeam

Bengaluru | 5 Years Exp | Posted 305d ago

Reliability Engineering & Resilience

Design and evolve infrastructure to be highly available, fault tolerant, and scalable across public clouds (initially Azure, with plans to expand to other providers).
Establish and maintain SLIs, SLOs, and error budgets that define and enforce reliability objectives.
Lead incident response, analysis, blameless postmortems, and sharing sessions to maximize learning across the entire engineering team and drive changes to the entire socio-technical engineering system.

Observability & Operational Excellence

Drive adoption of deep observability practices, ensuring telemetry, logs, metrics, and tracing are comprehensive and actionable.
Develop automation and self-healing tools to reduce toil and support Veeam’s fleet management strategy.
Participate in on-call rotations and lead operational excellence across the stack.

Engineering at Scale

Contribute to infrastructure as code (IaC), CI/CD systems, deployment automation, and scalable config management.
Integrate and extend monitoring and chaos engineering tools to validate reliability assumptions under load and failure conditions.
Implement testing strategies, canary deployments, and release validation pipelines to protect production environments and allow teams to safely deliver new features as quickly as possible.

Collaboration & Culture

Embed within product and platform teams to champion reliability from design through delivery.
Contribute to a learning culture focused on continuous improvement and proactive risk management.
Mentor engineers and advocate for DevOps/SRE best practices across global teams.

What we expect from you:

5+ years of hands-on experience in a software engineering role, with at least 2 years in Site Reliability, Platform Engineering, or a similar role.

Deep experience building systems on public cloud providers (Azure preferred).
Strong programming skills in JavaScript/TypeScript (Node.js), Go, Java, C#, or similar.
Proven track record in delivering monitoring, alerting, and observability tooling (e.g., Prometheus, Grafana, OpenTelemetry).
Experience with IaC tools like Terraform/Pulumi, and container orchestration (e.g., Kubernetes).
Solid understanding of distributed systems, cloud networking, and cloud-native system design.
Excellent communication and collaboration skills across geographies and disciplines.

What will be an added advantage:

Experience working on large-scale B2B SaaS platforms.
Background in chaos engineering, resilience testing, performance testing, load testing, or incident learning programs.
Familiarity with compliance frameworks (e.g., ISO, SOC 2, GDPR, FedRAMP/CMMC).