Senior Site Reliability Engineer
Veeam
Job Description
Reliability Engineering & Resilience
- Design and evolve infrastructure to be highly available, fault tolerant, and scalable across public clouds (initially Azure, with future expansion plans to other providers).
- Establish and maintain SLIs, SLOs, and error budgets that define and enforce reliability objectives.
- Lead incident response, analysis, blameless postmortems, and sharing sessions to maximize learning across the entire engineering team and drive changes to the entire socio-technical engineering system.
Observability & Operational Excellence
- Drive adoption of deep observability practices, ensuring telemetry, logs, metrics, and tracing are comprehensive and actionable.
- Develop automation and self-healing tools to reduce toil and support Veeam’s fleet management strategy.
- Participate in on-call rotations and lead operational excellence across the stack.
Engineering at Scale
- Contribute to infrastructure as code (IaC), CI/CD systems, deployment automation, and scalable config management.
- Integrate and extend monitoring and chaos engineering tools to validate reliability assumptions under load and failure conditions.
- Implement testing strategies, canary deployments, and release validation pipelines to protect production environments and allow teams to safely deliver new features as quickly as possible.
Collaboration & Culture
- Embed within product and platform teams to champion reliability from design through delivery.
- Contribute to a learning culture focused on continuous improvement and proactive risk management.
- Mentor engineers and advocate for DevOps/SRE best practices across global teams.
What we expect from you:
- 5+ years of hands-on experience in a Software Engineering role with at least 2 years in Site Reliability, Platform Engineering, or similar.
- Deep experience building systems on public cloud providers (Azure preferred).
- Strong programming skills in JS, Node, TypeScript, Go, Java, C#, or similar.
- Proven track record in delivering monitoring, alerting, and observability tooling (e.g., Prometheus, Grafana, OpenTelemetry).
- Experience with IaC tools like Terraform/Pulumi, and container orchestration (e.g., Kubernetes).
- Solid understanding of distributed systems, cloud networking, and cloud-native system design.
- Excellent communication and collaboration skills across geographies and disciplines.
Will be an added advantage:
- Experience working on large-scale B2B SaaS platforms.
- Background in chaos engineering, resilience testing, performance testing, load testing, or incident learning programs.
- Familiarity with compliance frameworks (e.g., ISO, SOC 2, GDPR, FedRAMP/CMMC).