Staff Site Reliability Engineer
Veeam
Job Description
Reliability Engineering & Resilience:
- Act as a technical authority in your area, mentoring senior engineers and guiding design choices that improve service reliability and resilience
- Lead the definition and enforcement of SLIs, SLOs, and error budgets; drive adherence across engineering teams
- Collaborate with Staff peers across teams to align strategy and champion shared reliability standards and goals
- Partner with development and product teams to proactively design for failure, build resilient architecture, and operationalize reliability from the start
Observability & Operational Excellence:
- Drive company-wide adoption of observability best practices and tooling
- Ensure metrics, logs, and traces provide deep, actionable insights across systems
- Lead complex incident responses, postmortems, and systemic reliability improvements
- Promote and enforce a blameless culture of learning and continuous improvement
Engineering at Scale:
- Lead initiatives in infrastructure as code, deployment automation, and resilience testing
- Influence the development and adoption of chaos engineering practices and release validation frameworks
- Partner with platform and security teams to ensure production readiness
Collaboration & Culture:
- Work closely with your peer Staff Engineers to plan, align, and deliver against reliability goals
- Provide architectural guidance and advocate for engineering rigor and consistency
- Represent the SRE team in technical leadership forums and product planning discussions
What we expect from you:
- 8+ years of experience in a Software Engineering or SRE role, including technical leadership
- Demonstrated experience mentoring and guiding senior engineers
- Deep expertise in building distributed systems on public cloud (Azure preferred)
- Strong programming skills (e.g., JavaScript, Go, TypeScript, Java, or C#)
- Hands-on experience with observability tooling (e.g., Prometheus, Grafana, OpenTelemetry)
- Mastery of infrastructure automation tools (Terraform, Pulumi) and container orchestration (Kubernetes)
- Ability to communicate clearly across geographies and disciplines
Nice to have:
- Experience leading SRE initiatives across multiple product teams
- Background in chaos engineering, incident learning, or performance and load testing
- Familiarity with global compliance standards (ISO, SOC 2, GDPR, FedRAMP, CMMC)
We offer:
- Family Medical Insurance
- Annual flexible spending allowance for health and well-being
- Life insurance
- Personal accident insurance
- Employee Assistance Program
- A comprehensive leave package, including parental leave
- Meal Benefit Pass
- Transportation Allowance
- Daycare/Childcare Allowance
- Veeam Care Days: an additional 24 hours for volunteering activities
- Professional training and education, including courses, workshops, and internal meetups; unlimited access to our online learning platforms (Percipio, Athena, O’Reilly); and mentoring through our MentorLab program