Staff Site Reliability Engineer

Veeam

Pune | 8 years' experience

Reliability Engineering & Resilience:

Act as a technical authority in your area, mentoring senior engineers and guiding design choices that improve service reliability and resilience
Lead the definition and enforcement of SLIs, SLOs, and error budgets; drive adherence across engineering teams
Collaborate with Staff peers across teams to align strategy and champion shared reliability standards and goals
Partner with development and product teams to proactively design for failure, build resilient architecture, and operationalize reliability from the start

Observability & Operational Excellence:

Drive company-wide adoption of observability best practices and tooling
Ensure metrics, logs, and traces provide deep, actionable insights across systems
Lead complex incident responses, postmortems, and systemic reliability improvements
Promote and enforce a blameless culture of learning and continuous improvement

Engineering at Scale:

Lead initiatives in infrastructure as code, deployment automation, and resilience testing
Influence the development and adoption of chaos engineering practices and release validation frameworks
Partner with platform and security teams to ensure production readiness

Collaboration & Culture:

Work closely with your peer Staff Engineers to plan, align, and deliver against reliability goals
Provide architectural guidance and advocate for engineering rigor and consistency
Represent the SRE team in technical leadership forums and product planning discussions

What we expect from you:

8+ years of experience in a software engineering or SRE role, including technical leadership
Demonstrated experience mentoring and guiding senior engineers
Deep expertise in building distributed systems on public cloud (Azure preferred)
Strong programming skills (e.g., JavaScript, Go, TypeScript, Java, or C#)
Hands-on experience with observability tooling (e.g., Prometheus, Grafana, OpenTelemetry)
Mastery of infrastructure automation tools (Terraform, Pulumi) and container orchestration (Kubernetes)
Ability to communicate clearly across geographies and disciplines

Considered an advantage:

Experience leading SRE initiatives across multiple product teams
Background in chaos engineering, incident learning, or performance and load testing
Familiarity with global compliance standards (ISO, SOC 2, GDPR, FedRAMP, CMMC)

We offer:

Family Medical Insurance
Annual flexible spending allowance for health and well-being
Life insurance
Personal accident insurance
Employee Assistance Program
A comprehensive leave package, including parental leave
Meal Benefit Pass
Transportation Allowance
Daycare/Childcare Allowance
Veeam Care Days: an additional 24 hours for your volunteering activities
Professional training and education, including courses and workshops, internal meetups, and unlimited access to our online learning platforms (Percipio, Athena, O’Reilly) and mentoring through our MentorLab program