Site Reliability Engineer - VP

Citi

Location: Pune | Experience: 10 years | Posted: 412d ago

Key Responsibilities:

Deliver against the observability roadmap for Services Technology by building scalable, reusable telemetry solutions.
Create and maintain dashboards and visualizations for critical client journeys, including real-time flows across Payments.
Guide line-of-business teams in implementing SLIs/SLOs, golden signals, and effective alerting to support operational excellence.
Support integration and adoption of observability tooling across on-prem, public cloud (AWS/GCP), and containerized environments (ECS, Kubernetes).
Customize shared dashboards and observability components in partnership with CTI and other central Engineering functions, ensuring usability and flexibility.
Provide technical support and implementation guidance to SREs and developers facing integration or tooling challenges.
Manage the observability book of work for Services Technology and drive initiatives to reduce mean time to detect (MTTD) and improve recovery outcomes.
Serve as a key connection point between line-of-business SREs and central infrastructure functions by gathering tooling feedback, surfacing systemic issues, and influencing platform enhancements via the Services Observability Forum.
Stay current with observability trends, including AI/ML-driven insights, anomaly detection, and emerging OSS practices, and assess their applicability.
Maintain strong knowledge of observability platform features and vendor offerings to advise teams and maximize the value of tooling investments.

Qualifications:

10+ years of experience in SRE, Observability Engineering, or platform infrastructure roles focused on operational telemetry.
Hands-on experience with observability tools and stacks such as Grafana, Prometheus, OpenTelemetry, ELK, Splunk, and similar platforms.
Deep understanding of SLIs, SLOs, Error Budgets, and telemetry best practices in high-availability environments.
Proven ability to troubleshoot integration issues and support observability across hybrid platforms (on-prem, cloud, containers).
Experience building dashboards aligned to business outcomes and incident workflows, especially in critical flows like payments.
Familiarity with modern observability tooling ecosystems, including AI/ML capabilities, trace correlation, baselining, and alert tuning.
Strong interpersonal and collaboration skills, with the ability to operate across federated engineering teams and central infrastructure groups.
Experience in enablement or platform teams with a track record of scaling best practices across diverse business units.

Education:

Bachelor’s degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.