Site Reliability Engineer - VP
Citi
Job Description
Key Responsibilities:
- Deliver against the observability roadmap for Services Technology by building scalable, reusable telemetry solutions.
- Create and maintain dashboards and visualizations for critical client journeys, including real-time flows across Payments.
- Guide line-of-business teams in implementing SLIs/SLOs, golden signals, and effective alerting to support operational excellence.
- Support integration and adoption of observability tooling across on-prem, public cloud (AWS/GCP), and containerized environments (ECS, Kubernetes).
- Customize shared dashboards and observability components in partnership with CTI and other central Engineering functions, ensuring usability and flexibility.
- Provide technical support and implementation guidance to SREs and developers facing integration or tooling challenges.
- Effectively manage the observability book of work for Services Technology and drive initiatives to reduce MTTD and improve recovery outcomes.
- Serve as a key connection point between line-of-business SREs and central infrastructure functions by gathering tooling feedback, surfacing systemic issues, and influencing platform enhancements via the Services Observability Forum.
- Stay current with observability trends, including AI/ML-driven insights, anomaly detection, and emerging OSS practices, and assess their applicability.
- Maintain strong knowledge of observability platform features and vendor offerings to advise teams and maximize the value of tooling investments.
Qualifications:
- 10+ years of experience in SRE, Observability Engineering, or platform infrastructure roles focused on operational telemetry.
- Hands-on experience with observability tools and stacks such as Grafana, Prometheus, OpenTelemetry, ELK, Splunk, and similar platforms.
- Deep understanding of SLIs, SLOs, error budgets, and telemetry best practices in high-availability environments.
- Proven ability to troubleshoot integration issues and support observability across hybrid platforms (on-prem, cloud, containers).
- Experience building dashboards aligned to business outcomes and incident workflows, especially in critical flows like payments.
- Familiarity with modern observability tooling ecosystems, including AI/ML capabilities, trace correlation, baselining, and alert tuning.
- Strong interpersonal and collaboration skills, with the ability to operate across federated engineering teams and central infrastructure groups.
- Experience in enablement or platform teams with a track record of scaling best practices across diverse business units.
Education:
- Bachelor’s degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.