DevOps Engineer | Engineering Team Manager, Vice President
BlackRock
Job Description
Key Responsibilities
1) Own reliability outcomes
- Deliver measurable improvements in availability, latency, recovery time, and incident recurrence for critical workloads.
- Establish and mature SRE practices: SLIs/SLOs, error budgets, and reliability governance.
2) Build AI‑enabled, signal‑driven operations
- Replace alert volume with AI‑correlated signals that prioritize true business impact and early risk indicators.
- Implement and improve detection, correlation, and routing workflows integrated into operational processes and tooling.
3) Engineer self‑healing systems (automation-first)
- Design, implement, and govern automated remediation for known failure patterns (with safe guardrails and audit trails).
- Maintain structured human oversight for novel scenarios and ensure continuous learning feeds back into automation.
4) Embed reliability and operability into engineering lifecycle
- Partner with Engineering/Architecture/Product to build operability, observability, and resilience-by-design into services.
- Drive root-cause elimination and reduce recurrence through systemic fixes, not repetitive recovery.
5) Change-aware resilience and risk posture
- Own change-aware operations: use AI risk signals to anticipate failures based on historical data, dependency graphs, and weak points.
- Support production readiness: capacity planning, disaster recovery exercises, and disciplined change governance.
6) Evidence, auditability, and resilience expectations
- Ensure AI-driven systems are observable, explainable, and auditable, meeting operational and regulatory expectations.
- Develop and lead a high-performing global team, fostering strong ownership, technical depth, and a culture of accountability and continuous improvement.
Qualifications / Competencies
- Bachelor’s degree in computer science/engineering (or equivalent practical experience).
- 10+ years across Service Management, DevOps, SRE, Product Engineering, and/or large-scale production operations.
- Strong hands-on experience with observability/monitoring/telemetry platforms, focused on actionable insights and reliability outcomes.
- Proven experience transitioning environments from reactive support to proactive, signal-driven / AI-assisted operations.
- Experience designing, tuning, and governing automation and AIOps workflows, enabling automated remediation while retaining structured human oversight for exceptions.
- Experience implementing change-aware operations, drift detection/correction, and data-driven reliability governance to reduce incident recurrence.
- Experience with AI-assisted capacity forecasting and proactive scaling for performance predictability and cost efficiency.
- End-to-end operational fluency: telemetry → ITSM integration → automated execution.
- Experience sponsoring or governing AI-assisted/autonomous operational platforms.