DevOps Engineer | Engineering Team Manager, Vice President

BlackRock

Mumbai, India | 10 Years Exp | Posted 72d ago

Key Responsibilities

Deliver measurable improvements in availability, latency, recovery time, and incident recurrence for critical workloads.
Establish and mature SRE practices: SLIs/SLOs, error budgets, and reliability governance.

Replace alert volume with AI-correlated signals that prioritize true business impact and early risk indicators.
Implement and improve detection, correlation, and routing workflows integrated into operational processes and tooling.

Design, implement, and govern automated remediation for known failure patterns (with safe guardrails and audit trails).
Maintain structured human oversight for novel scenarios and ensure continuous learning feeds back into automation.

Partner with Engineering/Architecture/Product to build operability, observability, and resilience-by-design into services.
Drive root-cause elimination and reduce recurrence through systemic fixes, not repetitive recovery.

Own change-aware operations: use AI risk signals to anticipate failures based on historical data, dependency graphs, and weak points.
Support production readiness: capacity planning, disaster recovery exercises, and disciplined change governance.

Ensure AI-driven systems are observable, explainable, and auditable, meeting operational and regulatory expectations.
Develop and lead a high-performing global team, fostering strong ownership, technical depth, and a culture of accountability and continuous improvement.

Qualifications / Competencies

Bachelor’s degree in computer science/engineering (or equivalent practical experience).
10+ years across Service Management, DevOps, SRE, Product Engineering, and/or large-scale production operations.
Strong hands-on experience with observability/monitoring/telemetry platforms, focused on actionable insights and reliability outcomes.
Proven experience transitioning environments from reactive support to proactive, signal-driven / AI-assisted operations.
Experience designing, tuning, and governing automation and AIOps workflows, enabling automated remediation while retaining structured human oversight for exceptions.
Experience implementing change-aware operations, drift detection/correction, and data-driven reliability governance to reduce incident recurrence.
Experience with AI-assisted capacity forecasting and proactive scaling for performance predictability and cost efficiency.
End-to-end operational fluency: telemetry → ITSM integration → automated execution.
Experience sponsoring or governing AI-assisted/autonomous operational platforms.