DevOps Engineer | Engineering Team Manager, Vice President

BlackRock

Mumbai, India | 10 Years Exp | Posted 23d ago

Job Description

Key Responsibilities

1) Own reliability outcomes

  • Deliver measurable improvements in availability, latency, recovery time, and incident recurrence for critical workloads.
  • Establish and mature SRE practices: SLIs/SLOs, error budgets, and reliability governance.

2) Build AI‑enabled, signal‑driven operations

  • Replace alert volume with AI‑correlated signals that prioritize true business impact and early risk indicators.
  • Implement and improve detection, correlation, and routing workflows integrated into operational processes and tooling.

3) Engineer self‑healing systems (automation-first)

  • Design, implement, and govern automated remediation for known failure patterns (with safe guardrails and audit trails).
  • Maintain structured human oversight for novel scenarios and ensure continuous learning feeds back into automation.

4) Embed reliability and operability into engineering lifecycle

  • Partner with Engineering/Architecture/Product to build operability, observability, and resilience-by-design into services.
  • Drive root-cause elimination and reduce recurrence through systemic fixes, not repetitive recovery.

5) Change-aware resilience + risk posture

  • Own change-aware operations: use AI risk signals to anticipate failures based on historical data, dependency graphs, and weak points.
  • Support production readiness: capacity planning, disaster recovery exercises, and disciplined change governance.

6) Evidence, auditability, and resilience expectations

  • Ensure AI-driven systems are observable, explainable, and auditable, meeting operational and regulatory expectations.
  • Develop and lead a high-performing global team, fostering strong ownership, technical depth, and a culture of accountability and continuous improvement.

Qualifications / Competencies

  • Bachelor’s degree in computer science/engineering (or equivalent practical experience).
  • 10+ years across Service Management, DevOps, SRE, Product Engineering, and/or large-scale production operations.
  • Strong hands-on experience with observability/monitoring/telemetry platforms, focused on actionable insights and reliability outcomes.
  • Proven experience transitioning environments from reactive support to proactive, signal-driven / AI-assisted operations.
  • Experience designing, tuning, and governing automation and AIOps workflows that enable automated remediation while retaining structured human oversight for exceptions.
  • Experience implementing change-aware operations, drift detection/correction, and data-driven reliability governance to reduce incident recurrence.
  • Experience with AI-assisted capacity forecasting and proactive scaling for performance predictability and cost efficiency.
  • End-to-end operational fluency: telemetry → ITSM integration → automated execution.
  • Experience sponsoring or governing AI-assisted/autonomous operational platforms.
