AI Operations

Allianz

Pune | 2 Years Exp | Posted 1h ago

Job Description

Monitoring and Observability

  • Design and implement comprehensive monitoring, alerting, and health‑check frameworks across the infrastructure, application, middleware, and AI/GenAI layers.
  • Build dashboards and SLO/SLA telemetry using Grafana, Dynatrace, Azure Monitor, Application Insights, Log Analytics, or equivalent.
  • Define key metrics (availability, latency, error rates, model drift, pipeline throughput) and set automated alerts and escalation paths.
  • Automate health checks and synthetic transactions for critical user journeys and model inference paths.

Upgrades, High Availability, and Roadmap

  • Lead platform and product upgrades using Active‑Active, Active‑Passive, blue/green, and canary deployment strategies.
  • Plan and own upgrade roadmaps in collaboration with Ops, GCC, Engineering, Product, and other stakeholders; coordinate maintenance windows and rollback plans.
  • Validate upgrades in pre‑prod and staging, ensure zero‑ or low‑downtime cutovers, and document upgrade runbooks.

Stability, Incident and Problem Management

  • Own the incident lifecycle from detection through resolution and root cause analysis (RCA); run incident response and post‑mortems.
  • Drive reliability engineering practices: capacity planning, performance tuning, chaos testing, and resilience patterns.
  • Implement automation for remediation, runbook execution, and incident mitigation to reduce MTTR.
  • Maintain SLAs and report availability and reliability metrics to stakeholders.

Enablement and Adoption

  • Deliver enablement sessions, workshops, and demos to internal teams and customers on how to use AI Automation products.
  • Create and maintain user manuals, quick start guides, runbooks, and FAQs tailored to operators, developers, and business users.
  • Act as SME for onboarding, troubleshooting, and best practices for GenAI/LLM usage and safe model operations.

Production Process Control and Documentation

  • Map and document production processes, data flows, deployment pipelines, and operational dependencies.
  • Create runbooks, SOPs, and playbooks for routine operations, change management, and emergency procedures.
  • Establish governance for change approvals, configuration management, and access controls.

Testing and Release Support

  • Contribute to pre‑prod testing: functional, integration, performance, load, and model validation tests.
  • Coordinate release readiness with QA, DevOps, and engineering; validate CI/CD pipelines and rollback mechanisms.
  • Support canary and staged rollouts, monitor metrics during releases, and authorize promotion to production.

Cross‑Functional Collaboration and Vendor Management

  • Work closely with Dev, SRE, Security, QA, and Product to prioritize reliability work and roadmap items.
  • Coordinate with cloud providers and third‑party vendors for escalations, upgrades, and capacity planning.
  • Communicate status and risks to leadership and stakeholders with clear, actionable reports.
