AI Operations
Allianz
Job Description
Monitoring and Observability
- Design and implement comprehensive monitoring, alerting, and health‑check frameworks across the infrastructure, application, middleware, and AI/GenAI layers.
- Build dashboards and SLO/SLA telemetry using Grafana, Dynatrace, Azure Monitor, Application Insights, Log Analytics, or equivalent.
- Define key metrics (availability, latency, error rates, model drift, pipeline throughput) and set automated alerts and escalation paths.
- Automate health checks and synthetic transactions for critical user journeys and model inference paths.
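As a rough illustration of the SLO telemetry and automated alerting described above, the following minimal sketch computes availability and the remaining error budget for one reporting window. The `SloWindow` type, the function names, and the 25% alert threshold are hypothetical choices made for this example, not part of any specific monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts for one reporting window (hypothetical structure)."""
    total_requests: int
    failed_requests: int
    slo_target: float  # e.g. 0.999 for "three nines"


def availability(window: SloWindow) -> float:
    """Fraction of requests that succeeded in the window."""
    if window.total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return 1.0 - window.failed_requests / window.total_requests


def error_budget_remaining(window: SloWindow) -> float:
    """Share of the error budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - window.slo_target) * window.total_requests
    if allowed_failures == 0:
        return 1.0 if window.failed_requests == 0 else 0.0
    return 1.0 - window.failed_requests / allowed_failures


def should_alert(window: SloWindow, threshold: float = 0.25) -> bool:
    """Alert once less than `threshold` of the budget is left (assumed policy)."""
    return error_budget_remaining(window) < threshold


if __name__ == "__main__":
    window = SloWindow(total_requests=100_000, failed_requests=50, slo_target=0.999)
    print(f"availability={availability(window):.4%}")
    print(f"budget remaining={error_budget_remaining(window):.1%}")
    print(f"alert={should_alert(window)}")
```

In practice the counts would come from whichever telemetry backend is in use, and the threshold would be tuned to the agreed escalation path.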
Upgrades, High Availability, and Roadmap
- Lead platform and product upgrades across Active‑Active and Active‑Passive high‑availability topologies, using blue/green and canary deployment strategies.
- Plan and own upgrade roadmaps in collaboration with Ops, GCC, Engineering, Product, and other stakeholders; coordinate maintenance windows and rollback plans.
- Validate upgrades in pre‑prod and staging, ensure zero‑ or low‑downtime cutovers, and document upgrade runbooks.
Stability, Incident and Problem Management
- Own the incident lifecycle from detection through resolution and root cause analysis (RCA); run incident response and post‑mortems.
- Drive reliability engineering practices: capacity planning, performance tuning, chaos testing, and resilience patterns.
- Implement automation for remediation, runbook execution, and incident mitigation to reduce MTTR.
- Maintain SLAs and report availability and reliability metrics to stakeholders.
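One way the MTTR figure mentioned above could be reported is sketched below. It assumes MTTR means the average time from detection to resolution, and that each incident is available as a pair of timestamps; the function name and input shape are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(
    incidents: list[tuple[datetime, datetime]],
) -> timedelta:
    """Average (resolved - detected) over a list of incidents.

    Each incident is a (detected_at, resolved_at) pair; this shape is an
    assumption made for the example.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so the sum works on timedelta objects
    )
    return total / len(incidents)


if __name__ == "__main__":
    sample = [
        (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 30)),
        (datetime(2024, 1, 12, 14, 0), datetime(2024, 1, 12, 15, 30)),
    ]
    print(mean_time_to_recovery(sample))
```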
Enablement and Adoption
- Deliver enablement sessions, workshops, and demos to internal teams and customers on how to use AI Automation products.
- Create and maintain user manuals, quick start guides, runbooks, and FAQs tailored to operators, developers, and business users.
- Act as the subject‑matter expert (SME) for onboarding, troubleshooting, and best practices for GenAI/LLM usage and safe model operations.
Production Process Control and Documentation
- Map and document production processes, data flows, deployment pipelines, and operational dependencies.
- Create runbooks, SOPs, and playbooks for routine operations, change management, and emergency procedures.
- Establish governance for change approvals, configuration management, and access controls.
Testing and Release Support
- Contribute to pre‑prod testing: functional, integration, performance, load, and model validation tests.
- Coordinate release readiness with QA, DevOps, and engineering; validate CI/CD pipelines and rollback mechanisms.
- Support canary and staged rollouts, monitor metrics during releases, and authorize promotion to production.
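The promotion decision during a canary rollout can be thought of as a comparison of canary metrics against the current production baseline. The sketch below is one possible gate under assumed tolerances; `ReleaseMetrics`, `canary_decision`, and the default thresholds are hypothetical and would need to be agreed per service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseMetrics:
    """Metrics sampled during a rollout (hypothetical structure)."""
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float  # 95th percentile latency in milliseconds


def canary_decision(
    baseline: ReleaseMetrics,
    canary: ReleaseMetrics,
    max_error_delta: float = 0.005,
    max_latency_ratio: float = 1.2,
) -> str:
    """Return "promote" or "rollback" based on assumed tolerances."""
    # Roll back if the canary fails noticeably more often than the baseline.
    if canary.error_rate - baseline.error_rate > max_error_delta:
        return "rollback"
    # Roll back if the canary is markedly slower than the baseline.
    if (
        baseline.p95_latency_ms > 0
        and canary.p95_latency_ms / baseline.p95_latency_ms > max_latency_ratio
    ):
        return "rollback"
    return "promote"


if __name__ == "__main__":
    baseline = ReleaseMetrics(error_rate=0.010, p95_latency_ms=200.0)
    canary = ReleaseMetrics(error_rate=0.011, p95_latency_ms=210.0)
    print(canary_decision(baseline, canary))
```

A real gate would typically also require a minimum sample size and observation period before trusting either metric.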
Cross‑Functional Collaboration and Vendor Management
- Work closely with Dev, SRE, Security, QA, and Product to prioritize reliability work and roadmap items.
- Coordinate with cloud providers and third‑party vendors for escalations, upgrades, and capacity planning.
- Communicate status and risks to leadership and stakeholders with clear, actionable reports.