AI Solutions and Platforms Operations Engineer

pepsicojobs

Hyderabad | 3 Years Exp | Posted 12d ago

Job Description

Responsibilities

  1. AI Agent Operations Center (70%)
    • Build “operations center” capabilities for agent runtime management: agent registry, versioning, deployment tracking, and run histories
    • Enable operational workflows such as incident triage, replay/debug runs, trace correlation, and root-cause analysis across agent steps
    • Implement operational dashboards and views for agent health: success rate, latency, tool failure rate, cost per run, and loop detection
    • Instrument agent flows end-to-end using OpenTelemetry (or equivalent), enabling correlation across prompts, tool calls, retrieval, and responses
    • Implement semantic conventions and tagging standards (agent name/version, tool name, model provider, environment, tenant/app)
    • Partner with SRE/observability teams to ensure production-grade monitoring, alerting, and operational readiness
  2. Collaboration with Teams (10%)
    • Collaborate with transformation teams and business stakeholders to understand requirements and tailor AI agents to specific domains
    • Work closely with AI platform teams to build scalable, cross-domain AI agents while ensuring end-to-end observability
  3. Integration & Deployment (10%)
    • Build and maintain CI/CD pipelines for agent services and operations center components, including automated testing and deployment
    • Automate onboarding for new agent use cases (templates, scaffolding, configuration checks)
    • Drive best practices for secure, scalable, and cost-effective agent deployments
  4. Continuous Learning (10%)
    • Stay current with advances in AI and machine learning technologies and integrate them into existing or new AI agents
    • Conduct thorough testing and validation to ensure the reliability and accuracy of AI agents and solutions
