AI Solutions and Platforms Operations Engineer
pepsicojobs
Job Description
Responsibilities
- AI Agent Operations Center (70%)
- Build “operations center” capabilities for agent runtime management: agent registry, versioning, deployment tracking, and run histories
- Enable operational workflows such as incident triage, replay/debug runs, trace correlation, and root-cause analysis across agent steps
- Implement operational dashboards and views for agent health: success rate, latency, tool failure rate, cost per run, and loop detection
- Instrument agent flows end-to-end using OpenTelemetry (or equivalent), enabling correlation across prompts, tool calls, retrieval, and responses
- Implement semantic conventions and tagging standards (agent name/version, tool name, model provider, environment, tenant/app)
- Partner with SRE/observability teams to ensure production-grade monitoring, alerting, and operational readiness
- Collaboration with Teams (10%)
- Collaborate with transformation teams and business stakeholders to understand requirements and tailor AI agents to specific domains
- Work closely with AI platform teams to build scalable and cross-domain AI agents while ensuring end-to-end observability
- Integration & Deployment (10%)
- Build and maintain CI/CD pipelines for agent services and operations center components, including automated testing and deployment
- Automate onboarding for new agent use cases (templates, scaffolding, configuration checks)
- Drive best practices for secure, scalable, and cost-effective agent deployments
- Continuous Learning (10%)
- Stay current with advances in AI and machine learning technologies and integrate them into existing or new AI agents
- Test and validate AI agents and solutions thoroughly to ensure their reliability and accuracy