Lead AI/ML Engineer
spglobal
Job Description
1) Agentic Systems Architecture & Core Engineering
- Design and build multi-agent workflows: Lead hands-on engineering of stateful agentic applications using agent orchestration frameworks capable of coordinating multiple autonomous components.
- Agent-to-agent collaboration: Define and implement robust communication patterns that allow agents to delegate sub-tasks, negotiate execution paths, and coordinate outcomes in dynamic environments.
- State, memory, and long-running execution: Engineer control flows for non-deterministic systems, including message passing, persistent memory, recoverability, and interruptible execution for long-running tasks.
- Standardized tool interfaces: Establish universal interfaces between agents, enterprise data sources, and operational tools to ensure modularity, reusability, and consistent governance.
- Model integration and runtime optimization: Build routing and fallback strategies across multiple model endpoints; optimize context management, latency, and inference cost while maintaining reliability.
- Production deployment: Package and deploy workloads via containerization and cluster orchestration, using cloud-native services for scaling, isolation, and secure runtime operations.
2) Data Engineering & Operational Real-Time Integration
- Build agent-ready data pipelines: Develop and maintain high-throughput ingestion and transformation pipelines that convert raw operational signals into structured, machine-consumable context.
- Real-time context injection: Ensure agents can access near-real-time operational data by designing efficient retrieval patterns and optimizing vector databases and associated retrieval architectures.
- Cross-functional execution: Serve as the technical bridge between AI and data teams, translating agent needs into schemas, data contracts, SLAs, and pipeline specifications while resolving bottlenecks hands-on.
3) Observability, Governance & Human-in-the-Loop
- LLMOps, tracing, and debugging: Implement end-to-end observability for agent execution, including reasoning traces, performance telemetry, cost monitoring, and production debugging workflows.
- Safety and control frameworks: Design hybrid autonomy modes ranging from human-in-the-loop to fully autonomous, including approval gates, policy enforcement, and “break-glass” controls for sensitive operations.
- Evaluation and reliability standards: Establish rigorous testing strategies for stochastic systems; automate evaluation pipelines to measure accuracy, failure modes, drift, and regression risk prior to deployment.
4) Technical Leadership & Strategy
- Define the agentic architecture roadmap: Partner with product and engineering leadership to scope feasibility, set technical direction, and prioritize high-impact autonomous initiatives.
- Mentorship and engineering standards: Set expectations for code quality, architectural patterns, and review processes; mentor engineers to strengthen their agentic engineering practices.
- Innovation to production: Rapidly prototype emerging approaches (e.g., advanced retrieval strategies, graph-based reasoning patterns) and mature successful experiments into supported production capabilities.