Lead AI Engineer
Capgemini
Job Description
In this role, you will:
- Architect and implement distributed AI runtime systems with elastic scaling and job recovery.
- Optimize low-level performance (CUDA, NCCL, PyTorch internals) for multi-GPU workloads.
- Develop custom runtime architectures for large-scale AI training pipelines.
- Integrate orchestration tools such as Kubernetes, Ray, TorchElastic, and Horovod for containerized AI workloads.
- Implement fault recovery mechanisms and observability hooks for runtime health monitoring.
- Collaborate with AI researchers and platform engineers to ensure efficient resource utilization and throughput optimization.
- Contribute to CI/CD pipelines for AI infrastructure and runtime deployments.