Lead AI Engineer

Capgemini

Bangalore | 7 Years Exp | Posted 180d ago

Job Description

In this role, you will:

  • Architect and implement distributed AI runtime systems with elastic scaling and job recovery.
  • Optimize low-level performance (CUDA, NCCL, PyTorch internals) for multi-GPU workloads.
  • Develop custom runtime architectures for large-scale AI training pipelines.
  • Integrate orchestration tools such as Kubernetes, Ray, TorchElastic, and Horovod for containerized AI workloads.
  • Implement fault recovery mechanisms and observability hooks for runtime health monitoring.
  • Collaborate with AI researchers and platform engineers to ensure efficient resource utilization and throughput optimization.
  • Contribute to CI/CD pipelines for AI infrastructure and runtime deployments.
