Cloud Integrated Infrastructure Engineer III
Deloitte
Job Description
As a Cloud Integrated Infrastructure Engineer III on the Hybrid Cloud Infrastructure team, you will:
- Contribute to reference architectures for artificial intelligence and high-performance computing infrastructure across compute, network, storage, platform, and software layers in edge, data center, and hybrid environments
- Translate business requirements into scalable, secure, and cost-optimized solutions while supporting architecture, design, and integration decisions
- Configure and implement NVIDIA platforms, graphics processing unit clusters, orchestration layers, and hybrid infrastructure components for artificial intelligence and high-performance computing workloads
- Develop infrastructure as code and automation using Terraform, Ansible, and GitOps, and support observability, site reliability engineering, resilience, and security practices
- Troubleshoot graphics processing unit, hardware, connectivity, and software issues, and collaborate with cross-functional teams to support delivery quality and operational outcomes
The team
At Hybrid Cloud Infrastructure, we deliver solutions spanning Hybrid Cloud, Advanced Connectivity, AI Data Centers, High-Performance Computing, and AI Infrastructure to help clients achieve their desired outcomes. Our offerings include engineered transformation services for hybrid cloud infrastructure and platforms, prioritizing resiliency, optimization, and extensive automation. We integrate Advanced Connectivity with AI Infrastructure and AI to boost operational efficiency and enable real-time data processing, which is crucial for low-latency enterprise operational technology (OT) applications. Additionally, we manage all facets of operations for hybrid cloud infrastructure and field operations.
Location: Bengaluru/Hyderabad/Pune
Shift Timings: As per business requirements
Qualifications
Required:
- 6-9 years of experience in infrastructure engineering or implementation for large-scale platforms, including design, implementation, operations, and optimization
- Experience building or supporting graphics processing unit-accelerated platforms for artificial intelligence, machine learning, or high-performance computing workloads
- Experience with Linux system administration in production environments
- Experience deploying or operating distributed compute clusters for artificial intelligence or high-performance computing in hybrid cloud environments, including multi-graphics processing unit configurations, scheduler integration, and edge-to-cloud scaling
- Experience with high-performance networking or storage for artificial intelligence or high-performance computing
- Experience building containerized platforms using Kubernetes or Red Hat OpenShift, including graphics processing unit operators, drivers, CUDA container runtime, and cluster lifecycle automation
- Experience automating infrastructure as code using Terraform and Ansible
Preferred:
- Experience implementing artificial intelligence or high-performance computing cluster scheduling using Slurm and Kubernetes, including multi-tenant queues, quotas, and graphics processing unit-aware policies
- Experience supporting generative artificial intelligence infrastructure patterns, including multi-node distributed training
- Experience with artificial intelligence agents and frameworks
- Experience with high-throughput storage for artificial intelligence or high-performance computing
- Exposure to pre-sales or sales engineering activities, including discovery sessions, solution demonstrations, and proposal or request for proposal contributions
- Hands-on involvement in at least one end-to-end deployment of a reference architecture in cloud or on-premises environments, including security controls, network segmentation, operational runbooks, and validation testing