Senior Staff AI/ML Scale Engineer


Hyderabad, India | 4 Years Exp | Posted 70d ago

Job Description

  • Simulation & Modeling – Implement workflows to study AI/ML workloads using trace-driven and analytical models.

  • Performance Analysis – Profile and analyze system bottlenecks across compute, memory, and network layers.

  • Networking Studies – Evaluate collective communication performance (all-reduce, all-to-all, reduce-scatter) across different topologies and fabrics.

  • Tooling & Automation – Develop utilities for trace generation, merging, conversion, and visualization.

  • Prototyping & Validation – Test distributed training and inference pipelines in simulated and real environments.

  • Hardware/Software Co-Design – Collaborate on emerging technologies (CXL, DPUs, NVLink, PCIe, UET/UEC, in-network compute).

  • Scaling Studies – Conduct performance projections and trade-off studies for next-gen AI infrastructure.

  • Knowledge Sharing – Document workflows, publish internal reports, and drive peer learning.

 

What We're Looking For

  • Bachelor’s, Master’s, or PhD in Computer Science, Electrical Engineering, or a related field, with 4–12 years of relevant professional experience.

  • Strong foundation in computer architecture, distributed systems, AI/ML, and operating systems.

  • Solid networking fundamentals including TCP/IP, RDMA, RoCE, UET/UEC, and switching/routing.

  • Experience with simulation frameworks (e.g., Astra-Sim, Chakra, gem5, SST, NS-3).

  • Hands-on experience with PyTorch/TensorFlow and distributed training frameworks (DDP, Horovod, DeepSpeed).

  • Strong programming skills in Python, C++, and scripting for automation.

  • Familiarity with interconnect and memory technologies (CXL, PCIe, NVLink, UAL).

  • Experience with profiling, telemetry, observability, and debugging tools.

  • Knowledge of collective communication algorithms and topology-aware scheduling.

  • Exposure to AI accelerators, memory disaggregation, DPUs, and custom silicon.

  • Familiarity with visualization tools (Perfetto, Chrome Tracing, Chakra Timeline, Flamegraphs).

  • Experience with large-scale AI training pipelines and scaling studies.

  • Interest in energy/performance trade-offs and resilience techniques.

     

 
