Senior Staff AI/ML Scale Engineer
Job Description
- Simulation & Modeling – Implement workflows to study AI/ML workloads using trace-driven and analytical models.
- Performance Analysis – Profile and analyze system bottlenecks across compute, memory, and network layers.
- Networking Studies – Evaluate collective communication performance (all-reduce, all-to-all, reduce-scatter) across different topologies and fabrics.
- Tooling & Automation – Develop utilities for trace generation, merging, conversion, and visualization.
- Prototype & Validation – Test distributed training and inference pipelines in simulated and real environments.
- Hardware/Software Co-Design – Collaborate on emerging technologies (CXL, DPUs, NVLink, PCIe, UET/UEC, in-network compute).
- Scaling Studies – Conduct performance projections and trade-off studies for next-gen AI infrastructure.
- Knowledge Sharing – Document workflows, publish internal reports, and drive peer learning.
What We're Looking For
- Bachelor’s, Master’s, or PhD in Computer Science, Electrical Engineering, or a related field, with 4–12 years of relevant professional experience.
- Strong foundation in computer architecture, distributed systems, AI/ML, and operating systems.
- Solid networking fundamentals, including TCP/IP, RDMA, RoCE, UET/UEC, and switching/routing.
- Experience with simulation frameworks (e.g., Astra-Sim, Chakra, gem5, SST, NS-3).
- Hands-on experience with PyTorch/TensorFlow and distributed training frameworks (DDP, Horovod, DeepSpeed).
- Strong programming skills in Python and C++, plus scripting for automation.
- Familiarity with interconnect and memory technologies (CXL, PCIe, NVLink, UAL).
- Experience with profiling, telemetry, observability, and debugging tools.
- Knowledge of collective communication algorithms and topology-aware scheduling.
- Exposure to AI accelerators, memory disaggregation, DPUs, and custom silicon.
- Familiarity with visualization tools (Perfetto, Chrome Tracing, Chakra Timeline, Flamegraphs).
- Experience with large-scale AI training pipelines and scaling studies.
- Interest in energy/performance trade-offs and resilience techniques.