AI Engineer
IBM
Job Description
- Enable and optimize LLMs for training and inference on IBM Z, GPUs, and AI accelerators
- Drive performance improvements (latency, throughput, memory efficiency) for production workloads
- Implement LLM optimizations such as KV cache management, efficient attention, and optimized execution strategies
- Evaluate and validate LLMs at the model level and operator level to ensure functional correctness, numerical accuracy, and model quality
- Evaluate LLMs using quality and benchmarking frameworks (RAGAS, DeepEval, etc.)
- Analyze and optimize tensor shapes, strides, and memory layouts to ensure efficient and correct execution across PyTorch and accelerator backends
- Build and scale distributed training and inference systems across multi-GPU and multi-node environments
- Develop high-performance kernels (CUDA/Triton) for compute-intensive workloads such as attention and quantization
- Profile and debug performance using PyTorch Profiler, TensorBoard, and system-level tools, focusing on compute, memory, and communication bottlenecks
- Build and maintain scalable infrastructure (Docker, Kubernetes) for reproducible and stable deployments
- Collaborate with compiler and backend teams and contribute to the PyTorch ecosystem (TorchDynamo, TorchInductor)
Required education
Bachelor's Degree
Preferred education
Bachelor's Degree
Required technical and professional expertise
- 3+ years of experience in AI/ML systems, deep learning, or performance engineering
- Strong programming skills in Python (required) and a working knowledge of C++
- Strong understanding of PyTorch internals (Autograd, ATen, Dispatcher) and exposure to the compiler stack (TorchDynamo, TorchInductor, torch.compile)
- Good understanding of LLM architectures (Transformers, attention variants, KV cache, and efficient attention techniques such as FlashAttention or PagedAttention)
- Experience in model optimization and performance tuning (latency, throughput, memory)
- Strong understanding of tensor operations (shapes, strides, memory layouts) and their impact on execution
- Experience with distributed training/inference frameworks (FSDP, DeepSpeed, or similar)
- Familiarity with multi-GPU/multi-node environments and parallel execution
- Experience in profiling and debugging using tools like PyTorch Profiler, TensorBoard, or similar
- Good understanding of LLM evaluation and validation (performance and quality metrics)
- Experience with Linux environments and containerization (Docker)
- Strong problem-solving skills, with the ability to debug complex system-level and model-level issues