Spydra - Performance Engineer

hirist

Bangalore | 6 Years Exp | Posted 30d ago

Job Description

What you'll do:

- Engineer and tune the Dynamo-based model deployment engine: backend selection, disaggregated prefill / decode, wide-EP, KV-cache routing.

- Tune the vLLM / SGLang runtime on AMD ROCm and NVIDIA GPUs: batch shapes, attention kernels, CUDA / HIP graphs, paged-attention block size, chunked prefill.

- Own quantisation policy (FP8 E4M3, INT4, AWQ / GPTQ) and the accuracy-vs-throughput trade-off per model family.

- Build the Envoy filter chain that fronts the inference endpoint: auth, rate-limit, request shaping, observability, retries, circuit-breaking.

- Integrate LLM guardrails (llm-guard / NeMo Guardrails / open-source equivalents) for prompt filtering, PII redaction, jailbreak / toxicity detection, and policy enforcement at the edge.

- Stand up and run the benchmark harness (AIPerf / locust-llm / custom), with regression suites that gate every Dynamo / vLLM / guardrail release.

- Profile end-to-end with nsys / rocprof / pyroscope; identify and eliminate stalls in the serving path.

- Publish an SLO dashboard (TTFT p95, ITL p95, tokens / GPU-second, $/Mtok) and own it through launches.

Must have:

- Strong hands-on experience with at least one inference runtime (vLLM, SGLang, TensorRT-LLM, TGI, or Triton) in production.

- Working knowledge of transformer internals: attention, KV cache, rotary embeddings, MoE routing, speculative decoding.

- GPU profiling and kernel-level debugging (nsys / nvprof / rocprof / hip-clang). Comfortable reading CUDA / HIP code.

- Envoy / service-mesh production experience: rate-limit service, ext_authz, Wasm filters.

- Python and C++; comfortable shipping patches upstream to vLLM / SGLang / Dynamo when needed.

- Solid DevOps fundamentals: containers, Helm, GitOps, CI/CD for model / engine releases.
 
