Applied AI Engineer

ycombinator

Bengaluru, India | 3 Years Exp | Posted 31d ago

What you'll do

Build and maintain the eval framework that scores voice agent quality end-to-end: transcription, response quality, TTS, and full-conversation outcomes
Design voice agent behavior: system prompts, tool use, conversation flow, error recovery, and guardrails for real-time interactions
Drive transcription accuracy improvements across STT providers and configurations (Deepgram, Whisper, AssemblyAI, NVIDIA, etc.)
Drive TTS quality improvements: voice selection, latency vs. fidelity tradeoffs, prosody, edge cases
Curate and grow our evaluation datasets, including hard-case mining from production traffic
Run rigorous A/B experiments and report results that the team can actually act on
Partner with backend engineers to wire eval signals into CI so regressions get caught before they ship

Must-haves

ML engineering experience shipping production systems
Strong Python and a working ML stack (PyTorch, Hugging Face, pandas, scikit-learn)
Hands-on experience designing LLM-based agents: prompting, tool/function calling, multi-turn state, structured outputs
Hands-on experience building evals or eval frameworks for ML, LLM, or voice systems. You have built LLM-as-judge eval pipelines and know their failure modes
Practical experience with ASR/STT: comparing providers, fine-tuning, or running open models like Whisper
Practical experience with TTS systems (ElevenLabs or open models)
Comfortable working with audio data: sample rates, codecs, noise, alignment

Nice-to-haves

Designed voice agents specifically: handled barge-in, interruption recovery, disfluencies, and natural turn-taking at the prompt/behavior layer
Experience with diarization, VAD, or endpointing models
Audio dataset curation, labeling, or annotation pipelines
Trained or fine-tuned ASR or TTS models from scratch or on domain audio
Experience with active learning or data-flywheel patterns over production traffic
Open-source contributions to AI/ML frameworks
Familiarity with cost/latency tradeoffs across model providers for real-time voice