Applied AI Engineer
ycombinator
Job Description
What you'll do
- Build and maintain the eval framework that scores voice agent quality end-to-end: transcription, response quality, TTS, and full-conversation outcomes
- Design voice agent behavior: system prompts, tool use, conversation flow, error recovery, and guardrails for real-time interactions
- Drive transcription accuracy improvements across STT providers and configurations (Deepgram, Whisper, AssemblyAI, NVIDIA, etc.)
- Drive TTS quality improvements: voice selection, latency vs. fidelity tradeoffs, prosody, edge cases
- Curate and grow our evaluation datasets, including hard-case mining from production traffic
- Run rigorous A/B experiments and report results that the team can actually act on
- Partner with backend engineers to wire eval signals into CI so regressions get caught before they ship
Must-haves
- ML engineering experience shipping production systems
- Strong Python and a working ML stack (PyTorch, Hugging Face, pandas, scikit-learn)
- Hands-on experience designing LLM-based agents: prompting, tool/function calling, multi-turn state, structured outputs
- Hands-on experience building evals or eval frameworks for ML, LLM, or voice systems, including LLM-as-judge eval pipelines and knowledge of their failure modes
- Practical experience with ASR/STT: comparing providers, fine-tuning, or running open models like Whisper
- Practical experience with TTS systems (ElevenLabs or open models)
- Comfortable working with audio data: sample rates, codecs, noise, alignment
Nice-to-haves
- Designed voice agents specifically: handled barge-in, interruption recovery, disfluencies, and natural turn-taking at the prompt/behavior layer
- Experience with diarization, VAD, or endpointing models
- Audio dataset curation, labeling, or annotation pipelines
- Trained or fine-tuned ASR or TTS models from scratch or on domain audio
- Experience with active learning or data-flywheel patterns over production traffic
- Open-source contributions to AI/ML frameworks
- Familiarity with cost/latency tradeoffs across model providers for real-time voice