Infrrd.ai - Senior Data Scientist - LLM/Artificial Intelligence
Job Description
Job Duties and Responsibilities :
- Design and build agentic evaluation pipelines : error detection → root cause hypothesis generation → prompt variant testing → A/B measurement → production promotion, with minimal human intervention.
- Own the accuracy measurement infrastructure : Automate error analysis, data quality pipelines, and batch evaluation frameworks across document types and customer configurations.
- Build and evolve internal accuracy tooling from manual utilities into automated improvement platforms - classification and extraction correction loops, NTP rule generation, performance reporting.
- Take prototype methodologies and productionize them into reliable, scalable systems the team can operate independently.
- Build LLM-based extraction and classification pipelines using few-shot and RAG strategies for complex, real-world document types.
- Design and maintain A/B testing infrastructure for prompt and model changes - no untested changes go to production.
- Create live dashboards tracking extraction accuracy, NTP rates, and false positive rates across document types and customer configurations.
- Optimize LLM costs while maintaining quality : prompt compression, output token minimization, model selection and migration strategies.
- Write production-grade data pipelines with error handling, retries, logging, and monitoring.
- Collaborate with platform engineering and applied research functions on architecture and methodology translation.
- Mentor 1 - 2 junior engineers; build tooling and documentation they can operate independently.
Required Qualifications :
- BE / MTech in Computer Science, AI/ML, Computational Data Science (CDS), Computer Science & Automation (CSA), or related discipline.
Experience Range :
- 8 - 10 years total; minimum 4 - 6 years building production LLM or AI systems; minimum 4 - 6 years in evaluation, quality measurement, or accuracy improvement work.
"Must-have" Skills :
- Production-grade Python - clean, tested, maintainable systems; not just scripts (pytest, FastAPI or Flask)
- Hands-on LLM API experience (OpenAI, Anthropic, Gemini, AWS Bedrock or equivalent) with systematic, measurement-driven prompt engineering - methodology over instinct
- Agentic pipeline design - multi-step reasoning, tool use, orchestration frameworks (LangChain, LlamaIndex or equivalent), automated evaluation and feedback loops
- Evaluation framework design for LLM systems - precision/recall/F1, confusion matrices, A/B testing, per-class error analysis
- Analytical depth sufficient to design meaningful accuracy metrics and interpret why a model fails on a specific document or field type
- MongoDB or equivalent NoSQL - queries, aggregations, indexing
- pandas / NumPy for data processing and batch analysis
- Git, code reviews, CI/CD basics (GitHub Actions or Jenkins)
- Clear written communication - able to explain model behaviour and accuracy findings to non-technical stakeholders
"Would-be-nice" Skills :
- Document AI : PDF parsing, layout-aware extraction, OCR, structured form extraction
- RAG pipeline design and vector search (Pinecone, Weaviate, or similar)
- Classification systems with large label spaces (50+ classes)
- Async Python (asyncio, aiohttp) for pipeline throughput
- Embedding models and semantic similarity for document matching
- Prior experience working alongside a Research or Applied Science team as the engineering counterpart
Working Knowledge (Tools) :
- Python, FastAPI / Flask, MongoDB, Git, GitHub Actions / Jenkins, LLM APIs (OpenAI / Anthropic / Gemini or equivalent), LangChain / LlamaIndex, pandas / NumPy, pytest, Docker
General Knowledge :
- NLP concepts, LLM prompt engineering patterns, REST APIs, RAG pipelines, vector databases, JSON data structures