Data Science Professional

bt

Bengaluru, India | NM Years Exp | Posted 41d ago

Job Description

What you’ll be doing

RAG Pipeline Implementation
•    Implement document ingestion pipeline stages: PDF parser (PyMuPDF — table extraction, heading detection), DOCX parser (python-docx — structure preservation), HTML cleaner (BeautifulSoup — boilerplate removal), metadata extractor (doc_id, page_number, section_heading, source_url)
•    Implement chunking strategies and expose them as a configurable per-KB parameter: fixed-size with configurable overlap, semantic (sentence boundary detection using the spaCy sentencizer), recursive (hierarchical structure-aware splitting), custom (user-defined regex split pattern)
•    Build the embedding generation pipeline: sentence-transformers (BGE-large, E5-mistral) batch inference, async pipeline with queue-based worker pool, embedding dimension validation (768–3072), storage to pgvector with HNSW index maintenance
•    Implement the hybrid retrieval pipeline: pgvector cosine similarity query (top-k dense), Elasticsearch BM25 query (top-k sparse), Reciprocal Rank Fusion score normalisation and merge, result deduplication by chunk_id
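
As an illustration of the fusion step in the hybrid retrieval pipeline, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of chunk_ids. The function name, the smoothing constant k=60 (a commonly used default, not specified in this posting) and the toy chunk_ids are assumptions for illustration only. A real implementation would take its inputs from the pgvector and Elasticsearch queries.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=None):
    """Merge ranked lists of chunk_ids with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) for every chunk_id it contains,
    with rank starting at 1. Keying scores by chunk_id also deduplicates
    chunks that appear in both the dense and the sparse result list.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        seen = set()
        rank = 0
        for chunk_id in ranking:
            if chunk_id in seen:  # ignore repeats within a single list
                continue
            seen.add(chunk_id)
            rank += 1
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk_id for determinism.
    merged = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return merged if top_n is None else merged[:top_n]


# Hypothetical top-k results from the dense and sparse retrievers.
dense_hits = ["c1", "c2", "c3"]
sparse_hits = ["c3", "c1", "c4"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Because RRF works on ranks rather than raw scores, cosine similarities and BM25 scores do not need to be brought onto a common scale before merging.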

Guardrail ML Model Integration
•    Integrate Presidio + custom spaCy NER into the input PII detection pipeline: load custom NER model trained on enterprise entity types, configure recogniser registry, implement redaction with entity-type-specific masking (e.g. EMPLOYEE_ID → [ID_REDACTED])
•    Integrate DistilBERT prompt injection classifier: ONNX runtime inference for low-latency serving, threshold configuration (≥ 0.85 block / 0.50–0.85 flag), batch inference for high-throughput scenarios, model update workflow without pod restart
•    Integrate Detoxify output toxicity model: multilabel classification (toxicity, severe_toxicity, obscene, threat, insult, identity_attack), per-label threshold configuration, structured result payload to audit log
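
The threshold configuration for the prompt injection classifier could be sketched roughly as follows. The class and function names are hypothetical, and the boundary handling (a score of exactly 0.85 blocks, exactly 0.50 flags, anything lower is allowed) is an assumption, since the posting only gives the two bands.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InjectionThresholds:
    """Configurable cut-offs; defaults mirror the bands in the posting."""
    block: float = 0.85
    flag: float = 0.50


def decide(score, thresholds=InjectionThresholds()):
    """Map a classifier probability to 'block', 'flag' or 'allow'."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= thresholds.block:
        return "block"
    if score >= thresholds.flag:
        return "flag"
    return "allow"


# Batch usage over hypothetical classifier outputs.
decisions = [decide(s) for s in (0.97, 0.62, 0.08)]
```

Keeping the thresholds in a small immutable object makes it straightforward to load them from configuration and swap them at runtime.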

Memory and Embeddings
•    Implement episodic memory read/write: encode user interaction into embedding (sentence-transformers), store to pgvector with user_id + agent_id + timestamp metadata, similarity search for memory recall (top-3 most relevant past interactions), TTL-based pruning job
•    Implement organisational memory entity extraction: spaCy NER pipeline for entity identification (people, projects, products, policies) from agent conversations, entity deduplication, Neo4j node/relationship upsert, Cypher query interface for graph-augmented retrieval
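
To make the episodic memory recall step concrete, the sketch below ranks stored interactions by cosine similarity and returns the top 3 for a given user_id and agent_id. It is an in-memory stand-in, not the pgvector query itself. The record layout, function names and toy 2-dimensional embeddings are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recall_memories(query_vec, memories, user_id, agent_id, top_k=3):
    """Return the top_k most similar memories scoped to one user and agent."""
    candidates = [
        m for m in memories
        if m["user_id"] == user_id and m["agent_id"] == agent_id
    ]
    candidates.sort(
        key=lambda m: cosine_similarity(query_vec, m["embedding"]),
        reverse=True,
    )
    return candidates[:top_k]


# Hypothetical stored interactions with toy embeddings.
memories = [
    {"id": "m1", "user_id": "u1", "agent_id": "a1", "embedding": [1.0, 0.0]},
    {"id": "m2", "user_id": "u1", "agent_id": "a1", "embedding": [0.0, 1.0]},
    {"id": "m3", "user_id": "u1", "agent_id": "a1", "embedding": [0.9, 0.1]},
    {"id": "m4", "user_id": "u2", "agent_id": "a1", "embedding": [1.0, 0.0]},
    {"id": "m5", "user_id": "u1", "agent_id": "a1", "embedding": [0.5, 0.5]},
]
recalled = recall_memories([1.0, 0.0], memories, user_id="u1", agent_id="a1")
```

Filtering on user_id and agent_id before ranking keeps one user's memories from leaking into another user's context.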

Evaluation and Experimentation
•    Implement RAGAS metric computation jobs: faithfulness, answer_relevancy, context_precision, context_recall — using RAGAS library against sampled invocations from ClickHouse, persist scores to eval_results table with agent_id + version + timestamp
•    Build golden dataset management pipeline: curate high-quality invocations (RAGAS faithfulness ≥ 0.85) from ClickHouse into golden_dataset table, versioned golden sets per agent, diff comparison between golden set versions
•    Run offline experiments: A/B compare chunking strategies, embedding models, retrieval k values, re-ranker models — track metrics (MRR@k, faithfulness, retrieval latency) in MLflow or ClickHouse experiment tracking, report findings to Lead with recommendation
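
Of the experiment metrics listed above, MRR@k is simple enough to sketch directly. The version below assumes each query has a ranked list of retrieved chunk_ids and a set of relevant chunk_ids. The function name and the toy data are illustrative, and queries with no relevant hit in the top k contribute zero.

```python
def mrr_at_k(ranked_results, relevant_sets, k):
    """Mean Reciprocal Rank at cut-off k.

    For each query, take 1 / rank of the first relevant chunk within the
    top k results (0 if none is found), then average over all queries.
    """
    if not ranked_results:
        return 0.0
    total = 0.0
    for retrieved, relevant in zip(ranked_results, relevant_sets):
        for rank, chunk_id in enumerate(retrieved[:k], start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_results)


# Hypothetical retrieval results for three queries.
ranked = [["a", "b", "c"], ["x", "y", "z"], ["p", "q"]]
relevant = [{"b"}, {"z"}, {"n"}]
score_at_2 = mrr_at_k(ranked, relevant, k=2)
score_at_3 = mrr_at_k(ranked, relevant, k=3)
```

Computing the same metric for each chunking strategy or embedding model on a fixed query set gives a like-for-like comparison across experiment runs.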

Essential Skills / Experience

Systems Architecture
•    Distributed systems design - CAP theorem trade-offs, eventual vs strong consistency selection per use case (CockroachDB for Cedar policies, Redis for budget counters), partition tolerance in multi-node ML serving, failure isolation between pipeline stages
•    Multi-tenant SaaS architecture - per-tenant data isolation in vector stores (pgvector namespace by workspace_id), tenant-scoped RAGAS baselines, Restricted data routing enforcement (Cedar hard-deny to cloud LLM), resource quota enforcement via Kubernetes LimitRange
•    Event-driven architecture - Kafka event sourcing for audit trail, NATS JetStream for real-time policy invalidation (<5s), async embedding ingestion pipeline via Kafka consumer group, evaluation result streaming to ClickHouse
•    API gateway patterns 
•    Microservices communication 

LLM Orchestration and Agentic Patterns
•    Deep knowledge of LLM orchestration patterns - prompt assembly (system prompt + persona + memory context + RAG context + conversation history + user input), context window management, token budget allocation across pipe
