Data Science Professional
bt
Job Description
What you’ll be doing
RAG Pipeline Implementation
• Implement document ingestion pipeline stages: PDF parser (PyMuPDF — table extraction, heading detection), DOCX parser (python-docx — structure preservation), HTML cleaner (BeautifulSoup — boilerplate removal), metadata extractor (doc_id, page_number, section_heading, source_url)
• Implement chunking strategies and expose them as a configurable per-KB parameter: fixed-size with configurable overlap, semantic (sentence boundary detection using the spaCy sentencizer), recursive (hierarchical structure-aware splitting), custom (user-defined regex split pattern)
• Build the embedding generation pipeline: sentence-transformers (BGE-large, E5-mistral) batch inference, async pipeline with queue-based worker pool, embedding dimension validation (768–3072), storage to pgvector with HNSW index maintenance
• Implement the hybrid retrieval pipeline: pgvector cosine similarity query (top-k dense), Elasticsearch BM25 query (top-k sparse), Reciprocal Rank Fusion score normalisation and merge, result deduplication by chunk_id
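For candidates unfamiliar with the merge step, a minimal sketch of Reciprocal Rank Fusion over a dense and a sparse result list follows. It assumes each retriever returns an ordered list of chunk_id strings; the function name, the constant k=60 (a commonly used default) and the sample ids are illustrative assumptions, not the team's actual implementation.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=None):
    """Fuse several ranked lists of chunk_ids into one list of (chunk_id, score).

    Each list contributes 1 / (k + rank) per chunk_id, with rank starting at 1.
    Summing per chunk_id also deduplicates results that appear in both lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Drop repeats inside a single list while keeping first-seen order.
        unique_ids = list(dict.fromkeys(ranking))
        for rank, chunk_id in enumerate(unique_ids, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk_id for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused if top_n is None else fused[:top_n]


# Hypothetical top-k results from the dense (pgvector) and sparse (BM25) queries.
dense_hits = ["c3", "c1", "c7"]
sparse_hits = ["c1", "c9", "c3"]
fused_hits = reciprocal_rank_fusion([dense_hits, sparse_hits])
fused_ids = [chunk_id for chunk_id, _ in fused_hits]
```

Because RRF works on ranks rather than raw scores, the cosine similarities and BM25 scores never need to be put on a common scale before merging.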
Guardrail ML Model Integration
• Integrate Presidio + custom spaCy NER into the input PII detection pipeline: load custom NER model trained on enterprise entity types, configure recogniser registry, implement redaction with entity-type-specific masking (e.g. EMPLOYEE_ID → [ID_REDACTED])
• Integrate DistilBERT prompt injection classifier: ONNX runtime inference for low-latency serving, threshold configuration (0.85 block / 0.50–0.85 flag), batch inference for high-throughput scenarios, model update workflow without pod restart
• Integrate Detoxify output toxicity model: multilabel classification (toxicity, severe_toxicity, obscene, threat, insult, identity_attack), per-label threshold configuration, structured result payload to audit log
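The threshold logic described above can be sketched as follows. This is an illustration only: the function names and the "allow" outcome are assumptions, and treating both boundaries as inclusive (a score of exactly 0.85 blocks, exactly 0.50 flags) is a choice the posting does not specify.

```python
def route_injection_score(score, block_at=0.85, flag_at=0.50):
    """Map a prompt injection classifier score in [0, 1] to an action."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be within [0, 1], got {score}")
    if score >= block_at:
        return "block"
    if score >= flag_at:
        return "flag"
    return "allow"  # assumed outcome below the flag band


def labels_over_threshold(label_scores, thresholds, default_threshold=0.5):
    """Return the sorted toxicity labels whose score meets its per-label threshold.

    Labels without an explicit threshold fall back to default_threshold,
    which is an assumption made for this sketch.
    """
    return sorted(
        label
        for label, score in label_scores.items()
        if score >= thresholds.get(label, default_threshold)
    )


# Hypothetical model outputs and configuration.
actions = [route_injection_score(s) for s in (0.91, 0.85, 0.60, 0.12)]
flagged = labels_over_threshold(
    {"toxicity": 0.90, "insult": 0.20, "threat": 0.60},
    {"toxicity": 0.80, "threat": 0.50},
)
```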
Memory and Embeddings
• Implement episodic memory read/write: encode user interaction into embedding (sentence-transformers), store to pgvector with user_id + agent_id + timestamp metadata, similarity search for memory recall (top-3 most relevant past interactions), TTL-based pruning job
• Implement organisational memory entity extraction: spaCy NER pipeline for entity identification (people, projects, products, policies) from agent conversations, entity deduplication, Neo4j node/relationship upsert, Cypher query interface for graph-augmented retrieval
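As a rough picture of the episodic recall and pruning steps, the sketch below uses an in-memory list in place of pgvector. The record layout, function names and 30-day TTL are assumptions made for illustration; in the real service the similarity search would run as a pgvector query.

```python
import math
from datetime import datetime, timedelta, timezone


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recall_memories(memories, query_embedding, user_id, agent_id, top_k=3):
    """Return the top_k most similar past interactions for one user and agent."""
    candidates = [
        m for m in memories
        if m["user_id"] == user_id and m["agent_id"] == agent_id
    ]
    candidates.sort(
        key=lambda m: cosine_similarity(m["embedding"], query_embedding),
        reverse=True,
    )
    return candidates[:top_k]


def prune_expired(memories, now, ttl):
    """Keep only the memories whose age does not exceed the TTL."""
    return [m for m in memories if now - m["timestamp"] <= ttl]


# Hypothetical records with toy 2-dimensional embeddings.
NOW = datetime(2024, 1, 31, tzinfo=timezone.utc)
MEMORIES = [
    {"id": "m1", "user_id": "u1", "agent_id": "a1", "embedding": [1.0, 0.0], "timestamp": NOW - timedelta(days=1)},
    {"id": "m2", "user_id": "u1", "agent_id": "a1", "embedding": [0.0, 1.0], "timestamp": NOW - timedelta(days=40)},
    {"id": "m3", "user_id": "u1", "agent_id": "a1", "embedding": [0.9, 0.1], "timestamp": NOW - timedelta(days=2)},
    {"id": "m4", "user_id": "u2", "agent_id": "a1", "embedding": [1.0, 0.0], "timestamp": NOW - timedelta(days=3)},
    {"id": "m5", "user_id": "u1", "agent_id": "a1", "embedding": [0.5, 0.5], "timestamp": NOW - timedelta(days=5)},
]
recalled_ids = [m["id"] for m in recall_memories(MEMORIES, [1.0, 0.0], "u1", "a1")]
kept_ids = [m["id"] for m in prune_expired(MEMORIES, NOW, timedelta(days=30))]
```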
Evaluation and Experimentation
• Implement RAGAS metric computation jobs: faithfulness, answer_relevancy, context_precision, context_recall — using RAGAS library against sampled invocations from ClickHouse, persist scores to eval_results table with agent_id + version + timestamp
• Build golden dataset management pipeline: curate high-quality invocations (RAGAS faithfulness ≥ 0.85) from ClickHouse into golden_dataset table, versioned golden sets per agent, diff comparison between golden set versions
• Run offline experiments: A/B compare chunking strategies, embedding models, retrieval k values, re-ranker models — track metrics (MRR@k, faithfulness, retrieval latency) in MLflow or ClickHouse experiment tracking, report findings to Lead with recommendation
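For reference, one of the retrieval metrics named above, MRR@k, can be computed as in this minimal sketch. The function name and sample data are hypothetical; it assumes each query has a ranked list of retrieved chunk_ids and a set of relevant chunk_ids, and that a query with no relevant result in the top k contributes zero.

```python
def mrr_at_k(ranked_results, relevant_sets, k):
    """Mean Reciprocal Rank at cutoff k across a batch of queries.

    ranked_results: one ranked list of chunk_ids per query.
    relevant_sets:  one set of relevant chunk_ids per query, in the same order.
    """
    if len(ranked_results) != len(relevant_sets):
        raise ValueError("ranked_results and relevant_sets must align per query")
    if not ranked_results:
        return 0.0
    total = 0.0
    for results, relevant in zip(ranked_results, relevant_sets):
        for rank, chunk_id in enumerate(results[:k], start=1):
            if chunk_id in relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_results)


# Three hypothetical queries: first hit at rank 2, at rank 1, and no hit.
runs = [["a", "b", "c"], ["x", "y", "z"], ["p", "q", "r"]]
gold = [{"b"}, {"x"}, {"zz"}]
score_at_3 = mrr_at_k(runs, gold, k=3)
score_at_1 = mrr_at_k(runs, gold, k=1)
```

Comparing the same metric at several values of k is one simple way to see whether a change to chunking or re-ranking moves relevant chunks nearer the top or only into the tail.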
Essential Skills / Experience
Systems Architecture
• Distributed systems design - CAP theorem trade-offs, eventual vs strong consistency selection per use case (CockroachDB for Cedar policies, Redis for budget counters), partition tolerance in multi-node ML serving, failure isolation between pipeline stages
• Multi-tenant SaaS architecture - per-tenant data isolation in vector stores (pgvector namespace by workspace_id), tenant-scoped RAGAS baselines, Restricted data routing enforcement (Cedar hard-deny to cloud LLM), resource quota enforcement via Kubernetes LimitRange
• Event-driven architecture - Kafka event sourcing for audit trail, NATS JetStream for real-time policy invalidation (<5s), async embedding ingestion pipeline via Kafka consumer group, evaluation result streaming to ClickHouse
• API gateway patterns
• Microservices communication
LLM Orchestration and Agentic Patterns
• Deep knowledge of LLM orchestration patterns - prompt assembly (system prompt + persona + memory context + RAG context + conversation history + user input), context window management, token budget allocation across pipe