Senior Architect, ML Engineering

oraclecloud

Pune | 13 Years Exp | Posted 1h ago

Job Description

  • Architecture & Technical Leadership

    • Own the end-to-end architecture for RAG + agentic workflows (Plan → Execute → Verify) across enterprise use cases (contracts, PDFs, knowledge bases).
    • Define architecture standards for multi-tenant isolation, API design, service boundaries, and integration patterns.
    • Lead technical decision-making: build vs buy, model strategy (hosted vs open-weights), tooling selection, and performance/cost tradeoffs.
    • Drive architecture reviews, mentor engineers/researchers, and raise the overall bar for engineering quality and research rigor.
  • RAG & Retrieval Systems (Enterprise-grade)
    • Design retrieval pipelines that optimize grounded accuracy: chunking strategy, hybrid retrieval, reranking, query rewriting, and context construction.
    • Define document ingestion patterns (PDF parsing, OCR, structured extraction, metadata enrichment) and index lifecycle strategies.
    • Establish retrieval evaluation and regression frameworks (ground truth, offline/online evaluation, drift tracking).
    • Enable async and event-driven architectures for long-running tasks using queues/streams (Kafka/RabbitMQ/Redis Streams) and/or durable workflow engines (Temporal).

  • Inference & Platform Engineering
    • Architect model serving for high throughput and low latency using engines like vLLM / TGI / Triton / TorchServe (as applicable).
    • Define GPU orchestration and capacity strategy on Kubernetes (AKS/EKS/GKE), including scale-to-zero, scheduling, and quota-based governance.
    • Design platform-level controls for rate limiting, caching, backpressure, and cost containment (tenant quotas, token budgets, throttling).
  • Safety, Guardrails, Security & Compliance
    • Own guardrail architecture for prompt injection defense, tool safety, policy enforcement, and PII handling (redaction patterns).
    • Define secure-by-default patterns: secrets management, data protection, audit logs, and safe prompt/tool execution boundaries.
    • Partner with security/compliance teams to meet enterprise standards (e.g., SOC 2 and GDPR expectations where relevant).
  • Observability, Reliability & Operational Excellence
    • Establish SLOs and production readiness standards: error budgets, runbooks, incident response patterns.
    • Define observability strategy across LLM calls and agent tools: tracing, metrics, logs, cost dashboards, and token usage reporting.
    • Build reliability patterns for dependency failure (model provider downtime, throttling): circuit breakers, fallbacks, degradation strategies.
