Senior Architect, ML Engineering
oraclecloud
Job Description
Architecture & Technical Leadership
- Own the end-to-end architecture for RAG + agentic workflows (Plan → Execute → Verify) across enterprise use cases (contracts, PDFs, knowledge bases).
- Define architecture standards for multi-tenant isolation, API design, service boundaries, and integration patterns.
- Lead technical decision-making: build vs buy, model strategy (hosted vs open-weights), tooling selection, and performance/cost tradeoffs.
- Drive architecture reviews, mentor engineers/researchers, and raise the overall bar for engineering quality and research rigor.
RAG & Retrieval Systems (Enterprise-grade)
- Design retrieval pipelines that optimize grounded accuracy: chunking strategy, hybrid retrieval, reranking, query rewriting, and context construction.
- Define document ingestion patterns (PDF parsing, OCR, structured extraction, metadata enrichment) and index lifecycle strategies.
- Establish retrieval evaluation and regression frameworks (ground truth, offline/online evaluation, drift tracking).
- Enable async and event-driven architectures for long-running tasks using queues/streams (Kafka/RabbitMQ/Redis Streams) and/or durable workflow engines (Temporal).
Inference & Platform Engineering
- Architect model serving for high throughput and low latency using engines like vLLM / TGI / Triton / TorchServe (as applicable).
- Define GPU orchestration and capacity strategy on Kubernetes (AKS/EKS/GKE), including scale-to-zero, scheduling, and quota-based governance.
- Design platform-level controls for rate limiting, caching, backpressure, and cost containment (tenant quotas, token budgets, throttling).
Safety, Guardrails, Security & Compliance
- Own guardrail architecture for prompt injection defense, tool safety, policy enforcement, and PII handling (redaction patterns).
- Define secure-by-default patterns: secrets management, data protection, audit logs, and safe prompt/tool execution boundaries.
- Partner with security/compliance teams to meet enterprise standards (e.g., SOC 2 and GDPR expectations where relevant).
Observability, Reliability & Operational Excellence
- Establish SLOs and production readiness standards: error budgets, runbooks, incident response patterns.
- Define observability strategy across LLM calls and agent tools: tracing, metrics, logs, cost dashboards, and token usage reporting.
- Build reliability patterns for dependency failure (model provider downtime, throttling): circuit breakers, fallbacks, degradation strategies.