Sr Software Engineer, Data & AI Platform

Dolby

Bangalore

Key Responsibilities:

Design and build platform primitives (Python SDKs, platform APIs, and templates) that enable reproducible experiments, configuration-as-code workflows, model lineage, and artifact tracking, supporting seamless promotion from research to production.
Create developer tools (CLIs, UIs, dashboards, and visualization layers) that improve the developer experience and simplify platform operation and multi-stage workflows.
Implement and scale distributed training systems (multi-node GPU workloads) on top of a Kubernetes and cloud-based orchestration foundation.
Build large-scale evaluation frameworks for offline tests, shadow deployments, and A/B experimentation.
Implement model/dataset versioning, approvals, lineage tracking, retention, and compliance hooks.
Partner with AI/ML research, platform engineering/MLOps and infrastructure, and data engineering teams to generalize workflows into reusable frameworks.
Partner with platform engineering/MLOps and infrastructure to define observability stacks for metrics, drift indicators, performance regressions, training/inference health signals, production reliability (SLIs/SLOs), monitoring, and incident response.

What you need to succeed

Desired Background:

BS in Computer Science, Mathematics, Engineering, or an equivalent technical field. Master’s preferred.
Proven track record building large-scale distributed systems and integrated data and AI/ML platforms (e.g., training, serving, workflow orchestration, data pipelines).
Expert-level proficiency in Python and one of Go/Java/C++, with experience building production-grade services, APIs, and SDKs.
Extensive hands-on experience with Kubernetes (EKS, GKE, self-hosted, etc.), including autoscaling and job scheduling frameworks, GPU infrastructure, and AI/ML-related AWS/GCP managed services (Vertex AI, SageMaker, etc.).
Deep expertise with the AI/ML ecosystem and tooling such as PyTorch, TensorFlow, Ray, Hugging Face, and experiment/feature/model stores (MLflow, W&B, Feast, etc.).
Proven ability to scale AI/ML workloads and pipelines: pipeline SDKs, feature/model CI/CD, automated evaluation, safe rollouts, and monitoring.
Strong developer-experience mindset: ability to translate research/engineering friction into elegant APIs, templates, and tools that reduce time-to-first-successful remote run and raise platform adoption.

Preferred Skills:

Previous experience with Databricks.
Knowledge of multimodal AI/ML (audio, video, text) data preparation, feature extraction, model development, training, and evaluation workflows.
Experience with LLM/foundation model sizing/estimation, training requirements, pipelines, evaluation workflows, and orchestration and deployment patterns.
Experience designing feature stores or embedding services tightly integrated with training pipelines.