Senior Machine Learning Engineer
Expedia Group
Job Description
In this role, you will:
- Design and own high-throughput, low-latency ML systems (2000+ RPS) for TravelAds, including multi-service training and serving architectures, auction and ranking models, and real-time inference services that meet strict sub-100ms SLAs.
- Build and evolve ML infrastructure and data foundations – feature stores, online/offline feature pipelines, embedding and vector services, and data lineage and versioning – that power ad relevance, bidding optimization, experimentation, and model evaluation at scale.
- Accelerate the end-to-end ML lifecycle by automating training, validation, deployment, shadow testing, A/B testing, and retraining using orchestrated workflows (e.g., Flyte, Airflow) and robust quality gates.
- Develop agentic AI and LLM/RAG-powered workflows that automate ML operations (training, deployment, validation, monitoring, calibration) and enable AI-assisted dataset creation, operational analysis, and decision support.
- Define and implement ML observability, reliability, and cost guardrails through drift and feature-freshness monitoring, health dashboards, SLO/SLI definitions, incident response, and resilience-focused improvements.
- Safely integrate and operate AI/ML-enabled solutions that improve outcomes, while setting technical direction, mentoring MLEs to operate independently, and leading cross-team initiatives that elevate ML engineering practices and business impact.
Minimum Qualifications:
- Bachelor’s degree in Computer Science or a related technical field, or equivalent related professional experience.
- 8+ years of relevant professional experience.
- Proven track record of designing, building, and operating production ML or large-scale distributed systems, including system design (HLD/LLD), serving stacks, monitoring and observability, rollbacks, and operational rigor.
- Strong software engineering foundation in Python and at least one of Java/Kotlin/Scala, with deep understanding of distributed systems, data structures, and performance optimization.
- Experience leading technical design for multi-quarter ML projects and partnering with Product and business stakeholders to define problems, make clear trade-offs, and measure the business impact of ML systems.
Preferred Qualifications:
- Experience with real-time ML inference at high throughput (1000+ RPS) and strict latency SLAs.
- Expertise with big data technologies such as Spark, Hive, and Databricks and workflow orchestration tools such as Airflow and Flyte, as well as cloud-native ML platforms and infrastructure (e.g., AWS SageMaker, EKS, EMR, Docker).
- Experience building ML lifecycle automation – CI/CD for ML, automated training pipelines, deployment orchestration, and robust data lineage and versioning – plus ML observability systems including drift detection, feature-freshness monitoring, model health dashboards, and offline/online parity validation.
- Track record of leading incident response and root cause analysis for ML or other mission-critical services, and driving sustained improvements in reliability, resilience, and operational excellence.
- Familiarity with AI-driven systems, tools, or workflows and applying AI/ML concepts to improve real-world products and engineering outcomes, including experience with LLM productionization, RAG architectures, or agentic AI workflows in high-scale environments.