Principal Machine Learning Engineer
spglobal
Job Description
Responsibilities and Impact:
-
LLM & Generative AI Engineering: Architect and deploy production-scale LLM systems spanning frontier models (GPT-4 class), open-source models (such as LLaMA, Mistral, Gemma), RAG pipelines, and multi-modal AI systems incorporating text, code, images, and structured data.
-
Agentic AI Systems: Design and operationalize autonomous AI agents with multi-agent orchestration, tool-use capabilities, memory management, and enterprise-grade guardrails and observability strategies.
-
Python & Software Engineering: Write high-performance Python code following SOLID principles, lead code reviews, build reusable AI libraries, and implement rigorous testing and CI/CD practices across all ML workloads.
-
Cloud & Distributed Systems: Architect cloud-native AI infrastructure with GPU cluster management, auto-scaling inference endpoints, vector databases, and cost-optimized distributed systems for high-throughput model serving, leveraging managed AI services (such as Bedrock, Azure OpenAI, Vertex AI) alongside self-hosted deployments (such as vLLM, TGI).
-
Backend APIs & Systems Integration: Design production-grade RESTful and asynchronous APIs (using frameworks such as FastAPI or gRPC) that expose AI capabilities, integrate LLM services with enterprise systems, and own end-to-end performance, reliability, and security from design through production.
-
MLOps & LLMOps: Implement comprehensive ML pipelines covering training through monitoring with tools such as MLflow, Kubeflow, or SageMaker, establish prompt versioning and model governance practices, and instrument production systems with observability across performance and quality metrics.
-
DevOps & Platform Engineering: Embed AI workloads into CI/CD pipelines, champion containerization (such as Docker, Helm) and GitOps workflows, define SRE practices for ML systems, and drive platform standardization for self-service AI capabilities.
-
Organization-Wide AI Transformation: Advise engineering, product and business leadership on AI strategy and build-vs-buy decisions, evaluate third-party tooling, define transformation KPIs, and partner with governance teams to establish responsible AI policies and regulatory frameworks.
Basic Required Qualifications:
-
10+ years of progressive experience, with 8+ years in data science, data analytics, machine learning engineering, or similar roles.
-
Proven ability to translate complex technical concepts for non-technical audiences with clarity and impact.
-
Experience defining technical roadmaps, architecture decision records (ADRs), and engineering standards adopted across multiple teams.
-
History of mentoring senior and mid-level engineers, conducting effective technical interviews, and raising the organizational engineering bar.
-
LLM Frameworks: Extensive knowledge of and experience with tools such as LangChain, LlamaIndex, LangGraph, Hugging Face Transformers, PEFT, vLLM, Ollama, or equivalent production-grade tooling.
-
MLOps Tooling: Extensive knowledge of and experience with tools such as MLflow, SageMaker, Vertex AI, or Kubeflow, with a bias toward automation and repeatability.
-
Cloud Platforms: Deep expertise in cloud platforms such as AWS, GCP, or Azure.
-
Python: Expert-level proficiency including async programming, performance optimization, type systems, packaging, and internal library authorship.
-
Databases & Storage: Experience with vector databases (such as Pinecone, OpenSearch, Chroma), relational databases (such as PostgreSQL), NoSQL stores (such as Redis, DynamoDB), and object storage.
-
Containerization & Orchestration: Expertise with tools such as Docker and Helm.
-
Backend Development: Expertise