Senior ML Platform Architect
Job Description
Multi-Cloud Enablement & ML Strategy
- Vendor-Agnostic Architecture: Design and implementation of a multi-cloud ML strategy (AWS, GCP, Azure) to prevent vendor lock-in and optimize for global availability and cost efficiency.
- Unified Model Orchestration: Architecture of abstract infrastructure layers using Kubernetes to ensure ML training and inference workloads port across different cloud providers without code modification.
- Global Connectivity: Establishment of cross-cloud networking and identity standards to maintain a consistent security posture and data access layer across all environments.
MCP (Model Context Protocol) Server Foundation
- Contextual Architecture: Building and scaling the MCP Server foundation, enabling the decoupling of AI reasoning from tool execution for enhanced modularity.
- Standardized Integration: Design of universal adapter layers that allow Large Language Models (LLMs) to securely access external databases, APIs, and internal file systems through standardized protocols.
- Governance & Discovery: Architecture of centralized discovery services for MCP servers to allow AI agents to dynamically find and invoke capabilities with strict audit trails.
High-Throughput Message Bus & Data Flow
- Event-Driven AI Backbone: Design and implementation of a low-latency, high-throughput message bus (e.g., Kafka, Pulsar) to handle real-time data streaming and asynchronous ML pipeline triggers.
- Scalable Feature Distribution: Architecture of the backbone for streaming features and model events, ensuring high-volume inference logs and telemetry data are ingested with zero data loss.
- System Decoupling: Utilization of the message bus to decouple ML microservices, increasing the horizontal scalability and fault tolerance of the AI platform.
Strengthening the Core Application Layer
- Leadership of the Security, Resilience, and Quality of Release chapter:
  - Security: Implementation of Zero-Trust architecture for AI workloads, including model weight encryption, secure secret management, and protection against adversarial attacks.
  - Resilience: Design of self-healing systems, multi-region failover strategies, and high-availability ML services to ensure mission-critical uptime.
  - Quality of Release: Establishment of automated, architecture-level release gates including performance benchmarking, security scanning, and automated canary/blue-green deployment strategies.