Senior ML Platform Architect

zohorecruit

Bangalore, 5 Years Exp, Posted 89d ago

Multi-Cloud Enablement & ML Strategy

Vendor-Agnostic Architecture: Design and implementation of a multi-cloud ML strategy (AWS, GCP, Azure) to prevent vendor lock-in and optimize for global availability and cost efficiency.
Unified Model Orchestration: Architecture of abstract infrastructure layers using Kubernetes to ensure ML training and inference workloads port across different cloud providers without code modification.
Global Connectivity: Establishment of cross-cloud networking and identity standards to maintain a consistent security posture and data access layer across all environments.

MCP (Model Context Protocol) Server Foundation

Contextual Architecture: Building and scaling the MCP Server foundation to decouple AI reasoning from tool execution, improving modularity.
Standardized Integration: Design of universal adapter layers that allow Large Language Models (LLMs) to securely access external databases, APIs, and internal file systems through standardized protocols.
Governance & Discovery: Architecture of centralized discovery services for MCP servers to allow AI agents to dynamically find and invoke capabilities with strict audit trails.

High-Throughput Message Bus & Data Flow

Event-Driven AI Backbone: Design and implementation of a low-latency, high-throughput message bus (e.g., Kafka, Pulsar) to handle real-time data streaming and asynchronous ML pipeline triggers.
Scalable Feature Distribution: Architecture of the backbone for streaming features and model events, ensuring high-volume inference logs and telemetry data are ingested with zero data loss.
System Decoupling: Utilization of the message bus to decouple ML microservices, increasing the horizontal scalability and fault tolerance of the AI platform.

Strengthening the Core Application Layer

Leadership of the Security, Resilience, and Quality of Release chapter:
Security: Implementation of Zero-Trust architecture for AI workloads, including model weight encryption, secure secret management, and protection against adversarial attacks.
Resilience: Design of self-healing systems, multi-region failover strategies, and high-availability ML services to ensure mission-critical uptime.
Quality of Release: Establishment of automated, architecture-level release gates including performance benchmarking, security scanning, and automated canary/blue-green deployment strategies.