Senior ML Platform Architect

zohorecruit

Bangalore, 5 Years Experience, Posted 40 days ago

Job Description

Multi-Cloud Enablement & ML Strategy

  • Vendor-Agnostic Architecture: Design and implementation of a multi-cloud ML strategy (AWS, GCP, Azure) to prevent vendor lock-in and optimize for global availability and cost efficiency.
  • Unified Model Orchestration: Architecture of abstract infrastructure layers using Kubernetes to ensure ML training and inference workloads port across different cloud providers without code modification.
  • Global Connectivity: Establishment of cross-cloud networking and identity standards to maintain a consistent security posture and data access layer across all environments.

 

MCP (Model Context Protocol) Server Foundation

 

  • Contextual Architecture: Building and scaling the MCP Server foundation, enabling the decoupling of AI reasoning from tool execution for enhanced modularity.
  • Standardized Integration: Design of universal adapter layers that allow Large Language Models (LLMs) to securely access external databases, APIs, and internal file systems through standardized protocols.
  • Governance & Discovery: Architecture of centralized discovery services for MCP servers to allow AI agents to dynamically find and invoke capabilities with strict audit trails.

 

High-Throughput Message Bus & Data Flow

 

  • Event-Driven AI Backbone: Design and implementation of a low-latency, high-throughput message bus (e.g., Kafka, Pulsar) to handle real-time data streaming and asynchronous ML pipeline triggers.
  • Scalable Feature Distribution: Architecture of the backbone for streaming features and model events, ensuring high-volume inference logs and telemetry data are ingested with zero data loss.
  • System Decoupling: Utilization of the message bus to decouple ML microservices, increasing the horizontal scalability and fault tolerance of the AI platform.

 

Strengthening the Core Application Layer

 

  • Leadership of the Security, Resilience, and Quality of Release chapter:
      • Security: Implementation of Zero-Trust architecture for AI workloads, including model weight encryption, secure secret management, and protection against adversarial attacks.
      • Resilience: Design of self-healing systems, multi-region failover strategies, and high-availability ML services to ensure mission-critical uptime.
      • Quality of Release: Establishment of automated, architecture-level release gates including performance benchmarking, security scanning, and automated canary/blue-green deployment strategies.
