Senior Infrastructure Engineer - GenAI
citi
Job Description
Key Responsibilities:
- Design and implement scalable backend services and APIs for generative AI applications using microservices architecture and cloud-native patterns.
- Build and maintain model serving infrastructure with load balancing, auto-scaling, caching, and failover capabilities for high-availability AI services.
- Deploy and orchestrate containerized AI workloads using Docker, Kubernetes, ECS, and OpenShift across development, staging, and production environments.
- Develop serverless AI functions using AWS Lambda, ECS Fargate, and other cloud services for scalable, cost-effective inference.
- Implement robust CI/CD pipelines for automated deployment of AI services, including model versioning and gradual rollout strategies.
- Create comprehensive monitoring, logging, and alerting systems for AI service performance, reliability, and cost optimization.
- Integrate with various LLM APIs (OpenAI, Anthropic, Google) and open-source models, implementing efficient batching and optimization techniques.
- Build data pipelines for training data preparation, model fine-tuning workflows, and real-time streaming capabilities.
- Ensure adherence to security best practices, including authentication, authorization, API rate limiting, and data encryption.
- Collaborate with AI researchers and product teams to translate AI capabilities into production-ready backend services.
Required Technical Skills:
- Strong experience with backend development in Python, and familiarity with Go, Node.js, or Java for building scalable web services and APIs.
- Hands-on experience with containerization using Docker and with orchestration platforms, including Kubernetes, OpenShift, and AWS ECS, in production environments.
- Proficiency with cloud infrastructure, particularly AWS services (Lambda, ECS, EKS, S3, RDS, ElastiCache) and serverless architectures.
- Experience with CI/CD pipelines using Jenkins, GitLab CI, GitHub Actions, or similar tools, including Infrastructure as Code with Terraform or CloudFormation.
- Strong knowledge of databases, including PostgreSQL, MongoDB, and Redis, and experience with vector databases for AI applications.
- Familiarity with message queues (RabbitMQ, Apache Kafka, AWS SQS/SNS) and event-driven architectures.
- Experience with monitoring and observability tools such as Prometheus, Grafana, Datadog, or equivalent platforms.
- Knowledge of AI/ML model serving frameworks such as MLflow, Kubeflow, TensorFlow Serving, or Triton Inference Server.
- Understanding of API design principles, load balancing, caching strategies, and performance optimization techniques.
- Experience with microservices architecture, distributed systems, and high-traffic, low-latency applications.