GCP Infrastructure Engineer
UPS
Job Description
Cloud Infrastructure & Platform Engineering
- Design, provision, and maintain scalable, secure, and cost-efficient infrastructure for GenAI applications on GCP.
- Deploy and manage containerized workloads using Docker and Kubernetes (GKE).
- Configure and optimize Vertex AI and IBM Watsonx platforms for training, fine-tuning, and serving LLMs and other generative models.
- Implement high-performance GPU/TPU clusters to support distributed training and large-scale inference.
- Ensure business continuity through backup, disaster recovery, and multi-region deployments.
Automation & Reliability
- Develop and maintain Infrastructure as Code (IaC) templates with Terraform or Cloud Deployment Manager.
- Adopt GitOps practices (Flux) for infrastructure lifecycle management.
- Build and optimize CI/CD pipelines for data pipelines, model workflows, and GenAI applications.
- Apply SRE principles (SLIs, SLOs, SLAs) to guarantee platform reliability and uptime.
Security, Governance & Compliance
- Embed DevSecOps best practices across the infrastructure lifecycle, including policy-as-code, vulnerability scanning, and secrets management.
- Enforce identity and access management (IAM), network segmentation, and data encryption in compliance with standards (HIPAA, SOX, GDPR, FedRAMP).
- Collaborate with enterprise security and compliance teams to implement governance frameworks for GenAI platforms.
Monitoring, Observability & Cost Optimization
- Implement observability stacks (Prometheus, Grafana, Cloud Monitoring, Datadog) for both infrastructure health and ML-specific metrics (model drift, data anomalies).
- Define KPIs to monitor system health, performance, and adoption across AI workloads.
- Optimize cloud cost efficiency for GPU/TPU-intensive workloads using autoscaling, preemptible instances, and utilization monitoring.
Collaboration & Enablement
- Partner with data scientists, ML engineers, and software teams to streamline GenAI application development and deployment.
- Provide onboarding, documentation, and reusable templates to enable faster adoption of AI infrastructure.