Sr Staff - Cloud IT Platform Engineering
Qualcomm
Job Description
• Platform Strategy & Architecture: Drive the multi-year platform roadmap across EKS maturation and multi-cloud expansion; establish scalable platform patterns for multi-tenant environments including cluster lifecycle, upgrade strategies, and operational readiness; co-own architectural governance as a peer to others in IT.
• Infrastructure as Code: Drive the Terraform platform strategy across module design, state management, multi-environment patterns, and IaC governance; establish reference implementations and guardrails that enable distributed engineering teams to deliver infrastructure consistently and safely at scale.
• Kubernetes Platform Engineering (EKS): Drive end-to-end cluster provisioning and lifecycle management including autoscaling with Karpenter, add-on lifecycle management, and cluster upgrade and rollout strategies; establish organizational standards for cluster operations that scale across multiple teams and environments.
• Cloud Implementation: Own and drive the deployment of platform solutions with a strong emphasis on AWS, Infrastructure as Code, and GitOps-driven workflows.
• GitOps & CI/CD Enablement: Build and maintain GitOps workflows using Argo CD and CI/CD pipelines using GitHub Actions, enabling repeatable, audited delivery patterns.
• Optimization & Performance: Continuously evaluate and tune platform capabilities and services to improve reliability, performance, and cost efficiency for development teams.
• Observability: Build and maintain robust monitoring/alerting and recovery processes for platform services and components leveraging Datadog and Prometheus/Grafana.
• Security & Policy: Implement secure-by-default cluster and workload patterns, including network and policy controls (e.g., Cilium and Kyverno), RBAC, and least-privilege access.
• Design & Collaboration: Contribute to technical design docs and proposals; collaborate with Platform Architects for validation/approval and drive implementation through delivery.
• Operations & On-call: Participate in a required on-call rotation, including potential 24/7 coverage; lead troubleshooting, root cause analysis, and corrective/preventative actions.