Lead DevOps / Cloud Engineer — Multimodal Search Platform
zigya
Job Description
- 6–10 years of experience in DevOps, Site Reliability Engineering (SRE), or Cloud Platform Engineering roles within high-scale technology environments.
- Strong hands-on expertise with at least one major cloud platform — AWS, GCP, or Azure — including networking, compute, storage, and managed Kubernetes services.
- Deep experience operating production Kubernetes environments at scale, including autoscaling, cluster upgrades, workload orchestration, and resilience design.
- Proven experience implementing Infrastructure as Code using Terraform (preferred) or equivalent tooling.
- Strong understanding of distributed systems reliability, including load balancing, caching strategies, asynchronous queues, and failure recovery patterns.
- Experience designing and managing CI/CD pipelines using modern tooling (GitHub Actions, GitLab CI, ArgoCD, Jenkins, or equivalent).
- Hands-on experience building observability stacks using tools such as Prometheus, Grafana, ELK/OpenSearch, Datadog, or OpenTelemetry.
- Experience supporting GPU workloads and AI inference systems, including containerized model deployment and performance optimization for production ML systems.
- Familiarity with AI model serving frameworks such as Triton Inference Server, vLLM, TGI, or similar platforms is strongly preferred.
- Strong scripting and automation skills (Python, Bash, or Go preferred).
- Solid understanding of networking, security best practices, secrets management, and cloud cost optimization strategies.
- Experience working in fast-moving startup or scale-up environments with high ownership expectations.