DevOps / SRE Lead
ripplehire
Job Description
Key Responsibilities
• Design, build, and operate cloud-native infrastructure on Azure and on-premise data center using infrastructure-as-code principles (Ansible, Terraform).
• Architect and manage Kubernetes (AKS / self-managed) clusters at scale; enforce GitOps workflows
• Drive adoption of Platform Engineering practices - build internal developer platforms (IDPs) leveraging Backstage or equivalent to reduce cognitive load on dev teams.
• Manage and optimise container image lifecycle, registries (ACR, ECR), and multi-environment deployment strategies (blue-green, canary, rolling).
• Implement full-stack observability using the OpenTelemetry standard - metrics (Prometheus / Thanos), logs (Loki / EFK / OpenSearch), and traces (Jaeger, Tempo).
• Build and maintain Grafana dashboards, runbooks, and SLI/SLO frameworks; drive error-budget culture with tech/product teams.
• Lead incident response and continuous reliability improvement.
• Embed security into every layer: network policies, RBAC, OPA/Gatekeeper policies in Kubernetes, image signing (Cosign/Notary).
• Manage secrets hygiene, certificate lifecycle (cert-manager), and cloud IAM with least-privilege principles.
• Ensure compliance alignment (SOC 2, PCI-DSS awareness) for production workloads.
• Operate and optimise event-streaming infrastructure - Apache Kafka, NATS, or RabbitMQ.
• Support database reliability for PostgreSQL, MSSQL, MongoDB, Redis; coordinate DBA activities for backups, failover, and performance.
• Collaborate closely with application development, QA, product, and security teams to align DevOps strategy with business goals.
• Manage vendor and tool evaluations, present infrastructure roadmaps to technical leadership.
Skills
• Deep hands-on experience with Azure (AKS, App Service, Azure Monitor, Key Vault, Azure DNS).
• Proficiency in Terraform (modules, remote state, workspaces) and Ansible for IaC.
• Strong Linux systems administration; networking fundamentals (DNS, TLS, load balancers, WAF, CDN - Akamai).
• Expert-level Kubernetes operations: Helm chart authoring, Kustomize, admission webhooks, HPA, cluster upgrades, multi-tenancy patterns.
• Experience with service mesh (Istio, Linkerd) traffic shaping, and observability.
• Container runtime security (Falco, Snyk) and registry management.
• Proficiency in at least two of: Python, Go, Bash - for tooling, automation, and custom operators/controllers.
• Experience writing Kubernetes operators or CRDs is a strong plus.
• Production experience with: Prometheus + manager + Thanos/Cortex, Grafana, Loki, OpenTelemetry Collector, Jaeger or Tempo.