Lead Technology Specialist (Lead Site Reliability Engineer)

Caterpillar

Bangalore NM Years Exp Posted 17d ago

Job Description

What you will do

  • Provision, configure, and maintain Kubernetes clusters on on‑premises infrastructure (bare metal or virtualized) and in AWS (e.g., EKS).
  • Implement and manage Infrastructure as Code (IaC) and automated workflows for cluster creation, upgrades, and application deployments (e.g., Terraform, Ansible, Helm, Git‑based pipelines).
  • Establish and operate comprehensive observability (metrics, logs, traces), including SLI/SLO definitions, alerting, dashboards, and runbooks for platform and key services.
  • Monitor environment health (control plane and node components), capacity, performance, and cost; perform tuning and right‑sizing across on‑prem and cloud.
  • Execute bug triage: reproduce issues, collect diagnostics, perform root‑cause analysis, and coordinate fixes with platform/application teams and vendors.
  • Lead incident response for reliability events (degradations, outages), post‑incident reviews, and preventive actions.
  • Administer Kubernetes security controls (RBAC, network policies, secrets management, image signing/scanning), certificate management, and compliance control implementation.
  • Manage platform services (container registry, ingress/controllers, CNI, storage classes/CSI, service mesh where applicable).
  • Implement backup/restore and disaster recovery strategies for clusters and stateful workloads (e.g., Velero), and validate them regularly.
  • Maintain and improve CI/CD workflows integrating testing, policy checks, and progressive delivery for platform and shared services.
  • Create and maintain operational documentation: standards, diagrams, runbooks, automation playbooks, and knowledge base articles.
  • Collaborate with networking, security, and application teams to ensure reliability, performance, and secure connectivity across data centers and AWS.
  • Drive continuous improvement: reliability engineering practices, toil reduction, automation, and change management processes.

What you will have

  • Kubernetes administration and operations on on‑premises and AWS environments (cluster lifecycle, upgrades, node management, workload scheduling).
  • Infrastructure as Code, automation, and Git‑based CI/CD.
  • Observability stacks and tooling (e.g., Prometheus, Grafana, Alertmanager, OpenTelemetry; ELK/Loki‑class logging).
  • Linux systems administration (container runtime, networking, storage).
  • Networking fundamentals applied to Kubernetes (CNI, DNS, Ingress/Load Balancing, TLS/cert management, basic L3/L4 concepts).
  • Security best practices (RBAC, pod security standards, network policies, image scanning, secrets management).
  • Experience with incident response, on‑call participation, and root‑cause analysis in production environments.
  • Strong documentation and communication skills; ability to work effectively with geographically distributed teams.

Top Candidates Will Also Have:

  • Experience with service mesh (e.g., Istio/Linkerd) and advanced container networking (e.g., eBPF‑based data paths, network policy engines).
  • Familiarity with backup/DR tooling for Kubernetes (e.g., Velero) and stateful workload recovery.
  • Exposure to Operational Technology (OT) or edge/remote site constraints and ruggedized deployments.
  • Experience with configuration compliance, policy‑as‑code (e.g., Open Policy Agent), and supply‑chain security.
  • Knowledge of platform registry operations, image lifecycle, and vulnerability management.
  • This position requires the candidate to work a 5-day-a-week schedule in the office.
