Lead Technology Specialist (Lead Site Reliability Engineer)

Caterpillar

Bangalore NM Years Exp Posted 65d ago

Job Description

What you will do

Provision, configure, and maintain Kubernetes clusters on on‑premises infrastructure (bare metal or virtualized) and in AWS (e.g., EKS).
Implement and manage Infrastructure as Code (IaC) and automated workflows for cluster creation, upgrades, and application deployments (e.g., Terraform, Ansible, Helm, Git‑based pipelines).
Establish and operate comprehensive observability (metrics, logs, traces), including SLI/SLO definitions, alerting, dashboards, and runbooks for platform and key services.
Monitor environment health (control plane and node components), capacity, performance, and cost; perform tuning and right‑sizing across on‑prem and cloud.
Execute bug triage: reproduce issues, collect diagnostics, perform root‑cause analysis, and coordinate fixes with platform/application teams and vendors.
Lead incident response for reliability events (degradations, outages), post‑incident reviews, and preventive actions.
Administer Kubernetes security controls (RBAC, network policies, secrets management, image signing/scanning), certificate management, and compliance control implementation.
Manage platform services (container registry, ingress/controllers, CNI, storage classes/CSI, service mesh where applicable).
Implement backup/restore and disaster recovery strategies for clusters and stateful workloads (e.g., Velero), and validate them regularly.
Maintain and improve CI/CD workflows integrating testing, policy checks, and progressive delivery for platform and shared services.
Create and maintain operational documentation: standards, diagrams, runbooks, automation playbooks, and knowledge base articles.
Collaborate with networking, security, and application teams to ensure reliability, performance, and secure connectivity across data centers and AWS.
Drive continuous improvement: reliability engineering practices, toil reduction, automation, and change management processes.

What you will have

Kubernetes administration and operations on on‑premises and AWS environments (cluster lifecycle, upgrades, node management, workload scheduling).
Infrastructure as Code, automation, and Git‑based CI/CD.
Observability stacks and tooling (e.g., Prometheus, Grafana, Alertmanager, OpenTelemetry; ELK/Loki‑class logging).
Linux systems administration (container runtime, networking, storage).
Networking fundamentals applied to Kubernetes (CNI, DNS, Ingress/Load Balancing, TLS/cert management, basic L3/L4 concepts).
Security best practices (RBAC, pod security standards, network policies, image scanning, secrets management).
Experience with incident response, on‑call participation, and root‑cause analysis in production environments.
Strong documentation and communication skills; ability to work effectively with geographically distributed teams.

Top Candidates Will Also Have:

Experience with service mesh (e.g., Istio/Linkerd) and advanced container networking (e.g., eBPF‑based data paths, network policy engines).
Familiarity with backup/DR tooling for Kubernetes (e.g., Velero) and stateful workload recovery.
Exposure to Operational Technology (OT) or edge/remote site constraints and ruggedized deployments.
Experience with configuration compliance, policy‑as‑code (e.g., Open Policy Agent), and supply‑chain security.
Knowledge of platform registry operations, image lifecycle, and vulnerability management.
This position requires the candidate to work a 5-day-a-week schedule in the office.
