Lead Technology Specialist (Lead Site Reliability Engineer)
Caterpillar
Job Description
What you will do
- Provision, configure, and maintain Kubernetes clusters on on‑premises infrastructure (bare metal or virtualized) and in AWS (e.g., EKS).
- Implement and manage Infrastructure as Code (IaC) and automated workflows for cluster creation, upgrades, and application deployments (e.g., Terraform, Ansible, Helm, Git‑based pipelines).
- Establish and operate comprehensive observability (metrics, logs, traces), including SLI/SLO definitions, alerting, dashboards, and runbooks for platform and key services.
- Monitor environment health (control plane and node components), capacity, performance, and cost; perform tuning and right‑sizing across on‑prem and cloud.
- Execute bug triage: reproduce issues, collect diagnostics, perform root‑cause analysis, and coordinate fixes with platform/application teams and vendors.
- Lead incident response for reliability events (degradations, outages), post‑incident reviews, and preventive actions.
- Administer Kubernetes security controls (RBAC, network policies, secrets management, image signing/scanning), certificate management, and compliance control implementation.
- Manage platform services (container registry, ingress controllers, CNI, storage classes/CSI, service mesh where applicable).
- Implement backup/restore and disaster recovery strategies for clusters and stateful workloads (e.g., Velero), and validate them regularly.
- Maintain and improve CI/CD workflows integrating testing, policy checks, and progressive delivery for platform and shared services.
- Create and maintain operational documentation: standards, diagrams, runbooks, automation playbooks, and knowledge base articles.
- Collaborate with networking, security, and application teams to ensure reliability, performance, and secure connectivity across data centers and AWS.
- Drive continuous improvement: reliability engineering practices, toil reduction, automation, and change management processes.
What you will have
- Kubernetes administration and operations on on‑premises and AWS environments (cluster lifecycle, upgrades, node management, workload scheduling).
- Infrastructure as Code, automation, and Git‑based CI/CD.
- Observability stacks and tooling (e.g., Prometheus, Grafana, Alertmanager, OpenTelemetry; ELK/Loki‑class logging).
- Linux systems administration (container runtime, networking, storage).
- Networking fundamentals applied to Kubernetes (CNI, DNS, Ingress/Load Balancing, TLS/cert management, basic L3/L4 concepts).
- Security best practices (RBAC, pod security standards, network policies, image scanning, secrets management).
- Experience with incident response, on‑call participation, and root‑cause analysis in production environments.
- Strong documentation and communication skills; ability to work effectively with geographically distributed teams.
What top candidates will also have
- Experience with service mesh (e.g., Istio/Linkerd) and advanced container networking (e.g., eBPF‑based data paths, network policy engines).
- Familiarity with backup/DR tooling for Kubernetes (e.g., Velero) and stateful workload recovery.
- Exposure to Operational Technology (OT) or edge/remote site constraints and ruggedized deployments.
- Experience with configuration compliance, policy‑as‑code (e.g., Open Policy Agent), and supply‑chain security.
- Knowledge of platform registry operations, image lifecycle, and vulnerability management.
This position requires the candidate to work a 5-day-a-week schedule in the office.