Lead Digital Engineer
darwinbox
Job Description
Roles & Responsibilities
EKS Infrastructure Ownership
• Own end-to-end design, provisioning, and management of Amazon EKS clusters using Terraform.
• Define and maintain node group strategies including managed node groups, Fargate profiles, and spot/on-demand mix for cost optimization.
• Manage EKS upgrades, control plane configurations, and Kubernetes version lifecycle.
• Implement Cluster Autoscaler and Karpenter for dynamic workload scaling.
• Design multi-environment (Dev/Staging/Prod) EKS architectures with strong environment isolation.
CI/CD Pipeline Engineering
• Design, build, and maintain CI/CD pipelines using GitHub Actions or Jenkins for automated build, test, and deployment workflows.
• Implement deployment strategies including blue-green, canary, and rolling deployments to ensure zero-downtime releases.
• Integrate pipeline quality gates with security scanning (SAST/DAST), container image scanning, and policy compliance checks.
• Develop automated rollback mechanisms and deployment validation frameworks.
• Standardize pipeline templates and reusable workflow libraries across engineering teams.
Infrastructure as Code (IaC)
• Author and maintain Terraform modules for all AWS infrastructure, including VPCs, EKS, IAM, S3, ECR, and RDS.
• Enforce IaC standards, module versioning, and Terraform state management using remote backends (S3 + DynamoDB).
• Implement drift detection mechanisms to continuously validate live infrastructure against IaC definitions.
• Manage Helm chart development and lifecycle for microservices deployments on EKS.
Security & Compliance
• Design and enforce least-privilege IAM policies, IRSA (IAM Roles for Service Accounts), and service mesh security policies.
• Manage secrets using AWS Secrets Manager and Parameter Store, integrated with Kubernetes workloads.
• Implement network security using VPC security groups, NACLs, and Kubernetes Network Policies.
• Drive infrastructure security compliance, vulnerability remediation, and audit readiness.
Observability & Incident Response
• Build and maintain observability stacks using Prometheus, Grafana, and OpenTelemetry for metrics, logs, and distributed tracing.
• Define SLIs, SLOs, and alerting thresholds for production Kubernetes workloads.
• Lead incident response, root cause analysis (RCA), and post-mortem processes for infrastructure events.
• Implement auto-remediation for common failure patterns to reduce MTTR.
Cost Optimization & Capacity Planning
• Continuously analyze AWS spend and implement right-sizing, Reserved Instance, and Savings Plans strategies.
• Build cost attribution frameworks with tagging standards and chargeback models.
• Forecast capacity requirements based on business growth and workload patterns.
Collaboration & Mentorship
• Serve as the primary DevOps point of contact for product engineering teams, guiding infrastructure design decisions.
• Mentor junior and mid-level DevOps engineers, establishing best practices and runbook documentation.