Lead DevOps Engineer
Restroworks
Job Description
-
Design, implement, and manage scalable, secure, and highly available cloud infrastructure.
-
Build and optimize CI/CD pipelines to improve deployment efficiency and release reliability.
-
Automate infrastructure provisioning and configuration management using Infrastructure as Code (IaC) tools.
-
Monitor system performance, uptime, and reliability across production environments.
-
Lead DevOps best practices around deployment, observability, incident management, and disaster recovery.
-
Collaborate with engineering teams to improve application scalability, availability, and performance.
-
Manage Kubernetes clusters, containerized applications, and orchestration environments.
-
Implement security best practices across infrastructure and deployment pipelines.
-
Drive cost optimization initiatives for cloud infrastructure and services.
-
Mentor junior DevOps engineers and contribute to building a high-performance engineering culture.
-
Participate in on-call rotations and production incident resolution.
-
Design and manage Kafka/MSK-based event streaming systems for high-throughput microservices communication.
-
Support scalable distributed architectures handling billions of transactions/messages monthly.
-
Implement observability and monitoring solutions using Prometheus, Grafana, CloudWatch, Datadog, New Relic, Site24x7, and related tools.
-
Lead incident response, root cause analysis (RCA), platform uptime initiatives, and SRE best practices.
-
Implement DevSecOps and cloud security best practices using tools such as Wiz, Lacework, Snyk, SonarQube, JFrog Xray, AWS native security services, and F5.
-
Conduct regular infrastructure security audits, vulnerability management, governance reviews, and compliance checks. Experience with ISO, SOC 1, and SOC 2 audits is an added advantage.
-
Manage infrastructure for data migration initiatives using AWS DMS and related technologies.
-
Support data pipelines, integrations, and enterprise migration projects.
-
Drive AI-enabled DevOps automation for environment provisioning, log analysis, incident triaging, and operational efficiency improvements.