TechOps-DE-CloudOps-CLOUD COMPUTING - AWS Infra
EY
Job Description
Your key responsibilities
- Lead incident response and coordination for AWS infrastructure issues, ensuring timely troubleshooting and resolution.
- Act as the primary escalation point for critical incidents that require in-depth analysis and coordination with engineering teams.
- Own and execute SOPs and runbooks to manage cloud infrastructure-related requests, issues, and remediation activities.
- Review and refine incident handling processes to enhance troubleshooting efficiency within the AHD team.
- Conduct log analysis and system diagnostics using various tools and the work notes recorded in the ITSM tool.
- Ensure the team provides proper access management and request fulfilment, including IAM role validation, security configurations, and VPC networking support.
- Monitor and troubleshoot containerized environments and infrastructure components.
- Provide technical mentorship and training for junior engineers, improving incident handling and automation skills.
- Work closely with product teams to identify recurring issues, document knowledge base updates, and drive SOP/process standardization.
- Participate in shift handovers and governance meetings, ensuring knowledge transfer and clear communication of ongoing issues.
- Guide junior engineers in handling cloud infrastructure issues and applying best practices.
Skills and attributes for success
- Strong technical leadership and escalation management skills.
- Deep expertise in AWS infrastructure operations, including EC2, IAM, VPC, and security groups.
- Hands-on experience with Kubernetes (EKS), Helm, and container orchestration.
- Strong log analysis and troubleshooting experience using AWS CloudWatch and OpenTelemetry (OTEL).
- Experience working with ITSM tools.
- Ability to analyse trends, identify recurring issues, and propose automation-driven solutions.
- Excellent communication and stakeholder coordination skills to work with product teams.
- Experience in refining SOPs, troubleshooting guides, and runbooks for operational efficiency.
To qualify for the role, you must have
- 7+ years of experience in cloud infrastructure operations, incident management, and technical support.
- Deep understanding of AWS security principles, IAM policies, and encryption mechanisms.
- Experience troubleshooting and managing Kubernetes (EKS), Helm, and containerized workloads.
- Experience working with ITSM tools.
- Strong problem-solving skills with experience in handling major incidents and leading root cause analysis (RCA).
- Willingness to work in a 24x7 rotational shift-based support environment.
- No location constraints; ability to collaborate with global teams.