TechOps-DE-CloudOps-CLOUD COMPUTING - AWS Infra

Pune | 7 Years Exp | Posted 473d ago

Your key responsibilities

Lead incident response and coordination for AWS infrastructure issues, ensuring timely troubleshooting and resolution.
Act as the primary escalation point for critical incidents that require in-depth analysis and coordination with engineering teams.
Own and execute SOPs and runbooks to manage cloud infrastructure-related requests, issues, and remediation activities.
Review and refine incident handling processes to enhance troubleshooting efficiency within the AHD team.
Conduct log analysis and system diagnostics using various tools and the work notes recorded in the ITSM tool.
Ensure the team provides proper access management and request fulfilment, including IAM role validation, security configurations, and VPC networking support.
Monitor and troubleshoot containerized environments and infrastructure components.
Provide technical mentorship and training for junior engineers, improving incident handling and automation skills.
Work closely with product teams to identify recurring issues, document knowledge base updates, and drive SOP/process standardization.
Participate in shift handovers and governance meetings, ensuring knowledge transfer and clear communication of ongoing issues.
Provide guidance to junior engineers on handling cloud infrastructure issues and following best practices.

Skills and attributes for success

Strong technical leadership and escalation management skills.
Deep expertise in AWS infrastructure operations, including EC2, IAM, VPC, and security groups.
Hands-on experience with Kubernetes (EKS), Helm, and container orchestration.
Strong log analysis and troubleshooting experience using AWS CloudWatch and OpenTelemetry (OTEL).
Experience working with ITSM tools.
Ability to analyse trends, identify recurring issues, and propose automation-driven solutions.
Excellent communication and stakeholder coordination skills to work with product teams.
Experience in refining SOPs, troubleshooting guides, and runbooks for operational efficiency.

To Qualify for the Role, You Must Have

7+ years of experience in cloud infrastructure operations, incident management, and technical support.
Deep understanding of AWS security principles, IAM policies, and encryption mechanisms.
Experience troubleshooting and managing Kubernetes (EKS), Helm, and containerized workloads.
Experience working with ITSM tools.
Strong problem-solving skills with experience in handling major incidents and leading root cause analysis (RCA).
Willingness to work in a 24x7 rotational shift-based support environment.
No location constraints; ability to collaborate with global teams.