Senior Cloud Monitoring Administrator
zs
Job Description
What You’ll Do:
- Own and lead high-impact incident management (P1/P2) processes end-to-end.
- Facilitate incident bridges and war rooms with cross-functional teams (Cloud Compute, Network, Security, Cloud Ops).
- Coordinate with global stakeholders, vendors, and leadership for real-time updates and escalations.
- Maintain real-time communication on Teams/ServiceNow and through structured email updates.
- Conduct in-depth Post-Incident Reviews (PIR) and ensure follow-ups via Problem Management.
- Track incident metrics (MTTR, SLA breaches, recurrence), analyze trends, and recommend improvements.
- Partner with engineering and automation teams to enhance observability and proactive detection.
- Standardize and enhance ITOC’s incident response processes based on ITIL best practices.
- Drive improvements in incident communication protocols, documentation, and playbooks.
- Mentor junior engineers (L1/L2) in handling escalations and developing response skills.
- Partner with the Observability, Cloud Compute, Network, Security, and Cloud Ops teams to enable integrated monitoring and alerting.
- Collaborate with application and business teams to minimize business disruption and align resolution priorities.
- Participate in the Change Advisory Board (CAB) to mitigate incident risks from changes.
What You’ll Bring:
- 4+ years of experience in IT Operations/Incident Management roles, with at least 2–3 years handling global environments.
- Prior experience in a Consultant/L3 capacity in a matrixed or client-facing IT environment.
- Strong expertise in handling hybrid infrastructure (AWS + On-prem) incidents.
- Proven success in independently leading major incidents and stakeholder management.
- Tools: ServiceNow, JIRA, SolarWinds, Splunk, AWS CloudWatch, PagerDuty/Uptrends, Teams.
- Cloud: Working knowledge of networking technologies, VMware, and AWS (EC2, RDS, Route 53, ELB, VPC, etc.).
- Concepts: ITIL (Incident, Problem, Change), Monitoring & Alerting, Automation basics (preferred)
- Certifications: ITIL 4 Foundation (required); AWS Certified Cloud Practitioner or higher (preferred)
- Clear and timely communication during critical scenarios.
- Strong decision-making and accountability under pressure.
- Ability to influence cross-functional teams without direct authority.
- Structured thinking with an eye for continuous service improvement.
- Willingness to work in a 24x7 support environment (via on-call rotation).