Network Service Reliability Advisor
arm
Job Description
Responsibilities:
- Lead advanced troubleshooting and resolution of network incidents spanning LAN, WAN, VPN, SD-WAN, data centers, and cloud networks (AWS, Azure, GCP).
- Drive adoption and integration of AI Ops tools (e.g., Dynatrace, LogicMonitor) to enable proactive anomaly detection, alert correlation, and incident automation.
- Work with engineering and platform teams to expand observability coverage, tune alerting thresholds, and onboard new network services to SRC monitoring.
- Perform deep-dive root cause analyses (RCAs), lead incident reviews, and implement preventive actions to improve service resilience.
- Design and build dashboards, reliability reports, and KPIs (MTTR, latency, packet loss, availability) to improve visibility and decision-making.
- Contribute to network automation initiatives using tools like Ansible and Terraform; develop and maintain intelligent playbooks for remediation workflows.
- Tune and optimize AI/ML models used in telemetry analysis and predictive incident detection.
- Work a shift pattern within a 24/7/365 operating model, and respond independently and flexibly to emergencies or critical issues.
Qualifications:
- Certifications such as Cisco CCNA/CCNP, CompTIA Network+, or equivalent.
- Cisco DevNet certification would be highly advantageous.
- Hands-on experience with network technologies and protocols (TCP/IP, BGP, OSPF, DNS, DHCP, SD-WAN).
- Experience with public cloud networking (AWS, Azure, GCP).
- Familiarity with ITIL and SRE principles (SLI/SLOs, error budgets, incident command).
- Experience integrating AI Ops tools with ITSM systems (e.g., ServiceNow, Jira Service Management).
- Exposure to automation/orchestration tools (Ansible and Terraform).
Required Skills and Experience:
- 3–6 years of hands-on experience in Platform Operations or Infrastructure Support roles.
- Solid experience with observability tools (e.g., Dynatrace, LogicMonitor, Datadog, Splunk) for real-time monitoring, alerting, and diagnostics.
- Proficiency in a scripting or programming language (e.g., Python, Java, .NET, Node.js, Ansible, or JavaScript).
- Practical knowledge of infrastructure automation using Ansible, including writing playbooks.
- Proficient in ticket management via an ITSM platform such as ServiceNow.
- Experience leading incident response, driving service restoration and coordinating root cause analysis.
- Effective communicator within a team with a proactive approach and personal accountability for outcomes.
- Ability to analyze incident patterns and metrics to proactively recommend reliability improvements.