Sr. Engineer II - Incident Excellence
HashiCorp
Job Description
In this role, you can expect to:
- Own and drive incident management capabilities and culture.
- Contribute to the incident command on-call rotation.
- Build technical skills and relationships within a team of engineers and SREs.
- Lead and refine our incident response strategy, ensuring rapid and effective response to operational disruptions.
- Analyze incident trends and root causes to drive continuous improvements in system reliability and response processes.
- Develop and maintain tools for incident detection, analysis, and resolution, automating responses where possible to minimize human intervention.
- Create comprehensive incident response documentation and conduct training sessions to prepare all relevant teams for effective incident handling.
- Work closely with development, operations, and security teams to coordinate incident response efforts and post-incident analyses.
You may be a good fit for our team if:
- You have a minimum of 10 to 12 years of experience in site reliability engineering, systems administration, or software engineering, with a significant focus on incident response and operational reliability.
- 8+ years managing, coordinating, and ensuring resolution of major incidents.
- Professional experience with incident management in cloud environments.
- You enjoy working on a variety of scopes spanning software engineering, cloud infrastructure, and SRE.
- Proven track record of managing and resolving incidents in cloud-based environments, with expertise in major public cloud platforms (AWS, GCP, Azure).
- Understanding of fundamental network technologies such as DNS, load balancing, SSL, TCP/IP, and HTTP.
- Strong understanding of monitoring and alerting systems, with the ability to develop metrics and alarms that accurately reflect system health and operational risks.
- Experience with incident management tools and practices, including post-mortem analysis and root cause investigation.
- Passion for consistently responding to and leading complex incidents in a 24x7x365 environment using a global follow-the-sun model.
- Customer-centric attitude with a focus on providing best-in-class incident response for customers and stakeholders.
- Familiarity with HashiCorp’s product suite and infrastructure automation tools is a plus.
- Strong leadership skills during periods of significant business impact, remaining calm and professional in high-pressure situations.
- A strong desire to drive customer success with partner teams and management on high-profile issues critical to the long-term success of the business.
- Outstanding verbal and written communication skills, with the ability to convey information in a meaningful way to both engineers and executive-level management, during and outside of incidents.
- Adaptable to a wide variety of technologies and capable of incident response and troubleshooting in complex, interconnected environments.