Sr. Site Reliability Engineer
HashiCorp
Job Description
- Implement best practices for system reliability, including proactive identification of potential failure points and the development of automated mitigations.
- Design and execute comprehensive load testing strategies to identify performance bottlenecks and scalability limits across our cloud products.
- Implement best practices and technologies to improve system resilience, ensuring high availability and fault tolerance.
- Work closely with engineering and product teams to integrate operational readiness into the development lifecycle, enhancing product stability and user satisfaction.
- Build and refine tools and frameworks for automated testing, environment simulation, and incident reproduction, reducing manual effort and increasing test coverage.
- Conduct in-depth analysis of testing results, documenting findings and making actionable recommendations for system enhancements.
- Drive systemic improvements to the products by introducing chaos testing and partnering with product development teams.
- Share your knowledge and expertise with team members, fostering a culture of learning and continuous improvement.
- Develop and implement disaster recovery and backup strategies to ensure data integrity and system resilience.
Ideal Candidate
- 5+ years of experience in SRE, systems engineering, or non-functional testing roles with a focus on operational readiness, performance testing, or system scalability.
- Experience in driving systemic improvements through chaos engineering practices.
- Programming skills in a high-level programming or scripting language.
- Proven track record of leading successful load testing and performance optimization initiatives in cloud and on-prem environments.
- Experience in creating and managing test environments for automated testing.
- Strong grasp of CI/CD fundamentals and of maintaining high-quality pipelines.
- Experience with version control systems (e.g., Git) and agile project management methodologies.
- Understanding of monitoring and alerting systems, with the ability to develop metrics and alarms that accurately reflect system health and operational risks.
- Strong technical foundation in cloud technologies (AWS, Azure, or GCP) and container technologies like Nomad or Kubernetes.
- Strong experience with performance testing tools such as k6, Artillery, Vegeta, or Locust.
- Effective communication and collaboration skills, capable of working with cross-functional teams and articulating technical concepts to diverse audiences.
- Familiarity with HashiCorp products and tools is a plus.
- Exposure to the disaster recovery domain is a plus.
#LI-Hybrid