Staff Site Reliability Engineer
Okta
Job Description
Skills
- Exceptional communication skills, including technical writing in English
- Systematic problem-solving approach, coupled with a strong sense of ownership and drive
- Understanding of microservices, cloud infrastructure (AWS, Azure), databases (SQL, NoSQL, key-value), containers (Docker, Kubernetes), web technologies (WebSockets, HTTP), and networking (SSL, routing, VPN)
- Lives and breathes SLIs, SLOs, error budgets, and SLAs
- Strong belief in automating everything and reducing toil for yourself and teammates
- Loves working as part of a team, but can also work effectively in a remote environment where tasks may be self-driven
Responsibilities
- Work with other teams to run, own, and improve incident response processes
- Participate in regular on-call rotations to ensure 24/7 coverage of all critical systems
- Use existing monitoring tools to identify problems, then resolve them or escalate to service teams
- Implement changes to enable or improve infrastructure resilience, monitoring, and alerting
Experience
- 7+ years as a Site Reliability Engineer or in a Cloud Operations/DevOps role
- 6+ years using Go, shell scripting, and Terraform
- 2+ years as a software developer in a SaaS environment
- 4+ years in a production environment supporting large-scale, mission-critical applications