Site Reliability Engineer
Okta
Job Description
- Collaborate with engineering teams to improve availability, reliability, and observability of their services.
- Participate in regular on-call rotations to ensure 24/7 coverage of all critical systems
- Use existing monitoring tools to identify problems, then resolve them or escalate to service teams
- Implement changes to enable or improve infrastructure resilience, monitoring, and alerting
- Develop and continuously refine SRE tools and processes to improve software delivery, observability, reliability, and operational efficiency.
- Code, script, and develop daily in Go, Terraform, Helm, etc.
- Optimize existing systems and eliminate toil through simplification and automation.
- Define, document, and advocate reliability best practices and policies
You might be a good fit if you:
- Have 3+ years of industry experience as a Site Reliability Engineer
- Have experience in Golang
- Have experience in managing infrastructure with Terraform at scale
- Are comfortable working with a fully distributed team
- Have experience as a software developer in a SaaS environment
- Have experience in a production environment supporting large-scale, mission-critical applications
- Have demonstrable expertise working with Microsoft Azure and/or Amazon Web Services.
- Have production on-call experience in a 24/7 cloud-based environment
- Have a good understanding of microservices, cloud infrastructure (AWS, Azure, GCP), databases (SQL, NoSQL, key/value), containers (Docker, Kubernetes), web technologies (WebSockets, HTTP), and networking (SSL, routing, VPN)
- Have exceptional communication skills, including technical writing in English
- Have a systematic problem-solving approach, coupled with a strong sense of ownership and drive
- Are comfortable with the Agile software development methodology
- Love working as part of a team, but can also work effectively in a remote environment where tasks may be self-driven