Senior Site Reliability Engineer
Okta
Job Description
What you’ll be doing
- Designing, building, running, and monitoring Okta's production infrastructure
- Evangelizing security best practices and leading initiatives and projects to strengthen our security posture for critical infrastructure
- Responding to production incidents and determining how we can prevent them in the future
- Triaging and troubleshooting complex production issues to ensure reliability and performance
- Identifying and automating manual processes
- Continuously evolving our monitoring tools and platform
- Promoting and applying best practices for building scalable and reliable services across engineering
- Developing and maintaining technical documentation, runbooks, and procedures
- Supporting a 24x7 online environment as part of an on-call rotation
- Serving as a technical SME for a team that designs and builds Okta's production infrastructure, focusing on security at scale in the cloud
What you’ll bring to the role
- Are always willing to go the extra mile: see a problem, fix the problem.
- Are passionate about encouraging the development of engineering peers and leading by example.
- Have experience automating, securing, and running large-scale production Java/Tomcat and containerized services in AWS (EC2, ECS/EKS, KMS, Kinesis, RDS) or other cloud providers.
- Have experience deploying and managing Kubernetes (K8s) clusters (EKS preferred), monitoring and alerting in the Kubernetes ecosystem, and deploying microservices.
- Have deep knowledge of CI/CD principles, Linux fundamentals, OS hardening, networking concepts, and IP protocols.
- Have deep familiarity with configuration management and infrastructure-as-code tools such as Chef, Terraform, and Ansible.
- Have expert-level abilities in operational tooling languages such as Ruby, Python, Go, and shell, and are proficient with source control.
- Are familiar with industry-standard security tools like Nessus and OSQuery.
- Are familiar with data stores such as RDS, S3, Redis, Cassandra, and Elasticsearch.
Experience in the following
- 5+ years of experience architecting and running complex AWS or other cloud networking infrastructure resources
- 5+ years of experience with Infrastructure as Code tools such as Terraform, Chef, or Ansible
- 4+ years of experience with Kubernetes (K8s)
- Strong Linux understanding and experience
- Strong security background and knowledge
- BS in computer science (or equivalent experience)