Senior Site Reliability Engineer
Okta
Job Description
What you’ll be doing:
- Design, build, maintain, and deploy robust tools and pipelines to automate infrastructure provisioning, configuration, and deployment across multiple cloud environments (AWS, GCP, etc.);
- Be accountable and responsible for the setup, maintenance, and ongoing development of the Artifactory application suite: artifactories, globalization, and disaster recovery;
- Create and maintain fully automated CI build pipelines for multiple services;
- Architect, implement, and manage highly available and scalable cloud-native platforms leveraging Kubernetes, Linux, and other cutting-edge technologies;
- Develop and manage efficient multi-cloud deployment strategies, ensuring seamless application and infrastructure orchestration across diverse cloud environments;
- Create and maintain custom Amazon Machine Images (AMIs) tailored to specific application and infrastructure needs, optimizing performance and security;
- Collaborate with engineering and operations teams to ensure the reliability, performance, and security of production systems;
- Respond promptly to production incidents, troubleshoot complex issues, and implement preventive measures to minimize future disruptions;
- Build scalable and extensible platforms, services, and tools using Java, Python, Go, and other relevant technologies, with a focus on automation, reliability, and security;
- Identify and eliminate bottlenecks, manual processes, and inefficiencies, implementing automated solutions to improve operational efficiency and reduce human error;
- Leverage industry best practices in infrastructure, automation, and orchestration to drive innovation and explore emerging technologies that can enhance the platform's capabilities;
- Develop self-service tools and processes to empower teams to independently manage their infrastructure, reducing reliance on manual intervention; and
- Prioritize security and compliance by maintaining up-to-date base images, applying security patches, and implementing robust security measures to protect sensitive information and systems.
What we are looking for:
- 5+ years of experience with Java, Go, Python, or similar backend languages
- 5+ years of experience building, maintaining, and debugging services, internal tools, and frameworks
- 5+ years of experience automating and deploying large-scale production services in AWS, GCP, or similar
- 5+ years of experience managing CI/CD infrastructure, with strong proficiency in tools like Spinnaker, Jenkins, ArgoCD, GitLab, or any other CI/CD system to streamline deployment pipelines and ensure efficient software delivery
- Strong understanding of Kubernetes fundamentals, cluster administration, and container orchestration
- In-depth knowledge of Linux systems, including system administration, shell scripting, and security best practices
- In-depth knowledge of Artifactory or another artifact storage and replication service such as ECR or GCR
- Experience in designing, building, and managing complex deployment pipelines across multiple cloud providers
- Expertise in creating, configuring, and managing custom AMIs for various workloads and environments