Site Reliability Engineer
Equifax
Job Description
What you'll do:
- Kubernetes: Design, deploy, and manage production-ready Kubernetes clusters.
- Cloud Infrastructure: Build and maintain scalable infrastructure on GCP using tools like Terraform.
- Performance: Identify and resolve performance bottlenecks in applications and infrastructure.
- Observability: Implement monitoring and logging to proactively detect and resolve issues.
- Incident Response: Participate in on-call rotations, troubleshooting and resolving production incidents.
- Collaboration: Promote reliability best practices and ensure smooth deployments.
- Automation: Build CI/CD pipelines, automated tooling, and runbooks.
- Problem Solving: Triage complex issues, lead blameless postmortems, and drive remediation.
- Mentorship: Guide and mentor other SREs.
What experience you need:
- BS in Computer Science or related field.
- 2+ years of experience developing and/or administering software in the public cloud.
- 5+ years of programming experience (Python, Bash/Shell Script, Java, Go, etc.).
- 3+ years of experience monitoring infrastructure and application performance.
- 5+ years of system administration experience, including automation and orchestration of Linux/Windows systems using Terraform, Chef, Ansible, and/or containers (Docker, Kubernetes, etc.).
- 5+ years of experience working with continuous integration and continuous delivery tooling and practices.