Senior Site Reliability/DevOps Engineer
Equifax
Job Description
What you’ll do
-
Architecture and Design: Participate in the design and architecture of highly scalable, resilient, and secure systems on Kubernetes. Contribute to the definition of SRE principles and best practices.
-
Automation: Develop and maintain automation frameworks for infrastructure provisioning, deployment, monitoring, and incident response using tools like Terraform, Ansible, Puppet, Chef, or similar.
-
Monitoring and Alerting: Design and implement comprehensive monitoring and alerting systems to proactively identify and resolve issues. Develop and maintain dashboards to track key performance indicators (KPIs).
-
Incident Management: Lead incident response efforts and conduct thorough post-incident reviews to identify root causes and implement preventative measures.
-
Capacity Planning: Proactively identify and address capacity constraints to ensure optimal system performance and availability.
-
Collaboration: Work closely with engineering, product, and security teams to align on system requirements and priorities.
-
Mentorship: Mentor and guide junior SRE/DevOps engineers, fostering a culture of continuous learning and improvement.
-
On-call Rotation: Participate in a rotating on-call schedule to provide 24/7 support for critical systems.
-
Security: Contribute to the security posture of our systems by implementing security best practices and participating in security audits and reviews.
-
Performance Optimization: Identify and resolve performance bottlenecks, optimizing system performance and resource utilization.
What experience you need
-
7+ years of experience as an SRE, DevOps Engineer, or in a similar role.
-
Deep understanding of cloud platforms, particularly GCP (AWS and Azure experience is a plus).
-
Extensive experience with containerization technologies like Docker and Kubernetes.
-
Proven experience with infrastructure-as-code and configuration management tools (e.g., Terraform, Ansible, Puppet, Chef).
-
Strong scripting and programming skills (e.g., Python, Go, Bash).
-
Experience with monitoring and logging tools (e.g., Datadog, Prometheus, Grafana, ELK stack).
-
Experience with CI/CD pipelines and tools (e.g., Jenkins, GitLab CI, CircleCI).
-
Experience with incident management and post-incident reviews.
-
Excellent problem-solving and troubleshooting skills.
-
Strong communication and collaboration skills.
-
Bachelor's degree in Computer Science or a related field; equivalent experience considered.