Site Reliability Engineer
Equifax
Job Description
What you’ll do
-
Proven experience as a Site Reliability Engineer or Software Engineer with a focus on operations and automation.
-
Expert-level proficiency in a scripting/programming language (e.g., Python, Go).
-
Demonstrated experience in designing and building automation frameworks for infrastructure and application management.
-
Strong understanding of AI/ML concepts and practical experience applying them to operational data (e.g., anomaly detection, predictive analytics).
-
Deep expertise in observability tools (e.g., Looker, Prometheus, Grafana) and using data to drive decisions.
-
Excellent leadership and communication skills, with the ability to mentor team members and collaborate effectively with other engineering teams.
-
Manage system(s) uptime across cloud-native (AWS, GCP) and hybrid architectures.
-
Build infrastructure as code (IaC) patterns that meet security and engineering standards using one or more technologies (Terraform, scripting with a cloud CLI, or programming with a cloud SDK).
-
Build automated tooling that deploys service requests and pushes changes into production. Build comprehensive, detailed runbooks to detect, remediate, and restore services.
-
Triage and solve problems across complex distributed architecture service maps. Serve on call for high-severity application incidents and refine runbooks to reduce MTTR.
-
Lead blameless availability postmortems and own the calls to action that prevent recurrences.
What experience you need
-
BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics), or equivalent job experience required
-
7-10 years of experience in software engineering, systems administration, database administration, and networking.
-
4+ years of experience developing and/or administering software in public cloud
-
Experience in monitoring infrastructure and application uptime and availability to ensure functional and performance objectives are met.
-
Experience in languages such as Python, Bash, Java, Go, JavaScript, and/or Node.js
-
Demonstrable cross-functional knowledge with systems, storage, networking, security and databases
-
System administration skills, including automation and orchestration of Linux/Windows using Terraform, Chef, Ansible and/or containers (Docker, Kubernetes, etc.)
-
Proficiency with continuous integration and continuous delivery tooling and practices
-
Cloud certification strongly preferred