Site Reliability Engineer
Equifax
Job Description
Key responsibilities
-
Operations experience in supporting highly scalable systems.
-
Ability to operate in a 24x7 environment spanning global time zones.
-
Experience designing and implementing an efficient CI/CD flow that moves code from dev to prod with high quality and minimal manual effort is desired.
-
Kubernetes: Design, deploy, and manage Kubernetes clusters in production, optimizing for performance and reliability.
-
Cloud Infrastructure: Build and maintain scalable infrastructure on GCP (or other cloud providers), leveraging automation tools like Terraform.
-
Performance Engineering:
-
Identify and analyze performance bottlenecks in applications and infrastructure.
-
Develop and implement performance optimizations.
-
Observability: Implement comprehensive monitoring and logging solutions to proactively detect and resolve issues.
-
Incident Response: Participate in on-call rotations, troubleshooting and resolving production incidents with a focus on minimizing downtime.
-
Collaboration: Work closely with product development teams to promote reliability best practices and ensure smooth deployments.
-
Manage system uptime across cloud-native (AWS, GCP) and hybrid architectures.
-
Build infrastructure as code (IaC) patterns that meet security and engineering standards using one or more technologies (Terraform, scripting with cloud CLI, and programming with cloud SDK).
-
Build CI/CD pipelines for the build, test, and deployment of application and cloud architecture patterns, using platform (Jenkins) and cloud-native toolchains.
-
Build automated tooling to deploy service requests that push changes into production.
-
Solve problems and triage issues across complex distributed architecture service maps.
-
Build comprehensive, detailed runbooks to manage, detect, remediate, and restore services.
-
Lead blameless availability postmortems and own the action items to prevent recurrences.
-
Serve on call for high-severity application incidents and refine runbooks to reduce MTTR.
-
Participate in a 24/7 team of first responders under a follow-the-sun operating model for incident and problem management.
-
Effectively communicate to technical peers and team members in both written and verbal formats.
What experience you need
-
Bachelor's degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent job experience required.
-
5-7 years of experience working with containers (Docker, Kubernetes).
-
5-7 years of experience working with public cloud environments (GCP preferred).
-
Strong system administration skills, including automation and orchestration on Linux.
-
Strong Kubernetes knowledge and hands-on production administration skills.
-
Programming experience in one or more languages such as Python, Bash, Java, Go, Groovy or similar languages.
-
Proficiency in identifying and analyzing performance bottlenecks in applications and infrastructure.
-
Proficiency with continuous integration and continuous delivery (CI/CD) using tools like Jenkins and Git.
-
5-7 years of experience monitoring infrastructure and application performance.
-
Solid understanding of application design principles and trade-offs.
-
Knowledge of network infrastructure and security basics (DNS, subnets, firewalls, load balancers).