Specialist DevOps & Site Reliability Engineer
EquiLend
Job Description
Role Responsibilities
- Design, build, and manage CI/CD pipelines that streamline software delivery and reduce lead time to production.
- Develop and support scalable, containerized solutions using Docker and Kubernetes.
- Implement and manage Infrastructure-as-Code (IaC) using tools such as Terraform and Ansible to ensure consistent environments.
- Lead incident response and post-mortem processes, championing best practices in availability, latency, and system resilience.
- Define and maintain service-level indicators (SLIs), objectives (SLOs), and agreements (SLAs) for key systems.
- Collaborate with application teams to define monitoring strategies and observability standards using Grafana, Prometheus, or similar tooling.
- Partner with global DevOps and infrastructure teams to drive automation, performance improvements, and cost optimization in cloud environments (AWS preferred).
- Provide technical mentorship to team members and act as a subject matter expert in Site Reliability Engineering practices.
- Contribute to the development of operational playbooks and automated runbooks for common failure scenarios.
- Continuously evaluate and adopt emerging tools and technologies that support the goals of high availability and rapid delivery.
Required Skills
- A minimum of 6 years of relevant experience in DevOps, SRE, or software infrastructure roles.
- Strong practical knowledge of CI/CD tooling (e.g., Jenkins, GitLab CI/CD, GitHub Actions) in distributed environments.
- Proven expertise with containerization and orchestration, especially Docker and Kubernetes.
- Hands-on experience with Infrastructure-as-Code tools like Terraform, Ansible, or Pulumi.
- Proficiency in scripting languages such as Python, Bash, or similar for automation and tooling.
- Experience implementing SRE principles, including reliability metrics, SLIs/SLOs, and chaos engineering practices.
- Strong familiarity with cloud infrastructure, ideally AWS; Azure or GCP experience is also valuable.
- Demonstrated experience with monitoring and alerting frameworks such as Prometheus, Grafana, ELK, or Splunk.
- AWS certification (Solutions Architect or DevOps Engineer) is a plus.
- Excellent collaboration and communication skills, with experience operating across globally distributed teams.