Site Reliability Engineer
Boomi
Job Description
What You’ll Do
-
Participate actively in detecting, remediating, and reporting on production incidents, ensuring that SLAs/SLOs are defined and met.
-
Participate in on-call rotation to ensure coverage for planned/unplanned events.
-
Engage with other Engineering organizations to implement processes, identify improvements, and drive consistent results.
-
Work with your SRE and Engineering counterparts to drive DR exercises, game days, training, and other response-readiness efforts.
-
Collaborate with Service Engineering organizations to build and automate tooling, implement observability best practices, and manage Boomi services in production to consistently achieve our market-leading SLA.
-
Improve the scalability and reliability of Boomi’s systems in production.
-
Automate the provisioning and maintenance of Boomi’s infrastructure.
-
Work independently with a minimal level of guidance from technical leadership.
-
Mentor other Boomi engineers, including design collaboration and code reviews.
The Experience You Bring
-
Passionate about SRE, DevOps, automation, and infrastructure platforms. Expert in developing Ansible playbooks and in automating infrastructure as code using Terraform and AWS CloudFormation templates.
-
Expert in defining, measuring, and improving reliability metrics (SLOs, SLIs, and error budgets).
-
Strong in implementing observability practices (monitoring, logging, distributed tracing, etc.), preferably using Splunk and New Relic. Experience should extend beyond using existing dashboards to creating them from scratch.
-
Experience conducting and automating DR exercises in the AWS cloud to validate RPOs and RTOs.
-
Strong understanding of, and working experience with, AWS components.
-
Ability to design and implement APIs for use by internal teams.
Bonus Points If You Have
-
3–5 years of related experience in the software engineering industry, with experience supporting large scale software systems in production.
-
Cloud certification (AWS/Azure/GCP) and experience using services such as compute, containers, and databases.
-
Experience with Ansible/Terraform and Python.
-
A grasp of cloud-native concepts, containerization best practices, and cloud security awareness is a strong plus.
-
Experience in observability, including creating dashboards for SLAs/SLIs/SLOs.