Observability Engineer
BlackLine
Job Description
You'll Get To:
- Ensure 99.99%+ availability of the services and infrastructure that span multiple global datacentres in private and public clouds.
- Troubleshoot BL container platforms and supporting automation in a highly available, high-traffic environment.
- Monitor and maintain health, performance, and security of all infrastructure components.
- Build systems and perform the tasks needed to deliver against committed project timelines, with a desire to automate everything.
- Solve real-life problems in a bleeding-edge, high-performance, and high-traffic environment.
- Maintain documentation and the operational knowledge base.
- Triage first-level events and incidents.
- Adhere to change management and other established processes and procedures.
- Respond to and troubleshoot incidents (Incident Management). Conduct root cause analyses.
- Evaluate and analyse systems, performance, issues, and metrics to recommend continuous improvements.
- Comply with SLAs as defined.
- Participate in a scheduled 24/7 on-call rotation for second-tier support escalations.
- Be willing to work 3 days from the office.
What You'll Bring:
- 3-6 years of industry experience.
- 3+ years supporting Unix and/or Linux (Ubuntu, CentOS, Red Hat) and/or Windows.
- 3+ years supporting a critical, revenue-generating SaaS or hosting environment.
- 2+ years working with development and continuous integration tooling (Jenkins, Bitbucket, GitHub).
- 2+ years working with tools such as New Relic, Jira, and Foglight.
- 1+ year of experience using container platforms and tooling (Kubernetes, Docker, Rancher, Helm, Anthos, Istio, GKE, AKS, etc.).
- Experience in hybrid cloud and/or multi-cloud environments (GCP as primary, plus Azure, AWS, VMware).
- Understanding of software development processes and methodologies.
- Experience with scripting and/or systems programming languages (Bash, PowerShell, Python, Golang, C#).
- Hands-on problem-solving skills, technical leadership and mentoring qualities.
- Strong written and oral communication skills.
- Ability to participate in an on-call rotation.
- A minimum of two years of experience in a 24/7 operations organization, deploying and operating complex cloud infrastructure at scale.
- Hybrid work with 3 days in the office is mandatory.