Site Reliability Engineer
Catchpoint
Job Description
Responsibilities
- Define and refine the whole service lifecycle, from inception and design through deployment and operation to retirement.
- Assess services once they are live by measuring and monitoring availability, latency, and overall system health. Establish performance baselines and define actions and automations based on data correlated from multiple sources.
- Design, build, and maintain logging and telemetry systems that are used to manage all services.
- Design, code, test, and deliver software to automate manual operational work.
- Troubleshoot priority incidents, facilitate blameless post-mortems and ensure permanent closure of incidents.
- Identify application patterns and analytics in support of better service level objectives.
- Deploy and maintain systems that run on multiple cloud providers (AWS, GCP, Azure, Alibaba, Tencent, Oracle, IBM) and physical systems around the world.
- Be part of an on-call rotation to support production systems.
Required Skills & Qualifications
- Strong experience with and knowledge of administering application servers, web servers, and databases.
- Familiarity with infrastructure automation, configuration management, and CI/CD tools (preferably Terraform).
- Experience with multiple cloud platforms (AWS, GCP, Azure).
- Good networking knowledge and experience with Internet Architecture (BGP, peering, DNS).
- 2+ years of incident resolution experience in a large-scale operations environment.
- Hands-on experience with cloud deployment, monitoring, and ops analysis tools such as Prometheus, Elasticsearch, Grafana, Kibana, Splunk, Terraform, Jenkins, etc.
- 3+ years of programming experience with Python, Bash, PowerShell, C, etc.
- Virtualization experience required.
- BS degree in Computer Science or a related technical field involving coding, or equivalent practical experience.
- Appreciation of the value of diversity of opinions.