Site Reliability Engineer

Catchpoint

India, 3 years' experience, posted 463d ago

Responsibilities

Define and refine the whole service lifecycle, from inception and design through deployment and operation to retirement.
Assess services once they are live by measuring and monitoring availability, latency and overall system health. Establish performance baselines, define actions and automations based on data correlated from multiple sources.
Design, build, and maintain logging and telemetry systems that are used to manage all services.
Design, code, test, and deliver software to automate manual operational work.
Troubleshoot priority incidents, facilitate blameless post-mortems and ensure permanent closure of incidents.
Identify application patterns and analytics in support of better service level objectives.
Deploy and maintain systems that run on multiple cloud providers (AWS, GCP, Azure, Alibaba, Tencent, Oracle, IBM) and physical systems around the world.
Be part of an on-call rotation to support production systems.

Required Skills & Qualifications

Strong experience with, or knowledge of, administering application servers, web servers, and databases.
Familiarity with infrastructure automation, configuration management, and CI/CD tools (preferably Terraform).
Experience with multiple cloud platforms (AWS, GCP, Azure).
Good networking knowledge and experience with Internet Architecture (BGP, peering, DNS).
2+ years of incident resolution experience in a large-scale operations environment.
Hands-on experience with cloud deployment, monitoring, and ops analysis tools such as Prometheus, Elasticsearch, Grafana, Kibana, Splunk, Terraform, Jenkins, etc.
3+ years of programming experience with Python, Bash, PowerShell, C, etc.
Virtualization experience required. 
BS degree in Computer Science or related technical field involving coding or equivalent practical experience.
Appreciation of the value of diversity of opinions.