Site Reliability Engineer
eBay Inc.
Job Description
What you will accomplish:
- Proactive Monitoring: Continuously monitor the health of eBay's critical services to identify and address potential issues before they escalate.
- Solution Development: Collaborate with Architecture, Engineering, and Operations teams to develop solutions that ensure high site availability, reliability, and performance.
- Collaborative Problem Solving: Work closely with partner teams to resolve recurring technical issues, onboard new alerts, and develop high-quality Standard Operating Procedures (SOPs).
- Enhance Monitoring Tools: Build and improve tools for monitoring and mitigating site incidents, and conduct reliability audits and tests to strengthen eBay’s reliability and incident management capabilities.
- Incident Management: Act as Incident Commander to drive resolution of major incidents, manage alarms, and ensure effective communication with leadership and partner teams.
What you will bring:
- 4+ years of professional experience in software engineering, ideally in backend or platform teams.
- Proficiency in one or more programming languages (e.g., Java, Go, Python).
- Strong incident management and leadership skills, with excellent technical triage and troubleshooting abilities, especially during crises.
- Familiarity with cloud platforms, container orchestration (e.g., Kubernetes), and infrastructure-as-code tools.
- Experience with observability stacks (e.g., Prometheus, Grafana, ELK, OpenTelemetry).
- Strong interpersonal and communication skills to thrive in fast-paced, dynamic environments.