Site Reliability Engineer
PubMatic
Job Description
Responsibilities:
- Operational Support
- Be a primary point of contact for operational support of multiple large-scale distributed software applications in the Ad Server environment.
- Monitor application availability, promptly detect anomalies, analyze their impact, debug problems in production, and follow up on resolution by working closely with the engineering team.
- Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
- Diligently work with the engineering team to expedite the resolution of incidents and ensure a swift return to normal operations.
- Be innovative in building dashboards, adding metrics, writing automation scripts to reduce operational toil, and streamlining processes to enhance system reliability and stability.
- Design and construct software and systems to effectively manage the Ad Serving platform, its underlying infrastructure, and applications.
- On-Call Availability and Support
- Work in shifts to provide continuous on-call support for the production systems and resolve issues independently using predefined handbooks.
- Show a sense of urgency for high-priority issues and arrange war rooms to resolve the problems.
- Provide timely updates on high-priority issues and hand over to the next shift when a problem needs round-the-clock (24x7) attention.
- Conduct post-incident reviews to identify root causes, recommend preventive measures, and contribute to a culture of learning and improvement.
Requirements:
- Bachelor's degree in computer science or a related discipline
- 3+ years of total experience in software development
- Ability to program in languages like C or C++ and in scripting languages like Shell or Python
- Prior experience in a technical engineering role is good to have
- A proactive approach to identifying problems, performance bottlenecks, and areas for improvement
- Must know networking, database (MySQL), and Linux system concepts, and be able to debug and analyze core dumps
- Hands-on experience with monitoring and observability tools like Grafana, Nagios, Influx, ELK, etc.
- Familiarity with container and orchestration tools like Docker and incident management systems like Zenduty
- Excellent communication and collaboration skills, with the ability to work effectively across teams.
- Self-motivated, with a positive mindset for investigating incidents