Sr Site Reliability Engineer
GE Aerospace
Job Description
Role overview:
- Understand business requirements and collaborate with Product & DevOps teams to implement highly available, scalable, resilient, cost-efficient solutions in Cloud environments.
- Deploy observability tools (New Relic, Splunk, ELK, and other open-source o11y tools) in our Cloud infrastructure and applications via Terraform, and be the SME for these tools.
- Create and configure alerts, dashboards, and reports that map to the four golden signals: latency, errors, traffic, and saturation.
- Pioneer the definitions of SLIs, SLOs, and error budgets for GE Aerospace Digital Workplace’s products and services, and champion their implementation for large-scale adoption.
- Perform Root Cause Analysis (RCA) for SLO breaches, alerts, and incidents. Lead the troubleshooting and debugging sessions.
- Solve problems relating to critical products, applications, and services, and create solutions (automations, runbooks, etc.) to prevent problem recurrence.
- Lead the Incident Management and Postmortem processes, and collaborate with the Operations team to develop templates for comms, runbooks, and documents.
- Consistently share best practices for reliability, resiliency, and performance, and improve processes within and across teams.
- Apply a data-driven approach to decisions about capacity needs, Cloud cost optimization, and infrastructure stability.
- Prioritize reducing MTTx (Mean Time to Recover/Resolve/Repair) for Production incidents to provide a better user experience.
- Propose new designs and develop solutions to complex problems in application resiliency and availability.
- Be a strong technical mentor to junior team members and help them realize their full potential.
Required Qualifications/Requirements:
- Bachelor’s degree from a recognized university or college with a minimum of 4 years of professional experience OR Diploma with a minimum of 5 years of professional experience OR Higher Secondary Certificate with a minimum of 7 years of professional experience.
Preferred Qualifications:
- A minimum of 2 years of experience in Production Engineering or Site Reliability Engineering roles.
- A minimum of 2 years of experience in Cloud environments (e.g., AWS, Azure).
- A minimum of 2 years of experience in the DevOps and Infrastructure domains.