Senior Site Reliability Engineer (SRE) Lead
Siemens
Job Description
Responsibilities
- Incident management and Game Day coordination
- Create and drive metrics and observability solutions and reviews
- Support production readiness reviews
- Serve as a cross-division role model to advance the SRE practice at Siemens
- Maintain full technical ownership of automation methods, codification of optional activities, microservice architecture, and platform engineering so that changes, updates, and technical advancements are in place for a product
- Ensure the team can provide the design, deployment, automation, and scripting solutions to drive new capabilities, visibility, and efficiency
- Simplify highly complex ideas, architectures and concepts to encourage achievable adoption
- Collaborate with other technical platforms and partners to engineer automated, integrated solutions across tools, services, and teams that increase availability, reliability, and performance
- Own internal and external SLAs and ensure they meet or exceed expectations
- Help maintain a 24x7, global, highly available SaaS environment
- Participate in an on-call rotation that supports our production infrastructure
- Troubleshoot production availability incidents that often span multiple teams and services
- Ensure the SRE team can coordinate production incident post-mortems and contribute to solutions that prevent recurrence, with the goal of automated response to all non-exceptional service conditions
- Communicate with business and technical partners about incidents as they occur when they critically impact system performance or availability
Required Knowledge/Skills, Education, and Experience
- Bachelor’s degree or equivalent experience
- Proven experience as a Site Reliability Engineer or in an equivalent role
- Experience working in a large organization through an SRE transformation in which existing applications were adapted to contemporary targets
- Proven experience with automation via scripting and API development
- Experience with software development in the cloud
- Experience with monitoring tools (Datadog, CloudWatch, CloudTrail, Cloudability, or equivalent tools)
- Proven experience with containerization, specifically Kubernetes
- Experience with Amazon Web Services (AWS) services and Terraform, CloudFormation, Ansible, or equivalent tools
Preferred Knowledge/Skills, Education, and Experience
- Desired certifications include Datadog, Kubernetes, security, and AWS certifications
- Understanding of ITIL
- Deep understanding of SRE and incident management strategies
- Experience with issue/incident tracking tools (ServiceNow, ServiceDesk, Jira, or equivalent tools) and open source tools (Linux, Python, Git, Ansible)
- Experience in enterprise IT with distributed environments
- Knowledge of networking concepts, including firewalls, VPN, routing, load balancers, security, and DNS
- Senior-level system administration experience, including troubleshooting, support, mentorship/training, and oversight