SRE, Staff
Synopsys
Job Description
Key Roles & Responsibilities
- Discover, design, and implement changes to existing IT infrastructure with a focus on improved reliability, performance, and standardization.
- Collaborate with Engineering and business units to translate customer, business, and technical requirements into SRE practices and enhancements.
- Ensure efficient resource utilization and continuously improve processes through automation and internal tools, resulting in enhanced service delivery, maturity, and scalability.
- Troubleshoot production issues, provide root cause analysis, and design solutions to prevent recurrence.
- Monitor services and create intelligent alerting for quicker incident detection and resolution.
- Maintain vulnerability management processes and policies using a risk-based priority methodology.
- Collaborate with teams and platform owners on all vulnerability management and reporting.
- Strategically apply architectural and infrastructure disciplines to solve business problems.
- Participate in off-hours maintenance activities and be part of the on-call rotation schedule.
Required Skills
- Extensive experience with a wide range of infrastructure technologies, including but not limited to Linux, Windows, high-performance computing, storage platforms, networking, cloud computing, cloud services (IaaS, PaaS, SaaS, etc.), virtualization, OpenStack, containerization, and orchestration technologies (e.g., Docker, Kubernetes).
- Deep understanding of IT infrastructure-related services and their dependencies, as required to troubleshoot issues and define mitigations.
- Solid experience with the administration, security hardening, and performance tuning of Linux and Windows OS. In-depth knowledge of CIS Benchmarks.
- Experience with developing service level indicators and objectives, instrumenting software, and building alerts.
- An understanding of software engineering fundamentals, with experience developing software on a team of engineers. Strong experience with software testing practices.
- Experience with the operations, administration, and development of orchestration systems such as Kubernetes, ECS, and Mesos.
- Passion for tracking down the technical root causes of issues in distributed systems and software.
- Experience with ITAM, Service Mapping, and CMDB (ServiceNow).
- Strong technical foundation, with the ability to engage deeply on technical topics related to data centre and cloud infrastructure, software reliability, and operational practices.
- Proficiency in ITIL (Information Technology Infrastructure Library) processes and frameworks.
- Service availability-oriented mindset with a proactive approach to problem solving. An ideal candidate should be able to develop automated solutions to prevent recurring problems.
- Ability and willingness to challenge the status quo and optimize current processes and procedures.
Experience & Education
- Master's or bachelor's degree with a minimum of 8 years of experience in IT infrastructure and operations, including 4+ years in an SRE role.
- Implementation experience with infrastructure automation tools and frameworks such as GitHub, Jenkins, Terraform (IaC), Ansible, and shell scripting.
- Hands-on experience with one or more of the following languages: Java, Python, Go, or Node.js.
- Knowledge of SDLC, Agile processes, and CI/CD tools.
- Well versed in ITIL processes, including incident, request, and change management.
- Good understanding of cloud, automation, networking, and SIEM tools.
- Excellent verbal and written communication skills.
- Excellent problem-solving skills and ability to work through issues and challenges.