Site Reliability Engineer - O/S
spglobal
Job Description
Responsibilities:
- Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding.
- Partner with development teams to improve services through rigorous testing and release procedures.
- Participate in system design consulting, platform management, and capacity planning.
- Create sustainable systems and services through automation.
- Balance feature development speed and reliability with well-defined service-level objectives.
- Day-to-day management of VMC/AWS infrastructure.
- Build and document automation processes for Infrastructure as a Service/Infrastructure as code.
- Backup and patch management.
What We’re Looking For:
We are looking for someone who has:
- Bachelor’s degree (or equivalent) in computer science or a related discipline, with at least 7 years of experience.
- Proactive approach to identifying problems, performance bottlenecks, and areas for improvement.
- Strong interpersonal, analytical, and problem-solving skills, along with strong written and verbal communication.
- Solid understanding of and hands-on experience with container orchestration. This role is responsible for configuring, deploying, maintaining, troubleshooting, and monitoring container orchestration on AWS.
- Ability to communicate ideas in both technical and non-technical ways.
- A strong capacity for teamwork and a sense of ownership, with the ability to work independently and be self-driven.
- Hands-on experience with Linux servers, AD, LDAP, DNS, network storage, and AWS services (EC2, FSx, Managed AD, Route 53, etc.).
- Ability to script and automate using tools and languages such as PowerShell, Python, Bash, Ansible, and Terraform.
- Familiarity with ITSM processes such as Incident, Problem, and Change Management, preferably using ServiceNow.