Site Reliability Engineer
Thomson Reuters
Job Description
In this opportunity as Site Reliability Engineer, you will:
- Work with application teams to bring applications into production and support them there
- Continuously improve the ongoing support model, including release and change management, for maintaining the strategic environments (e.g., production and non-production).
- Provide well-written documentation and technical presentations on projects supported by the team.
- Provide problem management services by utilizing diagnostic and debugging tools to aid in troubleshooting efforts, including 24x7 rotating pager support.
- Coordinate the implementation of application monitoring, establish support documentation, and provide training on products and procedures.
- Provide technical assistance with troubleshooting and performance tuning of the supported environment(s)
About You
You're a fit for the role of Site Reliability Engineer if your background includes:
- 3-5 years of experience in an enterprise-level operations support, SRE, or DevOps role.
- Working knowledge of infrastructure components (e.g., routers, load balancers, cloud products, container systems, compute, storage, and networks)
- Expertise in observability and monitoring tools such as Datadog, AppDynamics, and Splunk.
- Deep understanding of application performance monitoring (APM) and user monitoring.
- Knowledge of Infrastructure as Code (IaC) tools such as AWS CloudFormation, Ansible, and Terraform, and the ability to apply cloud compliance standards to application design to achieve reliability.
- Experience in site reliability engineering for .NET, Java, Kubernetes, and database platforms (such as Postgres).
- Experience with load balancers, networking concepts, and AWS services such as ECS, EMR, Step Functions (state machines), CloudFormation, CloudWatch, Lambda, SQS, ECR, Fargate, and Elasticsearch.
- Sound knowledge of ITSM processes, SLI/SLO/SLA management, incident resolution, and automation techniques.
- Strong IP networking fundamentals and experience with standard application protocols and message formats (e.g., TCP/IP, HTTP, SOAP, RESTful APIs, XML/JSON, JDBC, JMS/MQ).
- Ability to analyze application and server logs and interpret errors.
- Incident response and recovery: SREs are responsible for responding to incidents and implementing processes for incident response, monitoring, and automated recovery.
- Scripting knowledge in PowerShell and Bash (shell scripting).
- Ability to code in a programming language such as Java, C#, Python, or JavaScript.
- Working knowledge of ITIL Change and Incident management processes.
- Excellent written and verbal communication skills and strong collaboration skills.