Site Reliability Engineer III
Khoros
Job Description
Responsibilities:
- Manage environments in the cloud.
- Monitor, troubleshoot, and resolve issues related to infrastructure, applications, and services.
- Monitor availability and keep systems in good health.
- Implement automation tools and processes to improve efficiency and reliability.
- Participate in on-call rotation and respond to incidents promptly.
- Continuously evaluate and improve our systems and processes to enhance reliability and performance.
- Document runbooks and procedures.
- Work closely with first-level support groups as well as development groups.
- Follow departmental change management procedures when defining, planning, and implementing changes, so that service disruption is minimized and Service Level Agreements are met.
- Perform incident root cause analysis.
- Take ownership of projects and issues independently, and also work well in a team environment.
- Be a Team Player – work in a collaborative, team-oriented environment, share information, respect diverse ideas, interact with customers, and partner with cross-functional and remote teams.
- Be Curious & Innovative – continuously update yourself with next-generation technology and development tools, and contribute to process development practices. Evaluate new technologies and software products to determine the feasibility and desirability of incorporating capabilities within the company's products.
- Be Agile – with a strong sense of urgency and a desire to work in a fast-paced, dynamic environment to deliver solutions against strict timelines.
Requirements:
- 4+ years of experience as an SRE in fast-paced, high-traffic environments.
- Experience deploying and maintaining applications on at least one cloud platform (AWS is a must-have; Azure or GCP is good to have).
- Working knowledge of Linux and Windows operating systems.
- Working knowledge of at least one scripting language: Shell, Bash, Python, or PowerShell.
- Understanding of containerization and orchestration technologies (e.g., Docker, Kubernetes).
- Working knowledge of Jenkins, Ansible, Terraform, and ArgoCD (good to have).
- Experience administering databases (MS SQL, MongoDB, etc.).
- Extensive experience with monitoring, logging, and observability tools (Sumo Logic, Datadog, AWS CloudWatch, AWS X-Ray, New Relic, Splunk, etc.).
- Ability to debug issues and solve problems.
- Excellent problem-solving and communication skills.
- Ability to work independently and collaborate effectively in a team environment.
- Familiarity with agile development methodologies is a plus.