Site Reliability Engineer
Netradyne
Job Description
Role and Responsibilities:
- Participate in an on-call rotation for incident response and implement proactive measures to prevent incidents.
- Develop monitoring alerts and incident response processes to ensure high availability and reliability.
- Document actions taken during incidents and create automated solutions to improve incident response.
- Collaborate with the engineering team as an expert in reliability, performance, and efficiency to support ongoing projects.
- Consistently deliver high-quality managed services, ensuring optimal uptime and scalability of infrastructure, applications, and cloud services.
- Automate the detection and resolution of recurring issues to enhance system stability.
- Build tools and automation frameworks to eliminate repetitive tasks and prevent incidents.
- Continuously improve engineering, operational processes, and team practices to enhance efficiency and productivity.
- Demonstrate strong programming skills and a deep understanding of systems to support the reliability and scalability of services.
- Foster a culture of continuous improvement by promoting process changes and best practices.
- Engage in continuous learning to expand skills through experimentation or training.
Soft Skills:
- Ability to work asynchronously and independently.
- Strong collaboration skills and willingness to work as part of a team.
- Excellent problem-solving skills with the ability to think clearly under pressure.
- Strong analytical and management skills.
- Effective communication and documentation skills.
Qualifications:
- Bachelor's or graduate degree in Computer Engineering, Computer Science, Engineering, or Information Systems Management, or equivalent experience.
- Experience with monitoring, observability, and logging tools such as AWS CloudWatch, Datadog, Prometheus/Grafana, and ELK.
- Proficiency with public cloud platforms, Linux/UNIX environments, and programming languages such as Java, Python, or Go.
- Familiarity with Agile methodologies, SaaS environments, RDBMS and NoSQL databases, cloud architecture, and frontend/backend systems and tools.
- Comfortable with scripting and debugging production systems and services.
- Strong collaboration skills with a mindset for continuous improvement.
- Expertise in scalability and root cause analysis.