Site Reliability Engineer
Viasat
Job Description
What you'll do
- Contribute as part of a distributed team of Production Engineers who improve the reliability and performance of the production network
- Troubleshoot and debug the hardest problems, live, in our ever-growing large-scale distributed infrastructure and production environment
The day-to-day
- Share an on-call rotation in a follow-the-sun 24x7 support model to keep the service reliable
- Apply software engineering and networking knowledge to build tools and data platforms that reduce time to restore and eliminate toil
- Build strong relationships with partner engineering teams to grow a shared knowledge base and drive operational excellence
What you'll need
- 4+ years of software development experience
- Experience coding in one or more of the following programming languages: Python, C/C++, Java, and/or Go
- Understanding of communication protocols and networking fundamentals
- Experience designing, analyzing, and troubleshooting large-scale distributed systems
- Systematic problem-solving approach
- Excellent verbal and written communication skills
- BS or MS in Computer Science, a similar technical field, or equivalent practical experience
What will help you on the job
- Experience as a Site Reliability Engineer (SRE) or Production Engineer (PE)
- Knowledge of satellite communication or wireless network technologies (WiMAX, LTE)
- Experience with Unix/Linux operating system administration
- Experience managing cloud infrastructure and developing operational processes
- Experience working with Kubernetes
- Experience in automation and machine learning (ML)
- Experience building operational runbooks and monitoring dashboards (e.g. Splunk, Grafana)
- Experience with Continuous Integration (CI) / Continuous Deployment (CD) / DevOps best practices
- Experience with cloud computing technologies such as AWS, Google Cloud, or Microsoft Azure