Site Reliability Engineer
wsa
Job Description
Key Responsibilities:
- Ensure System Uptime and Reliability: Monitor and maintain cloud-based applications and infrastructure, ensuring minimal downtime and efficient incident response.
- Build and Optimize Monitoring and Alerting Systems: Set up and continuously improve comprehensive monitoring and alerting frameworks to detect and address issues proactively.
- Cloud Infrastructure Management: Manage, optimize, and scale systems on the Azure cloud platform, ensuring high performance and cost-effectiveness.
- Incident Management and Response: Act as the first line of defense in identifying, diagnosing, and resolving technical issues in real time, escalating them to the appropriate teams when necessary.
- Automation and Infrastructure as Code (IaC): Utilize IaC tools to automate infrastructure provisioning and management, promoting reproducibility and reducing manual interventions.
- Tooling and Observability: Leverage technologies such as Grafana for observability and Argo for CI/CD automation, enhancing our ability to respond swiftly and effectively to infrastructure needs.
- Collaboration: Work closely with cross-functional teams to align on SRE best practices, share insights, and support development and operational goals.
Requirements:
- Experience with Cloud Platforms: 5+ years of experience in cloud environments, with a primary focus on Azure.
- Monitoring and Alerting Skills: Strong experience with monitoring tools (e.g., Grafana, Prometheus) and a background in setting up alerts and dashboards.
- Incident Management: Proven track record in diagnosing and troubleshooting complex system issues, with a focus on fast incident response and resolution.
- Collaboration and Communication: Excellent communication skills, with an ability to work collaboratively with various technical teams and stakeholders.
- Kubernetes Expertise: Proficiency with Kubernetes (K8s) for orchestrating and managing containerized applications.
- Automation and IaC: Hands-on experience with at least one scripting language (e.g., Python, shell scripting, PowerShell) and with Infrastructure as Code tools (e.g., Terraform, Ansible) for automating cloud infrastructure management.
Preferred Qualifications:
- Familiarity with CI/CD tools, particularly Argo for workflow automation.
- Certification in Azure, AWS, or Kubernetes.
- Experience working in an SRE or DevOps capacity in a multi-cloud environment.