Senior Site Reliability Engineer
Kroll
Job Description
Responsibilities include, but are not limited to:
- Cloud Infrastructure Management:
- Design, deploy, and manage scalable, secure, and resilient infrastructure on Microsoft Azure.
- Implement infrastructure-as-code (IaC) using tools such as Terraform, ARM templates, or Azure Bicep to automate cloud infrastructure provisioning and management.
- Optimize cloud resources for cost, performance, and scalability.
- Reliability and Performance Engineering:
- Monitor system performance, reliability, and availability metrics across Azure services and identify areas for improvement.
- Develop and implement strategies to reduce system downtime, improve performance, and manage incidents effectively.
- Conduct root cause analysis (RCA) for incidents and drive long-term improvements to prevent recurrence.
- Automation and Tooling:
- Automate repetitive tasks and processes to improve efficiency and reduce operational overhead.
- Develop and maintain CI/CD pipelines using Azure DevOps, ensuring seamless code deployment and infrastructure updates.
- Implement and manage monitoring, alerting, and logging solutions using Azure Monitor, Log Analytics, Application Insights, or other tools.
- Security and Compliance:
- Ensure that all cloud environments adhere to security best practices, including identity and access management, encryption, and compliance with regulatory standards.
- Collaborate with security teams to implement and maintain robust security controls across all cloud resources.
- Perform regular security audits and vulnerability assessments.
- Incident Management and Response:
- Serve as a primary point of contact for cloud-related incidents, ensuring timely resolution and effective communication with stakeholders.
- Participate in on-call rotations to provide 24/7 support for critical systems and services.
- Develop runbooks, standard operating procedures (SOPs), and playbooks for incident response and recovery.
- Collaboration and Continuous Improvement:
- Work closely with development teams to integrate reliability and performance considerations into the software development lifecycle (SDLC).
- Foster a culture of continuous improvement by identifying and adopting process enhancements, automation opportunities, and best practices.
- Mentor junior engineers on cloud reliability, automation, and operational best practices.
- Documentation and Reporting:
- Maintain comprehensive documentation of cloud infrastructure, configurations, processes, and incident reports.
- Generate regular reports on system health, performance, and reliability metrics for management and stakeholders.
- Contribute to knowledge-sharing initiatives and documentation within the team.
Requirements:
- Education:
- Bachelor’s degree in Computer Science, Information Technology, or a related field. A master’s degree is a plus.
- Experience:
- 3+ years of experience in cloud infrastructure management, with a focus on Microsoft Azure.
- Proven experience in site reliability engineering, DevOps, or cloud operations roles.
- Hands-on experience with infrastructure-as-code (IaC) tools such as Terraform, ARM templates, or Azure Bicep.
- Strong background in automation, scripting (e.g., Python, PowerShell, Bash), and CI/CD pipelines.
- Experience with monitoring, alerting, and logging tools in an Azure environment (Azure Monitor, Log Analytics, Application Insights).