Senior Site Reliability Engineer

Kroll

Hyderabad | 3 Years Exp | Posted 454d ago

Job Description

Responsibilities include, but are not limited to:

Cloud Infrastructure Management:
- Design, deploy, and manage scalable, secure, and resilient infrastructure on Microsoft Azure.
- Implement infrastructure-as-code (IaC) using tools such as Terraform, ARM templates, or Azure Bicep to automate cloud infrastructure provisioning and management.
- Optimize cloud resources for cost, performance, and scalability.
Reliability and Performance Engineering:
- Monitor system performance, reliability, and availability metrics across Azure services and identify areas for improvement.
- Develop and implement strategies to reduce system downtime, improve performance, and manage incidents effectively.
- Conduct root cause analysis (RCA) for incidents and drive long-term improvements to prevent recurrence.
Automation and Tooling:
- Automate repetitive tasks and processes to improve efficiency and reduce operational overhead.
- Develop and maintain CI/CD pipelines using Azure DevOps, ensuring seamless code deployment and infrastructure updates.
- Implement and manage monitoring, alerting, and logging solutions using Azure Monitor, Log Analytics, Application Insights, or other tools.
Security and Compliance:
- Ensure that all cloud environments adhere to security best practices, including identity and access management, encryption, and compliance with regulatory standards.
- Collaborate with security teams to implement and maintain robust security controls across all cloud resources.
- Perform regular security audits and vulnerability assessments.
Incident Management and Response:
- Serve as a primary point of contact for cloud-related incidents, ensuring timely resolution and effective communication with stakeholders.
- Participate in on-call rotations to provide 24/7 support for critical systems and services.
- Develop runbooks, standard operating procedures (SOPs), and playbooks for incident response and recovery.
Collaboration and Continuous Improvement:
- Work closely with development teams to integrate reliability and performance considerations into the software development lifecycle (SDLC).
- Foster a culture of continuous improvement by identifying and implementing process enhancements, automation opportunities, and best practices.
- Mentor and provide guidance to junior engineers on cloud reliability, automation, and best practices.
Documentation and Reporting:
- Maintain comprehensive documentation of cloud infrastructure, configurations, processes, and incident reports.
- Generate regular reports on system health, performance, and reliability metrics for management and stakeholders.
- Contribute to knowledge-sharing initiatives and documentation within the team.

Requirements:

Education:
- Bachelor’s degree in Computer Science, Information Technology, or a related field. A master’s degree is a plus.
Experience:
- 3+ years of experience in cloud infrastructure management, with a focus on Microsoft Azure.
- Proven experience in site reliability engineering, DevOps, or cloud operations roles.
- Hands-on experience with infrastructure-as-code (IaC) tools such as Terraform, ARM templates, or Azure Bicep.
- Strong background in automation, scripting (e.g., Python, PowerShell, Bash), and CI/CD pipelines.
- Experience with monitoring, alerting, and logging tools in an Azure environment (Azure Monitor, Log Analytics, Application Insights).
