Senior Site Reliability Engineer

Kroll

Hyderabad | 3 Years Exp | Posted 454d ago

Job Description

Responsibilities include, but are not limited to:

  • Cloud Infrastructure Management:
    • Design, deploy, and manage scalable, secure, and resilient infrastructure on Microsoft Azure.
    • Implement infrastructure-as-code (IaC) using tools such as Terraform, ARM templates, or Azure Bicep to automate cloud infrastructure provisioning and management.
    • Optimize cloud resources for cost, performance, and scalability.
  • Reliability and Performance Engineering:
    • Monitor system performance, reliability, and availability metrics across Azure services and identify areas for improvement.
    • Develop and implement strategies to reduce system downtime, improve performance, and manage incidents effectively.
    • Conduct root cause analysis (RCA) for incidents and drive long-term improvements to prevent recurrence.
  • Automation and Tooling:
    • Automate repetitive tasks and processes to improve efficiency and reduce operational overhead.
    • Develop and maintain CI/CD pipelines using Azure DevOps, ensuring seamless code deployment and infrastructure updates.
    • Implement and manage monitoring, alerting, and logging solutions using Azure Monitor, Log Analytics, Application Insights, or other tools.
  • Security and Compliance:
    • Ensure that all cloud environments adhere to security best practices, including identity and access management, encryption, and compliance with regulatory standards.
    • Collaborate with security teams to implement and maintain robust security controls across all cloud resources.
    • Perform regular security audits and vulnerability assessments.
  • Incident Management and Response:
    • Serve as a primary point of contact for cloud-related incidents, ensuring timely resolution and effective communication with stakeholders.
    • Participate in on-call rotations to provide 24/7 support for critical systems and services.
    • Develop runbooks, standard operating procedures (SOPs), and playbooks for incident response and recovery.
  • Collaboration and Continuous Improvement:
    • Work closely with development teams to integrate reliability and performance considerations into the software development lifecycle (SDLC).
    • Foster a culture of continuous improvement by identifying automation opportunities and implementing process enhancements and best practices.
    • Mentor and guide junior engineers on cloud reliability, automation, and best practices.
  • Documentation and Reporting:
    • Maintain comprehensive documentation of cloud infrastructure, configurations, processes, and incident reports.
    • Generate regular reports on system health, performance, and reliability metrics for management and stakeholders.
    • Contribute to knowledge-sharing initiatives and documentation within the team.

Requirements:

  • Education:
    • Bachelor’s degree in Computer Science, Information Technology, or a related field. A master’s degree is a plus.
  • Experience:
    • 3+ years of experience in cloud infrastructure management, with a focus on Microsoft Azure.
    • Proven experience in site reliability engineering, DevOps, or cloud operations roles.
    • Hands-on experience with infrastructure-as-code (IaC) tools such as Terraform, ARM templates, or Azure Bicep.
    • Strong background in automation, scripting (e.g., Python, PowerShell, Bash), and CI/CD pipelines.
    • Experience with monitoring, alerting, and logging tools in an Azure environment (Azure Monitor, Log Analytics, Application Insights).
