Senior Site Reliability Engineering

NVIDIA

Bengaluru, India | 12 Years Exp | Posted 4d ago

Job Description

  • Lead design, deployment, and operations of production NAS, SAN, and Object Storage platforms, ensuring reliability, performance, and security.

  • Capture requirements from partner teams, architect storage solutions, and drive end‑to‑end implementation for new and existing services.

  • Develop, maintain, and improve automation for provisioning, configuration, monitoring, incident response, and lifecycle management of storage infrastructure.

  • Participate in on‑call and incident response, lead troubleshooting of complex storage and performance issues, and drive root cause analysis and preventive actions.

  • Define and track SLOs/SLIs and error budgets for storage services, using observability and analytics to continuously improve reliability and efficiency.

  • Build and maintain runbooks, standard operating procedures, and comprehensive documentation for storage services and automation.

  • Analyze capacity and usage trends, perform forecasting, and recommend scaling or optimization strategies to support business growth.

  • Collaborate closely with SRE, infrastructure, networking, and application teams in a follow‑the‑sun model to deliver consistent, high‑quality service.

  • Mentor junior engineers, share best practices, and help drive adoption of SRE principles across the team.
