Software Engineer II, Application SRE

India | 2 Years Exp | Posted 48d ago

Job Description

  • This role reports to a leader in Platform Engineering & Site Reliability Engineering.

  • Builds resilient cloud systems, automates operations, and ensures high reliability through robust monitoring and performance optimization.

  • Leads incident response and fosters a collaborative SRE culture, driving team evolution and proactively preventing issues to maintain optimal service levels.

What you'll do

  • Assist in building and maintaining highly available and fault-tolerant applications.

  • Support the setup and maintenance of monitoring, logging, and alerting systems to enable proactive issue detection and faster resolution.

  • Contribute to automation efforts for infrastructure provisioning, configuration, and deployment to improve operational efficiency and reliability.

  • Help identify and troubleshoot performance issues, monitor system health, and support defining and tracking SLIs/SLOs under senior guidance.

  • Participate in incident response activities, assist in documenting post-incident reviews, and help implement preventive measures to improve reliability.

  • Collaborate with cross-functional teams to promote SRE practices and continuous improvement in operations.

Who you will work with

  • Co-Founders and HODs

  • Engineering Teams

  • External customers/MNOs and vendors

What we are looking for

  • 2–5 years of experience in Site Reliability Engineering within the telecom or cloud infrastructure domain, focusing on ensuring high availability and reliability of critical business applications.

  • Support the implementation of SRE best practices, including incident management, monitoring, and automation, to improve system performance and resilience.

  • Assist in designing and maintaining observability solutions using tools such as Prometheus, Grafana, New Relic, and Dynatrace for proactive monitoring and alerting.

  • Participate in incident response and root cause analysis (RCA) activities, contributing to post-mortem reviews and documentation for continuous improvement.

  • Contribute to performance optimization, including capacity analysis, load testing, and tuning of system components under guidance from senior engineers.

  • Support automation initiatives for infrastructure and deployments using Terraform, Ansible, Helm, and Kubernetes, ensuring consistency and efficiency in delivery.

  • Work with AWS, GCP, or OCI environments, assisting in building and maintaining cloud-native and hybrid architectures.

  • Partner with cross-functional teams in development, operations, and security to promote a culture of reliability, scalability, and observability.

  • Hands-on experience with project management tools such as JIRA, and a solid understanding of product lifecycle and agile methodologies.

  • Strong analytical, troubleshooting, and problem-solving skills with a passion for learning and continuous improvement.

 
