Software Engineer II, Application SRE
Job Description
-
This role reports to a leader in Platform Engineering & Site Reliability Engineering.
-
Builds resilient cloud systems, automates operations, and ensures high reliability through robust monitoring and performance optimization.
-
Leads incident response and fosters a collaborative SRE culture. Drives team evolution and proactively prevents issues to maintain optimal service levels.
What you'll do
-
Assist in building and maintaining highly available and fault-tolerant applications.
-
Support the setup and maintenance of monitoring, logging, and alerting systems to enable proactive issue detection and faster resolution.
-
Contribute to automation efforts for infrastructure provisioning, configuration, and deployment to improve operational efficiency and reliability.
-
Help identify and troubleshoot performance issues, monitor system health, and support defining and tracking SLIs/SLOs under senior guidance.
-
Participate in incident response activities, assist in documenting post-incident reviews, and help implement preventive measures to improve reliability.
-
Collaborate with cross-functional teams to promote SRE practices and continuous improvement in operations.
Who you will work with
-
Co-Founders and Heads of Department (HODs)
-
Engineering Teams
-
External customers, mobile network operators (MNOs), and vendors
What we are looking for
-
2–5 years of experience in Site Reliability Engineering within the telecom or cloud infrastructure domain, focusing on ensuring high availability and reliability of critical business applications.
-
Support the implementation of SRE best practices, including incident management, monitoring, and automation, to improve system performance and resilience.
-
Assist in designing and maintaining observability solutions using tools such as Prometheus, Grafana, New Relic, and Dynatrace for proactive monitoring and alerting.
-
Participate in incident response and root cause analysis (RCA) activities, contributing to post-mortem reviews and documentation for continuous improvement.
-
Contribute to performance optimization, including capacity analysis, load testing, and tuning of system components under guidance from senior engineers.
-
Support automation initiatives for infrastructure and deployments using Terraform, Ansible, Helm, and Kubernetes, ensuring consistency and efficiency in delivery.
-
Work with AWS, GCP, or OCI environments, assisting in building and maintaining cloud-native and hybrid architectures.
-
Partner with cross-functional teams in development, operations, and security to promote a culture of reliability, scalability, and observability.
-
Hands-on experience with project management tools such as Jira, and a solid understanding of the product lifecycle and agile methodologies.
-
Strong analytical, troubleshooting, and problem-solving skills with a passion for learning and continuous improvement.