Site Reliability Engineer (SRE)
pepsicojobs
Job Description
Key Responsibilities:
- Lead SRE operations across distributed teams, ensuring system reliability, scalability, and performance.
- Design and implement robust monitoring, alerting, and observability frameworks.
- Lead Scrum ceremonies.
- Manage and optimize Active Directory (AD) group structures and access controls.
- Collaborate with data engineering teams to support Databricks environments.
- Contribute to architectural discussions and decisions for high-availability systems.
- Drive incident response, root cause analysis, and continuous improvement initiatives.
- Integrate and manage workflows using Clarity PPM and ServiceNow for change, incident, and problem management.
- Actively participate in Scrum ceremonies (daily stand-ups, sprint planning, reviews, retrospectives).
- Collaborate with Product Owners and Scrum Masters to ensure timely, high-quality delivery.
Qualifications
Education:
- Bachelor’s or Master’s degree in Computer Science, Information Systems, Business Analytics, or a related field.
Experience:
- 6+ years of experience in SRE, DevOps, or Infrastructure Engineering roles.
- Strong analytical thinking and troubleshooting skills.
- Hands-on experience with:
  - Active Directory (AD): group policy management, access provisioning.
  - Databricks: cluster management, job orchestration, performance tuning.
  - Architecture: designing scalable, fault-tolerant systems.
  - Clarity PPM: project tracking, resource planning.
  - ServiceNow: incident/change/problem management workflows.
- Proficiency in monitoring tools (e.g., Prometheus, Grafana, Datadog).
- Experience with CI/CD pipelines and infrastructure as code (Terraform, Ansible).
- Familiarity with cloud platforms (Azure, AWS, or GCP).
- Strong scripting skills (Python, Bash, PowerShell).
- Solid understanding of Agile/Scrum methodologies and tools like Jira or Azure DevOps.