Architect - Site Reliability Engineer - Compute
pepsicojobs
Job Description
Responsibilities
- Monitor and Respond: Proactively monitor compute infrastructure health and performance, identify potential issues, and respond quickly to incidents.
- Automate and Optimize: Develop and implement automation tools to streamline compute operations, improve efficiency, and reduce manual intervention.
- Collaborate and Troubleshoot: Work closely with software engineering, platform, and other teams to troubleshoot complex compute problems and implement solutions.
- Capacity Planning: Analyze compute resource usage and trends to forecast capacity needs and ensure sufficient resources are available to meet demand.
- Document and Communicate: Maintain accurate and up-to-date documentation of compute configurations, procedures, and incidents.
- Participate in On-Call Rotation: Provide 24/7 on-call support for critical compute incidents.
Qualifications
- Experience: 9+ years in systems engineering or operations, with a focus on SRE principles and practices.
- Technical Skills: Deep understanding of operating systems (Linux, Windows), virtualization technologies, storage and backup systems, and container and orchestration platforms (Docker, Kubernetes).
- Problem-Solving: Strong analytical and problem-solving skills, with the ability to identify and resolve complex compute issues.
- Communication: Excellent written and verbal communication skills, with the ability to collaborate effectively with cross-functional teams.
- Adaptability: Ability to thrive in a fast-paced, dynamic environment, and adapt to changing priorities.