Site Reliability Engineer
Apple
Job Description
- Design, develop, and automate: Build tools, frameworks, and solutions to improve reliability, scalability, and efficiency across large-scale distributed data platform systems.
- Monitor and maintain: Implement advanced monitoring and alerting for on-prem and cloud workloads.
- Troubleshoot and solve: Support critical applications including analytics, reporting, and AI/ML apps. Respond to and resolve complex production incidents, and perform root cause analysis.
- Collaborate: Work closely with development and operations teams to integrate reliability best practices throughout the software lifecycle.
- Optimize: Proactively recommend improvements in architecture, deployment, and operations for distributed systems.
Minimum Qualifications
- Experience: 5+ years in site reliability engineering or software development roles.
- Programming: Proficient in at least one of Python, Golang, or Java.
- Skilled at coding for distributed systems and developing resilient data pipelines.
- Cloud Platforms: Hands-on experience with at least one major cloud platform (AWS, Azure, or Google Cloud Platform).
Preferred Qualifications
- Expertise in designing, building, and operating critical, large-scale distributed systems with a focus on low latency, fault-tolerance, and high availability.
- Contributions to open source projects are a plus.
- Experience with multiple public cloud infrastructures, managing multi-tenant Kubernetes clusters at scale, and debugging Kubernetes/Spark issues.
- Experience with workflow and data pipeline orchestration tools (e.g., Airflow, dbt).
- Understanding of data modeling and data warehousing concepts.
- Familiarity with the AI/ML stack, including GPUs, MLflow, or Large Language Models (LLMs).
- Data Structures & Algorithms: Strong foundation and application experience.
- Distributed Systems: Solid understanding and hands-on experience managing at least one distributed system (e.g., Kafka, Spark, or Flink).
- Solid understanding of software engineering best practices, including the full development lifecycle, secure coding, and experience building reusable frameworks or libraries.
- Problem Solving: Demonstrated ability to independently troubleshoot and resolve complex technical issues.
- Creative Thinking: A track record of proposing and implementing innovative solutions to technical challenges.