Site Reliability Engineer

Apple

Bengaluru (Bangalore) | 5 Years Exp | Posted 47d ago

Job Description

  • Design, develop, and automate: Build tools, frameworks, and solutions to improve reliability, scalability, and efficiency across large-scale distributed data platform systems.
  • Monitor and maintain: Implement advanced monitoring and alerting for on-prem and cloud workloads.
  • Troubleshoot and solve: Support critical applications including analytics, reporting, and AI/ML apps. Respond to and resolve complex production incidents, and perform root cause analysis.
  • Collaborate: Work closely with development and operations teams to integrate reliability best practices throughout the software lifecycle.
  • Optimize: Proactively recommend improvements in architecture, deployment, and operations for distributed systems.

Minimum Qualifications

  • Experience: 5+ years in site reliability engineering or software development roles.
  • Programming: Proficient in at least one of Python, Golang, or Java.
  • Skilled at coding for distributed systems and developing resilient data pipelines.
  • Cloud Platforms: Hands-on experience with at least one major cloud platform (AWS, Azure, or Google Cloud Platform).

Preferred Qualifications

  • Expertise in designing, building, and operating critical, large-scale distributed systems with a focus on low latency, fault-tolerance, and high availability.
  • Experience contributing to open source projects is a plus.
  • Experience with multiple public cloud infrastructures, managing multi-tenant Kubernetes clusters at scale, and debugging Kubernetes/Spark issues.
  • Experience with workflow and data pipeline orchestration tools (e.g., Airflow, dbt).
  • Understanding of data modeling and data warehousing concepts.
  • Familiarity with the AI/ML stack, including GPUs, MLflow, or Large Language Models (LLMs).
  • Data Structures & Algorithms: Strong foundation and application experience.
  • Distributed Systems: Solid understanding and hands-on experience managing at least one distributed system (e.g., Kafka, Spark, or Flink).
  • Solid understanding of software engineering best practices, including the full development lifecycle, secure coding, and experience building reusable frameworks or libraries.
  • Problem Solving: Demonstrated ability to independently troubleshoot and resolve complex technical issues.
  • Creative Thinking: A track record of proposing and implementing innovative solutions to technical challenges.
