Lead Site Reliability Engineer
Cisco
Job Description
Responsibilities:
- Design, build, and optimize cloud and data infrastructure to ensure the high availability, reliability, and scalability of big-data and ML/AI systems that meet customer needs, applying SRE principles such as monitoring, alerting, error budgets, and fault analysis.
- Collaborate closely with cross-functional teams, including customers, development, product management, and security, to create secure, scalable solutions that support ML/AI workloads and improve operational efficiency through automation.
- Troubleshoot complex technical problems in production environments, perform root cause analyses, and contribute to continuous improvement through postmortem reviews and proactive performance optimization.
- Lead the architectural vision and shape the team’s technical strategy and roadmap, balancing immediate needs with long-term goals and driving innovation.
- Serve as a mentor and technical leader, guiding teams and fostering a culture of engineering and operational excellence by sharing your knowledge and experience.
- Engage with customers and stakeholders to understand their use cases and feedback, translate these into actionable insights, and influence stakeholders at all levels.
- Use strong programming skills to integrate software and systems engineering, building core data platform capabilities and automation that meet enterprise customer needs and roadmap objectives.
- Develop strategic roadmaps, processes, plans, and infrastructure to deploy new software components efficiently at enterprise scale while enforcing engineering best practices.
Minimum Qualifications
- Ability to design and implement scalable, well-tested solutions, with a focus on operational efficiency.
- Strong hands-on cloud experience, preferably AWS.
- Infrastructure as Code expertise, especially Terraform and Kubernetes/EKS.
- Experience building and managing Cloud, Big Data, and ML/AI infrastructure, including hands-on expertise with Hadoop ecosystem components and related technologies such as EMR, Airflow, Spark, PySpark, AWS SageMaker, AWS Bedrock, Gobblin, Kafka, Iceberg, ORC, MapReduce, YARN, HDFS, Hive, and Hudi.
- Ability to write high-quality code in Python, Go, or equivalent programming languages.
Preferred Qualifications
- Solid understanding of Unix/Linux systems, the kernel, system libraries, file systems, and client-server protocols.
- Experience architecting software and infrastructure at scale, with a sense of ownership and accountability.
- Experience with observability tools including Prometheus (Alertmanager), Grafana, Thanos, CloudWatch, OpenTelemetry, and the ELK stack.
- Certifications: CKA (Certified Kubernetes Administrator), CKAD (Certified Kubernetes Application Developer), AWS Certified DevOps Engineer, or equivalent certifications in cloud and security domains.