Lead Site Reliability Engineer - Tech Ops
Morningstar
Job Description
Responsibilities
- Oversee and enhance the reliability and performance of production applications and infrastructure.
- Lead and mentor a team of Site Reliability Engineers, fostering a culture of accountability and continuous improvement.
- Design and manage AWS infrastructure using services such as EC2, S3, Lambda, and RDS, following best practices in cloud architecture and security.
- Drive automation of deployment, monitoring, and incident response processes using CI/CD tools and methodologies.
- Establish and implement monitoring and alerting frameworks to proactively manage system health and uptime.
- Analyze operational metrics and AWS usage to identify opportunities for cost optimization and performance enhancement.
- Collaborate closely with development, security, and operations teams to improve workflows and support a DevOps culture.
- Develop and maintain documentation for systems architecture and operational procedures.
- Participate in strategic planning to align technology initiatives with business goals.
Requirements
- Bachelor’s degree in Computer Science, Information Technology, or a related field.
- Minimum of 7 years of experience in Site Reliability Engineering, DevOps, or a related field, with at least 3 years in a leadership or mentoring role.
- Extensive experience with AWS services and cloud infrastructure management.
- Strong proficiency in scripting and automation tools (e.g., Python, Bash, Terraform).
- Hands-on experience with containerization and container orchestration tools (e.g., Docker, Kubernetes).
- Expertise in setting up and managing monitoring and logging tools (e.g., CloudWatch, ELK stack, Prometheus).
- Deep understanding of networking and security principles and best practices.
- Proven track record in technical project management and implementing operational improvements.
- AWS certifications (e.g., AWS Certified Solutions Architect, AWS Certified DevOps Engineer).
- Familiarity with configuration management tools (e.g., Ansible, Chef, Puppet).
- Experience in agile software development methodologies and practices.
- Strong analytical and problem-solving skills, with a focus on proactive solutions.