Sr Engineer Site Reliability
empower
Job Description
Job Duties:
• Responsible for a diverse set of assignments, spanning a wide range of technologies and a high level of complexity
• Works independently on assignments
• Work is reviewed for soundness of judgment, overall accuracy and adequacy
• Guide and implement monitoring in distributed systems, establishing service level indicators (SLIs) along the way
• Lead the design and instrumentation of how metrics are gathered from multiple sources
• Improve the availability and performance of systems and services
• Spread Site Reliability principles and knowledge across the organization
• Play a lead role in ensuring the high availability, resilience, and scalability of containerized applications in production
• Manage observability within Kubernetes, specifically EKS
• Measure the effectiveness and business impact of the SRE practice
• Collaborate with development teams to support releases and create highly scalable, resilient, and maintainable services
• Work in a GitOps driven environment
Qualifications:
• Expert understanding of one or more observability suites and APM tools such as Datadog, AppDynamics, New Relic, etc.
• Expert understanding of maintaining high availability and resiliency within AWS infrastructure components, including EKS, EC2, RDS, S3, VPC, and others
• Strong programming skills in one or more languages such as shell, Go, Python, etc.
• Expertise with Infrastructure as Code frameworks such as Terraform and CloudFormation
• Proficiency with containerization and orchestration technologies such as Docker and Kubernetes
• Solid experience setting up DevOps pipelines in Jenkins
• Experience using GitHub Copilot preferred