Lead Site Reliability Engineer
McAfee
Job Description
About the Role:
- Using AWS services such as EC2, RDS, VPC, and CloudWatch, with expertise in log query analysis and monitoring optimization.
- Driving APM solutions through hands-on experience with Prometheus, Grafana, and scripting in PromQL to enhance automation capabilities.
- Troubleshooting and debugging issues by analyzing CloudWatch logs and providing detailed insights through log metrics and trend analysis.
- Optimizing service performance by leveraging AWS tools, analyzing cost utilization, and implementing efficient scaling strategies.
- Managing Kubernetes-based setups, including deployment, configuration changes, and service lifecycle management, while collaborating with DevOps teams.
- Overseeing seamless code rollouts, maintaining production integrity, and spearheading root cause analysis for persistent issues.
- Fostering collaboration across global teams to enhance service availability, reliability, and scalability.
About You:
- Over 8 years of experience in the web and e-commerce domain, with a strong focus on cloud hosting, primarily AWS.
- Skilled in log analysis and troubleshooting, with expertise in leveraging observability platforms to trace issues thoroughly.
- Proficient in Prometheus, Grafana, or similar APM tools, with hands-on ability to optimize and enhance monitoring capabilities.
- Passionate about digging deep into technical issues and providing actionable insights for resolution.
- Adept at driving troubleshooting calls and ensuring end-to-end traceability for complex problems.
- Strong knowledge of cloud environments and monitoring frameworks to support robust and scalable solutions.