Lead Site Reliability Engineer

McAfee

Bengaluru | 8 Years Exp | Posted 450d ago

About the Role:

Using AWS services such as EC2, RDS, VPC, and CloudWatch, with expertise in log query analysis and monitoring optimization.
Driving APM solutions through hands-on experience with Prometheus, Grafana, and scripting in PromQL to enhance automation capabilities.
Troubleshooting and debugging issues by analyzing CloudWatch logs and providing detailed insights through log metrics and trend analysis.
Optimizing service performance by leveraging AWS tools, analyzing cost utilization, and implementing efficient scaling strategies.
Managing Kubernetes-based setups, including deployment, configuration changes, and service lifecycle management, while collaborating with DevOps teams.
Overseeing seamless code rollouts, maintaining production integrity, and spearheading root cause analysis for persistent issues.
Fostering collaboration across global teams to enhance service availability, reliability, and scalability.

About you:

Over 8 years of experience in the web and e-commerce domain, with a strong focus on cloud hosting, primarily AWS.
Skilled in log analysis and troubleshooting, with expertise in leveraging observability platforms to trace issues thoroughly.
Proficient in Prometheus, Grafana, or similar APM tools, with hands-on ability to optimize and enhance monitoring capabilities.
Passionate about digging deep into technical issues and providing actionable insights for resolution.
Adept at driving troubleshooting calls and ensuring end-to-end traceability for complex problems.
Strong knowledge of cloud environments and monitoring frameworks to support robust and scalable solutions.