SRE Engineer
mccain
Job Description
JOB RESPONSIBILITIES:
- Work with stakeholders such as product owners and Engineering to define service level objectives (SLOs) for system operations.
- Track performance against SLOs in partnership with monitoring teams or other stakeholders, and ensure systems continue to meet SLOs over time.
- Create dashboards and reports to communicate key metrics.
- Create software to improve performance, scalability, and stability of systems.
- Collaborate with development teams to promote the concept of reliability engineering during all phases of the software development lifecycle to detect and correct performance issues and meet availability goals.
- Design, code, test, and deliver infrastructure software to automate manual operational work (i.e., “toil”).
- Participate in operational support and on-call rotation shifts for supported systems and products.
- Conduct blameless post mortems to troubleshoot priority incidents.
- Perform analytics on previous incidents to understand root causes and better predict and prevent future issues.
- Use automation to reduce the probability and/or impact of problem recurrence.
- Identify, evaluate, and recommend monitoring tools and diagnostic techniques to improve system observability.
- Participate in system design consulting, platform management, capacity planning and launch reviews.
- Collaborate and share lessons learned regarding performance and reliability issues with all stakeholders including developers, other SREs, operations teams, and project management teams.
- Participate in communities of practice to share knowledge and foster continuous improvement.
- Remain current on site reliability engineering methods and trends such as observability-driven development and chaos engineering.
- Drive continuous improvement in software quality and infrastructure reliability and resilience.
- Oversee, design, implement, and manage DevOps capabilities using continuous integration/continuous delivery toolsets and automation.
- The SRE engineer will focus on Application Performance Monitoring (APM), including design, solutioning, proofs of concept (POC), and profiling and tuning of application compute and data nodes and resources. Key duties of this role include:
- Assist in defining SRE and observability architecture and design
- Analyze and implement new features of the SRE and observability platform
- Full stack monitoring across all layers (Infrastructure/Network/Database/Application/Services/Third Party)
- Provide hands-on technical leadership in the selection and implementation of commercial and open-source monitoring tools
- Implement an SRE-driven automated incident workflow: automated incident detection -> automated engagement -> triage/mitigation -> RCA/postmortems -> problem task remediation
- AI-driven correlation, de-duplication, noise reduction, and auto-remediation
- Provide weekly monitoring and alert analysis and continuous improvement
- Create a model of the run-time environment (discovery)
- Profile the performance and behavior of user-defined transactions
- Establish performance metrics for each technical component of the applications/systems (web server, app server, database, etc.)
- Application performance management database
- APM tool Administration and Support
- Monitoring Tool design and implementation
- APM Setup/Usage policies and guidelines
- Capacity Planning and monitoring
- Monitor selected application performance
- Report vital statistics of application performance in production
- Make recommendations for improvements in collaboration with the Service Desk
- Make recommendations for adjustments to runtime resources to improve overall performance profile
KEY QUALIFICATION & EXPERIENCES:
- Strong problem solving and analytical skills.
- Strong interpersonal and written and verbal communication skills.
- Highly adaptable to changing circumstances. Interest in continuously learning new skills and technologies.
- Experience with programming and scripting languages (e.g. Java, C#, C++, Python, Bash, PowerShell).
- Experience with incident and response management.
- Experience with Agile and DevOps development methodologies.
- Experience with container technologies and supporting tools (e.g. Docker Swarm, Podman, Kubernetes, Mesos).
- Experience working in cloud ecosystems (Microsoft Azure, AWS, Google Cloud Platform).
- Experience with monitoring and observability tools (e.g. Splunk, Cloudwatch, AppDynamics, NewRelic, ELK, Prometheus, OpenTeleme