Site Reliability Engineer Lead
Bank of America
Job Description
Responsibilities for job opportunity
- As part of the SRE team, perform full-stack triage of alerts and engage other engineers to identify the root cause of application performance and stability issues.
- Work with stakeholders such as product owners to define service level objectives (SLOs) for application features and services.
- Track performance against SLOs in partnership with development teams or other stakeholders, and ensure systems continue to meet SLOs over time.
- Design and develop dashboards and reports to communicate key metrics.
- Identify opportunities to improve alerting posture and create/update alerts accordingly.
- Work closely with the Engineering team to understand the application architecture, perform single-point-of-failure analysis, and create scenarios for testing the resiliency of the application.
- Create or derive non-functional requirements (NFR) and workload models, and ensure performance and resiliency are considered early in the SDLC.
- Execute performance and chaos tests, and analyze the results using APM and other tools to identify performance and stability issues.
- Document findings, analysis, and results, and present them to stakeholders.
- Perform analytics on previous incidents to understand root causes and use automation to reduce the probability and/or impact of problem recurrence.
- Demonstrate proficiency with DevOps tools, JIRA, ServiceNow, and MS Project, and perform tasks using these tools.
Requirements
Education: B.E. / B.Tech / M.E. / M.Tech / MCA / M.Sc. (IT/Computer Science)
Certifications If Any: NA
Experience Range: 8 to 10 years of information technology experience, with 5+ years working on a DevOps, SRE, or performance engineering team.
Foundational Skills
- 8+ years of information technology experience, with 5+ years working on a DevOps, SRE, or performance engineering team
- Experience triaging production issues using APM tools such as Dynatrace, AppDynamics, or New Relic, and log aggregation tools such as Splunk or ELK
- Strong experience in Java and front-end development (UI and UX) with React JS or Angular
- Experience with Apache/Tomcat middleware and Java/RESTful services frameworks (MuleSoft is a plus)
- Strong skills in Python, UNIX, Wintel, and Perl/shell scripting
- Strong experience working with CI/CD tools: Bitbucket, JFrog Artifactory, Jenkins, Terraform/Packer, Ansible
- Knowledge of cloud, container, and Kubernetes technologies
- Experience with SRE concepts such as SLIs, SLOs, and error budgets, and working with developers to track and improve them on a continuous basis.
- Must be able to provide oral and written discussion of analytical findings using narrative and graphic forms.
- Must be able to use qualitative and quantitative analytical skills to assess the effectiveness of the operations.
- Ability to identify symptoms that indicate a need for process improvement.
- Analytical, investigative, and organizational skills
- Communication skills, including the ability to craft content for executive-level presentations.
Desired Skills
- Great soft skills: people and communication skills are essential.
- Good proficiency in system, network, security and database operations, protocols, and industry standard technologies.
- Experience with tools such as Tanium, Artifactory, BMC TrueSight Orchestration
- Experience with command-line interfaces (CLI), third-party APIs, and integration.
- Experience in server administration with Red Hat Enterprise Linux and Windows Server
- Good understanding of developing fault-tolerant solutions and knowledge of horizontal scaling and resiliency/high availability (HA).
- Ability to juggle competing priorities and adapt to changes in project scope.
- College degree or higher, or equivalent work experience