SRE Engineer

mccain

New Delhi 5 Years Exp Posted 585d ago

Job Description

JOB RESPONSIBILITIES:

Work with stakeholders such as product owners and Engineering to define service level objectives (SLOs) for system operations.
Track performance against SLOs in partnership with monitoring teams or other stakeholders, and ensure systems continue to meet SLOs over time.
Create dashboards and reports to communicate key metrics.
Create software to improve performance, scalability, and stability of systems.
Collaborate with development teams to promote the concept of reliability engineering during all phases of the software development lifecycle to detect and correct performance issues and meet availability goals.
Design, code, test, and deliver infrastructure software to automate manual operational work (i.e., “toil”).
Participate in operational support and on-call rotation shifts for supported systems and products.
Conduct blameless post mortems to troubleshoot priority incidents.
Perform analytics on previous incidents to understand root causes and better predict and prevent future issues.
Use automation to reduce the probability and/or impact of problem recurrence.
Identify, evaluate, and recommend monitoring tools and diagnostic techniques to improve system observability.
Participate in system design consulting, platform management, capacity planning and launch reviews.
Collaborate and share lessons learned regarding performance and reliability issues with all stakeholders including developers, other SREs, operations teams, and project management teams.
Participate in communities of practice to share knowledge and foster continuous improvement.
Remain current on site reliability engineering methods and trends such as observability-driven development and chaos engineering.
Drive continuous improvement in software quality and infrastructure reliability and resilience.
Oversee, design, implement, and manage DevOps capabilities using continuous integration/continuous delivery toolsets and automation.
The SRE engineer will focus on Application Performance Monitoring (APM), including design, solutioning, proofs of concept (POC), and profiling and tuning of application compute and data nodes and resources. Some key duties of this role are:
Assist in defining the SRE and observability architecture and design.
Analyze and implement new features of the SRE and observability platform.
Full stack monitoring across all layers (Infrastructure/Network/Database/Application/Services/Third Party)
Provide hands-on technical leadership in the selection and implementation of commercial and open-source monitoring tools.
Implement an SRE-driven incident workflow: automated incident detection -> automated engagement -> triage/mitigation -> RCA/postmortems -> problem task remediation.
Apply AI-driven correlation, de-duplication, noise reduction, and auto-remediation.
Provide weekly monitoring and alert analysis and continuous improvement
Create a model of the run-time environment (discovery)
Profile the performance and behavior of user-defined transactions
Establish performance metrics for each technical component of the applications/systems (web server, app server, database, etc.).
Application performance management database
APM tool Administration and Support
Monitoring Tool design and implementation
APM Setup/Usage policies and guidelines
Capacity Planning and monitoring
Monitor selected application performance
Report vital statistics of application performance in production
Make recommendations for improvements with Service Desk
Make recommendations for adjustments to runtime resources to improve overall performance profile

KEY QUALIFICATION & EXPERIENCES:

Strong problem solving and analytical skills.
Strong interpersonal and written and verbal communication skills.
Highly adaptable to changing circumstances. Interest in continuously learning new skills and technologies.
Experience with programming and scripting languages (e.g. Java, C#, C++, Python, Bash, PowerShell).
Experience with incident and response management.
Experience with Agile and DevOps development methodologies.
Experience with container technologies and supporting tools (e.g. Docker Swarm, Podman, Kubernetes, Mesos).
Experience working in cloud ecosystems (Microsoft Azure, AWS, Google Cloud Platform).
Experience with monitoring and observability tools (e.g. Splunk, Cloudwatch, AppDynamics, NewRelic, ELK, Prometheus, OpenTeleme
