Site Reliability Engineer

Cognizant

Chennai 9 Years Exp Posted 491d ago

Job Description

Key Responsibilities:

· Design, implement, and/or refine Service Management processes (Monitoring, Incident, Problem, Capacity, Change & Release, and Service Level Management).

· Track system health, performance, and reliability via monitoring and observability platforms; implement proactive alerting mechanisms to detect anomalies and respond swiftly to incidents.

· Act as a point of escalation for complex incidents, collaborating with senior engineers and management to ensure effective resolution.

· Establish and enforce change control and release management processes to ensure smooth and controlled deployment of system changes.

· Conduct post-incident analyses to identify root causes and implement actions to prevent recurrence and improve system resilience.

· Perform regular system testing to identify vulnerabilities and validate disaster recovery plans.

· Partner with development teams to improve services through rigorous testing and release procedures.

· Participate in system design consulting, platform management, and capacity planning.

· Integrate reliability practices into CI/CD pipelines to automate testing, quality assurance, and deployment processes.

· Foster a culture of collaboration between development and operations teams, promoting shared ownership and accountability for system reliability.

· Create sustainable systems and services through automation and uplifts.

· Balance feature development speed and reliability with well-defined service-level objectives.

· Continuously evaluate and enhance system reliability, scalability and performance. Identify areas for improvement and implement solutions to optimize processes and reduce manual toil.

· Define, track, and monitor SLAs/SLOs to measure and improve system reliability.

· Collaborate with cross-functional teams to ensure scalable and adequate resource allocations and optimize cost efficiency.

Required skills and qualifications

· Bachelor’s degree (or equivalent) in computer science or related discipline

· Proven process definition and implementation experience, leveraging ITIL best practices

· ITIL V3 Intermediate / Expert certification at minimum (mandatory)

· Implementation experience with ITSM / ESM tools (e.g., ServiceNow, Remedy, JIRA)

· Strong DevSecOps skills with implementation experience – Foundation / Practitioner certification will be an advantage.

· Coding experience beyond simple scripts – Python, Java, C/C++ and JavaScript

· Knowledge of Linux/Unix systems administration and troubleshooting

· Knowledge of relational and NoSQL databases and distributed storage systems. Proficiency in database administration, query optimization, and data replication.

· Familiarity with incident management and collaboration tools such as JIRA, PagerDuty, Slack, or ServiceNow.

· Expertise in performance monitoring and analysis tools such as New Relic, AppDynamics, or Datadog.

· Familiarity with configuration management tools like Ansible, Puppet, or Chef

· Knowledge of observability platforms (e.g., Dynatrace, SolarWinds), monitoring systems (e.g., Prometheus, Nagios), and log management tools (e.g., ELK stack, Splunk).

· Strong analytical thinking and problem-solving abilities to identify patterns, troubleshoot issues, and propose effective solutions.

· Proactive approach to identifying problems, performance bottlenecks, and areas for improvement.

· Previous success in technical engineering
