Site Reliability Engineer (SRE) with AWS, Oracle, and Automation Expertise


Bengaluru (Bangalore), 8 Years Exp, Posted 47d ago

Job Description

    • Strong understanding and hands-on experience managing production-grade systems with high reliability and availability requirements

    • Expertise in SRE principles, monitoring, logging, and alerting, including defining and tuning SLOs and SLAs

    • Proficiency with AWS services including EC2, S3, RDS, VPC, IAM, and CloudWatch (latest versions or equivalents)

    • Linux system administration and troubleshooting skills for enterprise environments

    • Experience with Oracle databases, including performance tuning, RAC, or RMAN in large data environments

    • Automation scripting skills using Python and Shell (Bash/sh) for operational automation

    • Experience with monitoring tools such as Prometheus, Grafana, ELK/EFK, and PagerDuty

    • Familiarity with CI/CD tools like Jenkins, GitLab CI, or AWS CodePipeline

     

 

  • Preferred:

    • Knowledge of OFSAA, Oracle Rules Engine, or ML-enabled platform support (e.g., TRACE)

    • Infrastructure-as-Code tools such as CloudFormation or Terraform

    • Experience supporting high-performance Oracle environments (performance tuning, RAC, RMAN)

    • Exposure to cloud-native and containerized environments (Kubernetes, Docker)

 

 

Overall Responsibilities

  • Improve the reliability, availability, and recoverability of Financial Crime and Transaction Monitoring platforms.

  • Define, monitor, and manage SLIs/SLOs to proactively ensure service health and detect anomalies.

  • Provide Level 1 and Level 2 support for AWS and Oracle-based platforms, handling incident resolution and root cause analysis.

  • Build and sustain automation solutions for monitoring, logging, alerting, and operational workflows to reduce manual toil.

  • Lead incident response activities, conduct post-incident reviews, and implement preventative measures.

  • Develop, operate, and enhance CI/CD pipelines and infrastructure automation across environments.

  • Collaborate with engineering teams to design scalable, resilient, and secure systems; participate in capacity planning and performance tuning.

  • Support deployment, patching, and configuration changes, ensuring compliance with policies and standards.

  • Maintain comprehensive documentation of operational procedures, configurations, and incident resolutions.

  • Lead continuous process improvements to enhance system reliability, operational efficiency, and compliance adherence.

 

 

Technical Skills (By Category)

 

  • Systems & Support (Essential):

    • Enterprise-level system operation and support for AWS and Oracle environments

    • Linux system administration and troubleshooting

    • Incident management and escalation procedures

     

 

  • Monitoring & Automation (Essential):

    • Monitoring and alerting using Prometheus, Grafana, ELK/EFK, and CloudWatch

    • Automation scripting with Python and Shell for operational tasks and event handling

     

 

  • Cloud & Infrastructure (Preferred):

    • Cloud deployment, scaling, and management (AWS, Azure, GCP)

    • Infrastructure-as-Code (Terraform, CloudFormation)

     

 

  • Databases/Data Management (Essential):

    • Oracle database management, performance tuning, and recovery

    • Data extraction and validation for high-volume transactional data

     

 

  • Development Tools & Methodologies (Essential):

    • Jenkins, GitLab CI, AWS CodePipeline for CI/CD pipelines

    • Version control with Git

 

 

Experience Requirements

  • Minimum of 8 years supporting high-availability, mission-critical ent
