Site Reliability Engineer (SRE) with AWS, Oracle, and Automation Expertise

Location: Bengaluru (Bangalore) | Experience: 8 years | Posted 97 days ago

Preferred:
- Knowledge of OFSAA, Oracle Rules Engine, or ML-enabled platform support (e.g., TRACE)
- Infrastructure-as-Code tools such as CloudFormation or Terraform
- Experience supporting high-performance Oracle environments (performance tuning, RAC, RMAN)
- Exposure to cloud-native and containerized environments (Kubernetes, Docker)

Overall Responsibilities

Improve the reliability, availability, and recoverability of Financial Crime and Transaction Monitoring platforms.
Define, monitor, and manage SLIs/SLOs to proactively ensure service health and detect anomalies.
Provide Level 1 and Level 2 support for AWS and Oracle-based platforms, handling incident resolution and root cause analysis.
Build and sustain automation solutions for monitoring, logging, alerting, and operational workflows to reduce manual toil.
Lead incident response activities, conduct post-incident reviews, and implement preventative measures.
Develop, operate, and enhance CI/CD pipelines and infrastructure automation across environments.
Collaborate with engineering teams to design scalable, resilient, and secure systems; participate in capacity planning and performance tuning.
Support deployment, patching, and configuration changes, ensuring compliance with policies and standards.
Maintain comprehensive documentation of operational procedures, configurations, and incident resolutions.
Lead continuous process improvements to enhance system reliability, operational efficiency, and compliance.

Technical Skills (By Category)

Systems & Support (Essential):
- Enterprise-level system operation and support for AWS and Oracle environments
- Linux system administration and troubleshooting
- Incident management and escalation procedures

Monitoring & Automation (Essential):
- Monitoring and alerting using Prometheus, Grafana, ELK/EFK, and CloudWatch
- Automation scripting with Python and Shell for operational tasks and event handling

Cloud & Infrastructure (Preferred):
- Cloud deployment, scaling, and management (AWS, Azure, GCP)
- Infrastructure-as-Code (Terraform, CloudFormation)

Databases/Data Management (Essential):
- Oracle database management, performance tuning, and recovery
- Data extraction and validation for high-volume transactional data

Development Tools & Methodologies (Essential):
- CI/CD pipelines with Jenkins, GitLab CI, or AWS CodePipeline
- Version control with Git

Experience Requirements