Site Reliability Engineer (SRE) with AWS, Oracle, and Automation Expertise
Job Description
- Strong understanding and hands-on experience managing production-grade systems with high reliability and availability requirements
- Expertise in SRE principles, monitoring, logging, and alerting, including defining and tuning SLOs/SLAs
- Proficiency with AWS services including EC2, S3, RDS, VPC, IAM, and CloudWatch (latest versions or equivalents)
- Linux system administration and troubleshooting skills for enterprise environments
- Experience with Oracle databases, including performance tuning, RAC, or RMAN in large data environments
- Scripting skills in Python and Shell (Bash/sh) for operational automation
- Experience with monitoring tools such as Prometheus, Grafana, ELK/EFK, and PagerDuty
- Familiarity with CI/CD tools like Jenkins, GitLab CI, or AWS CodePipeline
- Preferred:
  - Knowledge of OFSAA, Oracle Rules Engine, or ML-enabled platform support (e.g., TRACE)
  - Infrastructure-as-Code tools such as CloudFormation or Terraform
  - Experience supporting high-performance Oracle environments (performance tuning, RAC, RMAN)
  - Exposure to cloud-native and containerized environments (Kubernetes, Docker)
Overall Responsibilities
- Improve the reliability, availability, and recoverability of Financial Crime and Transaction Monitoring platforms.
- Define, monitor, and manage SLIs/SLOs to proactively ensure service health and detect anomalies.
- Provide Level 1 and Level 2 support for AWS and Oracle-based platforms, handling incident resolution and root cause analysis.
- Build and sustain automation solutions for monitoring, logging, alerting, and operational workflows to reduce manual toil.
- Lead incident response activities, conduct post-incident reviews, and implement preventative measures.
- Develop, operate, and enhance CI/CD pipelines and infrastructure automation across environments.
- Collaborate with engineering teams to design scalable, resilient, and secure systems; participate in capacity planning and performance tuning.
- Support deployment, patching, and configuration changes, ensuring compliance with policies and standards.
- Maintain comprehensive documentation of operational procedures, configurations, and incident resolutions.
- Lead continuous process improvements to enhance system reliability, operational efficiency, and compliance adherence.
Technical Skills (By Category)
- Systems & Support (Essential):
  - Enterprise-level system operation and support for AWS and Oracle environments
  - Linux system administration and troubleshooting
  - Incident management and escalation procedures
- Monitoring & Automation (Essential):
  - Monitoring and alerting using Prometheus, Grafana, ELK/EFK, CloudWatch
  - Automation scripting with Python and Shell for operational tasks and event handling
- Cloud & Infrastructure (Preferred):
  - Cloud deployment, scaling, and management (AWS, Azure, GCP)
  - Infrastructure-as-Code (Terraform, CloudFormation)
- Databases/Data Management (Essential):
  - Oracle database management, performance tuning, and recovery
  - Data extraction and validation for high-volume transactional data
- Development Tools & Methodologies (Essential):
  - Jenkins, GitLab CI, AWS CodePipeline for CI/CD pipelines
  - Version control with Git
Experience Requirements
- Minimum of 8 years supporting high-availability, mission-critical ent