Site Reliability Engineer (SRE)
Siemens
Job Description
You’ll make a difference by:
- Working in an SRE L2 support role focused on maintaining and improving the reliability, availability, and performance of AWS-based infrastructure and applications. The role involves monitoring, incident management, troubleshooting, and proactive support of cloud services to ensure seamless operations and scalability.
- Managing and resolving incidents by actively monitoring infrastructure and application health in AWS.
- Handling and resolving L2 incidents related to AWS services (EC2, RDS, S3, Lambda, EKS, etc.) with root cause analysis.
- Providing timely communication to customers during outages and SLA breaches.
- Setting up and fine-tuning AWS monitoring and observability tools to detect issues early.
- Creating alarms, dashboards, and reports in CloudWatch for compute, storage, and networking services.
- Using the AWS Health Dashboard to proactively identify service disruptions.
- Managing and analyzing logs using tools like AWS CloudWatch Logs, CloudTrail, and third-party solutions (e.g., ELK Stack, Datadog, Splunk).
- Identifying anomalies and trends to detect and prevent recurring issues.
- Troubleshooting and resolving issues related to EC2 instances, Auto Scaling groups, and load balancers (ELB/ALB/NLB).
- Monitoring server health, resource utilization, and performance bottlenecks.
- Supporting containerized workloads running on Amazon ECS and EKS.
- Debugging Kubernetes pods, clusters, and container runtime issues.
- Resolving issues with Amazon S3, EBS, and EFS, ensuring data integrity and access permissions.
- Monitoring RDS (PostgreSQL, Aurora) performance, replication, and scaling.
- Debugging AWS Transit Gateway, VPN, and Direct Connect connectivity problems.
- Ensuring proper IAM policies and roles for secure access management.
- Supporting maintenance activities such as patching EC2 instances, upgrading container runtimes, and managing system updates.
- Participating in the automation of repetitive tasks using scripts.
- Contributing to incident recovery processes and post-mortems to prevent recurrence.
- Providing support for failed deployments and ensuring quick recovery.
- Monitoring AWS Backup jobs and ensuring regular backups for critical infrastructure.
- Validating DR (Disaster Recovery) plans and participating in recovery testing exercises.
- Creating and maintaining operational runbooks, SOPs (Standard Operating Procedures), and knowledge base articles for common AWS issues.
You’d describe yourself as:
- An experienced professional with 6 to 9 years of relevant experience in SRE, DevOps, or Cloud Infrastructure Support, with strong hands-on expertise in AWS services.
- Having hands-on experience with monitoring tools (e.g., Prometheus, Datadog).
- Possessing knowledge of Linux/Unix operating systems and basic scripting skills (Python, GitLab actions).
- Having experience working with cloud platforms (AWS, Azure, or GCP).
- Being familiar with containers and orchestration (Docker, Kubernetes, Helm charts), CI/CD pipelines, and Argo CD for implementing GitOps workflows and automated deployments of containerized applications.
- Having experience with Datadog and AWS services including EC2, Lambda, ECS/EKS, RDS, VPC, Route 53, ELB, S3, EFS, and Glacier.
- Strong analytical skills to resolve production incidents effectively.
- Basic understanding of networking concepts (DNS, Load Balancers, Firewalls).
- Good communication and interpersonal skills for incident communication and addressing partner concerns.
- Experience with alerting systems (e.g., PagerDuty) and incident tracking tools (JIRA, ServiceNow).
- Being a proactive problem-solver with a sense of urgency.
- Strong organizational skills to prioritize tasks efficiently.
- Ability to work effectively in high-pressure environments.
- A teammate able to collaborate across teams and shift ownership as required.