Site Reliability Engineer (SRE)

Siemens

Bangalore 6 Years Exp Posted 464d ago

Job Description

You’ll make a difference by:

- Working in an SRE L2 support role focused on maintaining and improving the reliability, availability, and performance of AWS-based infrastructure and applications. This role involves monitoring, incident management, fixing issues, and proactive support of cloud services to ensure seamless operations and scalability.
- Managing and resolving incidents by actively monitoring infrastructure and application health in AWS.
- Handling and resolving L2 incidents related to AWS services (EC2, RDS, S3, Lambda, EKS, etc.) with root cause analysis.
- Providing timely communication to customers during outages and SLA breaches.
- Setting up and fine-tuning AWS monitoring and observability tools to detect issues early.
- Creating alarms, dashboards, and reports in CloudWatch for compute, storage, and networking services.
- Using AWS Health Dashboard to proactively identify service disruptions.
- Managing and analyzing logs using tools like AWS CloudWatch Logs, CloudTrail, and third-party solutions (e.g., ELK Stack, Datadog, Splunk).
- Identifying anomalies and trends to detect and prevent recurring issues.
- Tackling and resolving issues related to EC2 instances, Auto Scaling groups, and Load Balancers (ELB/ALB/NLB).
- Supervising server health, resource utilization, and performance bottlenecks.
- Supporting containerized workloads running on Amazon ECS and EKS.
- Debugging Kubernetes pods, clusters, and container runtime issues.
- Resolving issues with Amazon S3, EBS, and EFS, ensuring data integrity and access permissions.
- Monitoring RDS (PostgreSQL, Aurora) performance, replication, and scaling.
- Debugging AWS Transit Gateway, VPN, and Direct Connect connectivity problems.
- Ensuring proper IAM policies and roles for secure access management.
- Supporting maintenance activities such as patching EC2 instances, upgrading container runtimes, and managing system updates.
- Participating in the automation of repetitive tasks using scripts.
- Contributing to incident recovery processes and post-mortems to prevent recurrence.
- Providing support for failed deployments and ensuring quick recovery.
- Monitoring AWS Backup jobs and ensuring regular backups for critical infrastructure.
- Validating DR (Disaster Recovery) plans and participating in recovery testing exercises.
- Creating and maintaining operational runbooks, SOPs (Standard Operating Procedures), and knowledge base articles for common AWS issues.

You’d describe yourself as:

- An experienced professional with 6 to 9 years of relevant experience in SRE, DevOps, or Cloud Infrastructure Support, with strong hands-on expertise in AWS services.
- Having hands-on experience with monitoring tools (e.g., Prometheus, Datadog).
- Possessing knowledge of Linux/Unix operating systems and basic scripting skills (Python, Gitlab actions).
- Having experience working with cloud platforms (AWS, Azure, or GCP).
- Being familiar with container orchestration (Kubernetes, Docker, Helm charts) and CI/CD pipelines, including ArgoCD for implementing GitOps workflows and automated deployments for containerized applications.
- Having experience with Datadog, AWS EC2, Lambda, ECS/EKS, RDS, VPC, Route 53, ELB, S3, EFS, and Glacier.
- Strong analytical skills to resolve production incidents effectively.
- Basic understanding of networking concepts (DNS, Load Balancers, Firewalls).
- Good communication and interpersonal skills for incident communication and addressing partner concerns.
- Experience with alerting systems (e.g., PagerDuty) and incident tracking tools (JIRA, ServiceNow).
- Being a proactive problem-solver with a sense of urgency.
- Strong organizational skills to prioritize tasks efficiently.
- Ability to work effectively in high-pressure environments.
- Being a teammate who can collaborate across teams and shift ownership as required.
