Senior DevOps Engineer
thalesgroup
Job Description
Key Responsibilities:
- Monitoring & Observability: Design, implement, and maintain sophisticated monitoring, alerting, and logging solutions to ensure the reliability, availability, and performance of our security-focused SaaS platform. Use tools such as Prometheus, Grafana, and Datadog to provide deep visibility into system health, security metrics, and application performance.
- Incident Management: Respond to and mitigate incidents in real time, ensuring minimal impact on customers. Drive post-mortems and root cause analyses (RCAs) to improve monitoring and response processes.
- System Reliability: Collaborate with cross-functional teams to define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for both security and performance metrics.
- Automation & CI/CD Integration: Build automated monitoring and alerting pipelines that integrate seamlessly with CI/CD workflows to catch issues early in development, testing, and production environments.
- Mentorship & Best Practices: Provide guidance and mentorship to junior DevOps engineers, helping them adopt best practices for monitoring, observability, and security.
- Optimization & Continuous Improvement: Continuously evaluate and refine monitoring tools and practices to adapt to new threats, technologies, and regulatory requirements.
Required Qualifications:
- 5+ years of experience in DevOps, Site Reliability Engineering, or Infrastructure roles, ideally in cybersecurity or SaaS environments.
- Strong experience with monitoring tools like Prometheus, Grafana, Datadog, ELK, Splunk, or similar observability solutions.
- Expertise in Linux/Unix-based systems and cloud environments (AWS, GCP, Azure).
- Proficiency in scripting languages such as Python, Bash, or Go to automate monitoring tasks and create custom solutions.
- Deep understanding of security principles and experience integrating security monitoring into DevOps practices (e.g., SIEM systems, threat detection).
- Experience with containerization (Docker) and orchestration (Kubernetes) to monitor containerized applications in production.
- Familiarity with Infrastructure-as-Code (IaC) tools like Terraform, Ansible, or CloudFormation to automate infrastructure monitoring setup.
- Solid problem-solving skills, a keen eye for detail, and a proactive approach to system monitoring and incident response.