DevOps Engineer
Job Description
- Own the reliability, availability, and performance of microservices and production workloads.
- Design and improve resilient infrastructure on GCP, with strong emphasis on Cloud Run, Kubernetes, and containerized services.
- Build and maintain observability across logs, metrics, tracing, alerting, and service health so issues are detected early and resolved quickly.
- Improve deployment safety through stronger CI/CD pipelines, release controls, rollback strategies, and environment consistency.
- Lead incident response and production readiness practices, including runbooks, postmortems, on-call hygiene, capacity planning, and resilience testing.
- Reduce operational toil by automating repetitive work and improving tooling for engineers supporting distributed services.
- Partner with development teams to improve the operability, scalability, and fault tolerance of microservices early in the design lifecycle.
- Strengthen cloud security and infrastructure hygiene across IAM, secrets management, workload hardening, and production safeguards.
- Improve service performance, resource efficiency, and cloud cost management without compromising reliability.
- Support architecture and reliability reviews for critical services and high-traffic business events.
Qualifications
- 5+ years of experience in Site Reliability Engineering or closely related DevOps roles with meaningful production ownership.
- Strong experience running production systems on Google Cloud Platform.
- Hands-on experience with Cloud Run, Kubernetes, and container-based microservices in production.
- Strong experience with infrastructure as code, particularly Terraform and Terragrunt.
- Strong understanding of observability using tools such as OpenTelemetry, Cloud Monitoring, New Relic, or equivalent systems.
- Strong understanding of distributed systems, microservice failure modes, reliability engineering, and production debugging.
- Experience building or improving CI/CD pipelines and release workflows in modern engineering environments, including GitHub Actions.
- Ability to write code and automation in one or more languages such as Python or Java.
- Good judgment during incidents and a practical mindset around reliability, recovery, and risk tradeoffs.
- Strong written and verbal communication skills, with the ability to work effectively across engineering teams.
- Experience working with AI tooling and agentic workflows in engineering or operational environments.
- Experience in retail, e-commerce, or other customer-facing environments is a plus.