Senior Site Reliability Engineer
Truecaller
Job Description
What you bring in:
- Extensive knowledge of system administration in Linux environments, preferably on high-throughput, low-latency systems.
- Strong hands-on experience with GCP services (or transferable AWS/Azure skills) — networking, IAM, compute, storage, Kubernetes (GKE/EKS/AKS).
- Extensive knowledge of Docker and Kubernetes.
- Excellent understanding of distributed system design across process and site boundaries.
- Hands-on experience with service orchestration, management, deployment activities, configuration management and all necessary automation.
- Strong grasp of process isolation and containerization concepts, and the ability to apply them when necessary.
- Container orchestration: Deep understanding of Kubernetes — deploying, scaling, monitoring clusters.
- Monitoring & Observability: Experience with tools like Prometheus, Grafana, Stackdriver, Datadog, New Relic, etc.
- Incident management: Practical experience responding to incidents, performing root cause analysis, and improving system reliability.
- Security best practices: Knowledge of cloud security, secrets management, and compliance basics.
- Good understanding of the software development lifecycle: versioning, building, testing, staging, and deployment processes, with a strong continuous delivery mindset.
The impact you will create:
- Build tooling to ease the provisioning and scaling of infrastructure resources.
- Continuously improve and scale infrastructure components to handle growth.
- Improve overall system performance and investigate failures, taking an active part in discussions of future improvements.
- Ensure system availability, reachability, and maintainability by building the necessary instrumentation, tooling, and alerting systems to escalate abnormalities.
- Influence monitoring and capacity planning together with the application development teams, in alignment with business goals.
It would be great if you also have:
- Experience developing Kubernetes operators.
- Experience deploying and scaling Apache Cassandra, ScyllaDB, MySQL, PostgreSQL, Redis, or Memcached.
- Go programming language experience, or willingness to learn to code in Go (it'll help us build new k8s operators and improve the existing ones).