Software Engineer - Infrastructure
instahyre
Job Description
Platform and Infrastructure:
- Maintain stability of our platform consisting of distributed microservices closely interacting with Kubernetes and cloud providers (GCP, AWS).
- Manage Kubernetes workloads with ArgoCD (GitOps): deploy, monitor, and troubleshoot application syncs, resource trees, and rollouts.
- Debug and resolve complex Kubernetes issues across clusters.
- Manage CDN and edge infrastructure (Cloudflare) for performance, caching, and traffic management.
- Automate infrastructure lifecycle operations and workflows.
Observability and Incident Response:
- Own the observability stack: Grafana (dashboards, Loki logs, Prometheus metrics), New Relic (APM, golden metrics, transaction analysis).
- Enhance monitoring, alerting, and distributed tracing across services.
- Participate in on-call rotation via PagerDuty, handle incident response, and perform root cause analysis.
- Proactively identify reliability risks before they become incidents.
AI Agent Infrastructure:
- Support the platform that runs AI agent workloads: job scheduling, trajectory tracking, environment provisioning, deployments, and cost attribution.
- Develop Kubernetes controllers and operators to extend platform capabilities for agent orchestration.
Collaboration and Internal Tooling:
- Work closely with product and backend teams to ensure platform scalability and reliability.
- Build internal tools, automate workflows, and integrate systems to improve team productivity.
- Stay current with Kubernetes releases, CNCF ecosystem updates, and cloud-native best practices.
The core requirements for the job include the following:
Core Requirements:
- 3+ years of software/platform engineering experience with production systems.
- Strong proficiency in Go or Python; you write production code in at least one of them daily.
- Hands-on experience building and deploying services on Kubernetes, beyond writing YAML; you've developed something that runs on K8S.
- Experience with GitOps tooling (ArgoCD, Flux, or similar).
Systems Fundamentals:
- Strong networking and DNS fundamentals: TCP/IP, HTTP, load balancing, DNS resolution, TLS, and debugging connectivity issues.
- Solid Linux/OS fundamentals (process management, filesystems, memory, systemd), and comfort debugging with tools like strace, tcpdump, and netstat.
Data and Messaging Infrastructure:
- Relational databases: experience with PostgreSQL, MySQL, or similar, including indexing, query optimisation, replication, and backup/restore procedures.
- NoSQL databases: familiarity with MongoDB, DynamoDB, Redis, or similar for document/key-value workloads.
- Caching: experience with Redis, Memcached, or similar for application and infrastructure-level caching.
- Message queues and streaming: hands-on experience with Kafka, SQS, RabbitMQ, or similar for event-driven architectures.
- Strong SQL skills for debugging and operational queries.
Infrastructure and Observability:
- Comfortable with the CNCF ecosystem: Helm, Kustomize, cert-manager, Ingress controllers, and CNI/CSI interfaces.
- Hands-on with at least one observability stack (Grafana/Prometheus/Loki, New Relic, Datadog, or similar).
- Familiarity with GCP and/or AWS managed Kubernetes (GKE/EKS), networking, IAM, storage, and cloud-native services (SES, SQS, S3, etc.).
- Experience with CDN/edge platforms (Cloudflare, CloudFront, or similar).
Nice to Have:
- Experience building Kubernetes Operators (kubebuilder, operator-sdk, or controller-runtime).
- Experience tuning Kubernetes core components (API server, kubelet, scheduler).
- Familiarity with AI/LLM infrastructure: token management, cost tracking, and agent orchestration.
- Experience with CI/CD pipelines (GitHub Actions, automated testing, deployment pipelines).
- Infrastructure as Code experience (Terraform, Pulumi, or similar).
- Previous work on large-scale distributed systems or platform-as-a-service.
- Startup experience; you thrive in fast-paced, ambiguous environments.
Expectations:
- You're a generalist who can context-switch between debugging a K8S deployment, setting up a Grafana alert, and configu