Software Engineer - Infrastructure

instahyre

Bangalore 4 Years Exp Posted 34d ago

Job Description

 

Platform and Infrastructure:

  • Maintain the stability of our platform, which consists of distributed microservices that interact closely with Kubernetes and cloud providers (GCP, AWS).
  • Manage Kubernetes workloads with ArgoCD (GitOps): deploy, monitor, and troubleshoot application syncs, resource trees, and rollouts.
  • Debug and resolve complex Kubernetes issues across clusters.
  • Manage CDN and edge infrastructure (Cloudflare) for performance, caching, and traffic management.
  • Automate infrastructure lifecycle operations and workflows.

 

Observability and Incident Response:

  • Own the observability stack: Grafana (dashboards, Loki logs, Prometheus metrics) and New Relic (APM, golden metrics, transaction analysis).
  • Enhance monitoring, alerting, and distributed tracing across services.
  • Participate in on-call rotation via PagerDuty, handle incident response, and perform root cause analysis.
  • Proactively identify reliability risks before they become incidents.

 

AI Agent Infrastructure:

  • Support the platform that runs AI agent workloads: job scheduling, trajectory tracking, environment provisioning, deployments, and cost attribution.
  • Develop Kubernetes controllers and operators to extend platform capabilities for agent orchestration.

 

Collaboration and Internal Tooling:

  • Work closely with product and backend teams to ensure platform scalability and reliability.
  • Build internal tools, automate workflows, and integrate systems to improve team productivity.
  • Stay current with Kubernetes releases, CNCF ecosystem updates, and cloud-native best practices.

 

The core requirements for the job include the following:

 

Core Requirements:

  • 3+ years of software/platform engineering experience with production systems.
  • Strong proficiency in Go or Python; you write production code in at least one of them daily.
  • Hands-on experience building and deploying services on Kubernetes. Not just YAML: you've developed something that runs on K8s.
  • Experience with GitOps tooling (ArgoCD, Flux, or similar).

 

Systems Fundamentals:

  • Strong networking and DNS fundamentals: TCP/IP, HTTP, load balancing, DNS resolution, TLS, and debugging connectivity issues.
  • Solid Linux/OS fundamentals: process management, filesystem, memory, and systemd, plus comfort debugging with tools like strace, tcpdump, and netstat.

 

Data and Messaging Infrastructure:

  • Relational databases: experience with PostgreSQL, MySQL, or similar, including indexing, query optimisation, replication, and backup/restore procedures.
  • NoSQL databases: familiarity with MongoDB, DynamoDB, Redis, or similar for document/key-value workloads.
  • Caching: experience with Redis, Memcached, or similar for application and infrastructure-level caching.
  • Message queues and streaming: hands-on experience with Kafka, SQS, RabbitMQ, or similar for event-driven architectures.
  • Strong SQL skills for debugging and operational queries.

 

Infrastructure and Observability:

  • Comfortable with the CNCF ecosystem: Helm, Kustomize, cert-manager, Ingress controllers, and CNI/CSI interfaces.
  • Hands-on with at least one observability stack (Grafana/Prometheus/Loki, New Relic, Datadog, or similar).
  • Familiarity with GCP and/or AWS managed Kubernetes (GKE/EKS), networking, IAM, storage, and cloud-native services (SES, SQS, S3, etc.).
  • Experience with CDN/edge platforms (Cloudflare, CloudFront, or similar).

 

Nice to Have:

  • Experience building Kubernetes Operators (kubebuilder, operator-sdk, or controller-runtime).
  • Experience tuning Kubernetes core components (API server, kubelet, scheduler).
  • Familiarity with AI/LLM infrastructure: token management, cost tracking, and agent orchestration.
  • Experience with CI/CD pipelines (GitHub Actions, automated testing, deployment pipelines).
  • Infrastructure as Code experience (Terraform, Pulumi, or similar).
  • Previous work on large-scale distributed systems or platform-as-a-service.
  • Startup experience; you thrive in fast-paced, ambiguous environments.

 

Expectations:

  • You're a generalist who can context-switch between debugging a K8S deployment, setting up a Grafana alert, and configu
