AI Engineer | AI-Ops Agent Development
xaasio
Job Description
Your Day-to-Day Impact
- The candidate will be responsible for:
- Designing and developing AI-Ops agents for cloud, infrastructure, platform, and SRE operations.
- Building AI agents for platforms such as:
  - Kubernetes
  - OpenStack
  - Ceph
  - PostgreSQL
  - MariaDB
  - Kafka
  - OpenSearch
  - Grafana
  - Zabbix
  - Linux
  - XaasIO CMP
  - XaasIO MLT
- Developing agent workflows for:
  - Alert triage
  - Log analysis
  - Metrics analysis
  - Event correlation
  - Root-cause analysis
  - Incident summarization
  - Runbook recommendation
  - Remediation planning
  - Change impact analysis
  - Post-change validation
  - Post-incident review support
  - Compliance validation
  - Security posture analysis
  - Automated operational report generation
- Building RAG-based knowledge systems using runbooks, SOPs, architecture documents, platform documentation, logs, tickets, alerts, monitoring data, security scan reports, compliance reports, and incident history.
- Integrating AI agents with observability and operations platforms such as:
  - Grafana
  - Prometheus
  - OpenSearch
  - Zabbix
  - Alertmanager
  - Wazuh
  - ITSM tools
  - CI/CD systems
  - Git repositories
  - Ansible / AWX
  - OpenTofu / Terraform
- Building safe agent workflows with human-in-the-loop approvals before executing production-impacting actions.
- Creating automation playbooks and remediation workflows using Python, Ansible, APIs, shell scripts, and event-driven automation.
- Developing agent tools and connectors for:
  - Kubernetes API
  - OpenStack APIs
  - Ceph APIs
  - Linux system commands
  - PostgreSQL / MariaDB APIs
  - Monitoring APIs
  - Logging APIs
  - ITSM APIs
  - CI/CD APIs
  - DevSecOps tool APIs
- Designing guardrails for AI agent actions, including:
  - Role-based access control
  - Approval workflows
  - Audit logging
  - Dry-run mode
  - Policy validation
  - Change window validation
  - Rollback checks
  - Secrets protection
  - Security baseline validation
  - Safety checks before remediation
- Implementing DevSecOps and CI/CD pipeline integrations for automated validation, secure build processes, security scanning, compliance checks, and deployment approvals.
- Integrating SAST, DAST, SCA, container image scanning, IaC scanning, secrets scanning, SBOM generation, vulnerability checks, and policy-as-code gates into development and deployment workflows.
- Evaluating and integrating open-source AI agent frameworks, AI platform engineering tools, and AI-Ops reference architectures.
- Developing PoCs, demos, technical documentation, architecture diagrams, test cases, and customer-facing presentations.
- Troubleshooting agent behavior, hallucination risks, prompt failures, tool-calling errors, data quality issues, model performance issues, security scan failures, pipeline failures, and infrastructure integration problems.
Skills You Bring to the Table
- Bachelor’s or Master’s degree in Computer Science, Artificial Intelligence, Machine Learning, Data Science, Information Technology, Engineering, Cybersecurity, or equivalent practical experience.
- Certifications in AI, data science, Kubernetes, Linux, cloud, DevOps, DevSecOps, cybersecurity, or security compliance will be an added advantage.
- The candidate should have hands-on experience in:
  - Python programming
  - LLM application development
  - AI agent development
  - Prompt engineering
  - RAG pipeline design
  - Vector databases
  - REST API integration
  - Linux fundamentals
  - Git-based development workflows
  - Docker and containerized application deployment
  - Kubernetes basics
  - Observability fundamentals: logs, metrics, events, and traces
  - Automation scripting using Python and shell
  - DevOps practices and infrastructure operations workflows
  - CI/CD pipelines
  - CI/CD tools such as GitHub Actions, GitLab CI/CD, Jenkins, Argo CD, Tekton, or similar
  - Building, testing, packaging, and deploying applications through CI/CD workflows
  - DevSecOps practices and secure software delivery workflows
  - SAST, DAST, SCA, and container image scanning
  - IaC scanning and secrets scanning
  - SBOM generati