Ops / SRE Support Engineer
Vialto
Job Description
AI & Solution Architecture
- Define end-to-end architecture for Generative AI, agentic systems, and AIOps platforms
- Design scalable, secure, and resilient AI systems across cloud and hybrid environments
- Establish reference architectures, design patterns, and best practices for AI systems
Agentic & GenAI Systems
- Architect multi-agent systems, orchestration workflows, and tool integration frameworks
- Define patterns for RAG, retrieval pipelines, and knowledge integration
- Design evaluation frameworks, guardrails, and Responsible AI controls
- Guide implementation of agent frameworks and workflow orchestration
- Design patterns for implementation with Agentic Harness
AIOps & Observability
- Architect intelligent operations platforms leveraging logs, metrics, traces, and events
- Define strategies for anomaly detection, alert correlation, and automated remediation
- Establish observability standards using OpenTelemetry (OTel) and modern monitoring tools such as Tempo, Loki, Prometheus, and Grafana
Cloud & Platform Engineering
- Lead architecture for AI solutions on Microsoft Azure
- Design cloud-native systems using containers (Docker, Kubernetes)
- Define data architecture across PostgreSQL, Cosmos DB, SQL Server, and telemetry pipelines
Engineering & Delivery Enablement
- Guide teams on backend architecture using Python and FastAPI
- Define CI/CD, DevOps, and automation strategies using Azure DevOps (ADO) and GitHub
- Establish standards for testing, evaluation, and performance optimization of AI systems
Security, Governance & Responsible AI
- Define enterprise standards for AI security, privacy, and compliance
- Implement Responsible AI practices including:
  - Prompt safety
  - Data protection
  - Access control
  - Model governance
- Ensure auditability and reliability of AI-driven systems
Operational Excellence
- Drive improvements in MTTR, MTTD, reliability, and operational efficiency
- Define SLIs, SLOs, and error budgets for AI-powered services
- Architect automation for incident response and root-cause analysis
Leadership & Collaboration
- Partner with SRE, platform, data, and application teams
- Mentor engineers and guide architectural decision-making
- Influence AI strategy and roadmap across the organization
- Apply strong communication and stakeholder management skills