Ops / SRE Support Engineer

Vialto

Bengaluru, India · 10 Years Exp · Posted 7d ago

Job Description

AI & Solution Architecture 

  • Define end-to-end architecture for Generative AI, agentic systems, and AIOps platforms 

  • Design scalable, secure, and resilient AI systems across cloud and hybrid environments 

  • Establish reference architectures, design patterns, and best practices for AI systems 

Agentic & GenAI Systems 

  • Architect multi-agent systems, orchestration workflows, and tool integration frameworks 

  • Define patterns for RAG, retrieval pipelines, and knowledge integration 

  • Design evaluation frameworks, guardrails, and Responsible AI controls 

  • Guide implementation of agent frameworks and workflow orchestration 

  • Define design patterns for implementation with Agentic Harness

AIOps & Observability 

  • Architect intelligent operations platforms leveraging logs, metrics, traces, and events 

  • Define strategies for anomaly detection, alert correlation, and automated remediation 

  • Establish observability standards using OpenTelemetry (OTel) and modern monitoring tools such as Tempo, Loki, Prometheus, and Grafana

Cloud & Platform Engineering 

  • Lead architecture for AI solutions on Microsoft Azure 

  • Design cloud-native systems using containers (Docker, Kubernetes) 

  • Define data architecture across PostgreSQL, Cosmos DB, SQL Server, and telemetry pipelines 

Engineering & Delivery Enablement 

  • Guide teams on backend architecture using Python and FastAPI 

  • Define CI/CD, DevOps, and automation strategies using Azure DevOps (ADO) and GitHub

  • Establish standards for testing, evaluation, and performance optimization of AI systems 

Security, Governance & Responsible AI 

  • Define enterprise standards for AI security, privacy, and compliance 

  • Implement Responsible AI practices including: 

      • Prompt safety

      • Data protection

      • Access control

      • Model governance

  • Ensure auditability and reliability of AI-driven systems 

Operational Excellence 

  • Drive improvements in MTTR, MTTD, reliability, and operational efficiency 

  • Define SLIs, SLOs, and error budgets for AI-powered services 

  • Architect automation for incident response and root-cause analysis 

Leadership & Collaboration 

  • Partner with SRE, platform, data, and application teams 

  • Mentor engineers and guide architectural decision-making 

  • Influence AI strategy and roadmap across the organization 

  • Apply strong communication and stakeholder management skills
