AI/ML DevOps Engineer, AS
db
Job Description
Your key responsibilities
- Manage incident, service, problem and change management for shared AI platforms.
- Monitor production AI/ML models for performance, latency, accuracy, data drift and model drift, and proactively troubleshoot production issues.
- Automate model packaging, versioning and rollbacks.
- Monitor model inference speed, latency and accuracy.
- Optimize resource allocation for cost-effective AI workloads.
- Detect and mitigate data drift affecting model performance.
- Troubleshoot model failures, latency issues and deployment errors.
- Collaborate with L3 engineers and data scientists for escalations.
- Utilize containerization technologies like Docker to package models and dependencies.
Continuous Integration/Continuous Deployment (CI/CD):
- Develop and maintain CI/CD pipelines for automating the testing, integration, and deployment of ML models.
- Implement version control to track changes in both code and model artifacts.
Monitoring and Logging:
- Establish monitoring solutions to track the performance and health of deployed models.
- Set up logging mechanisms to capture relevant information for debugging and auditing purposes.
- Optimize ML infrastructure for scalability and cost-effectiveness.
- Implement auto-scaling mechanisms to handle varying workloads efficiently.
- Enforce security best practices to safeguard both the models and the data they process.
- Ensure compliance with industry regulations and data protection standards.
- Oversee the management of data pipelines and data storage systems required for model training and inference.
- Implement data versioning and lineage tracking to maintain data integrity.
- Collaborate with DevOps teams to align MLOps practices with broader organizational goals.
- Continuously optimize and fine-tune ML models for better performance.
- Identify and address bottlenecks in the system to enhance overall efficiency.
- Maintain clear and comprehensive documentation of MLOps processes, infrastructure, and model deployment procedures.
- Document best practices and troubleshooting guides for the team.
Your skills and experience
- Excellent communication and presentation skills, highly organized and disciplined.
- Experienced in working with multiple stakeholders. Ability to create and naturally maintain good business relationships with all stakeholders.
- Comfortable working in VUCA (volatility, uncertainty, complexity, ambiguity) and highly dynamic environments.
- Expertise in the following products and technologies is required:
- Google Cloud – GKE, Terraform, IAM, BigQuery, Cloud Shell, Cloud Storage
- AI/ML – AI agents, AI/ML concepts, ML models, Vertex AI, AutoML, BigQuery ML
- MLOps and CI/CD pipelines, Kubeflow, Vertex AI Pipelines
- Proficiency in designing, deploying and managing AI agents, e.g. chatbots and virtual assistants
- GCP Networking, Networking protocols, Security concepts, VPC, Load balancers
- Basic administration of Unix servers
- Python, Shell Scripting, SQL
- Familiarity with fine-tuning and deploying large language models on GCP.
- Understanding of security best practices, including data governance, encryption, and compliance with AI-related regulations.
- GCP – Cloud Logging, Cloud Monitoring and AI model performance tracking
- 4+ years of work experience in IT (AVP: 6+ years; Associate: 4+ years)
- Strong problem-solving skills and a passion for AI research
- Good interpersonal skills, with the ability to cooperate and collaborate with other teams
Educational Qualifications:
- B.E. / B.Tech. / Master’s degree in Computer Science or equivalent
- Added advantage:
- GCP Certifications
- Kubernetes Certifications
- AI/ML educational background, certifications or higher qualifications
How we’ll support you
- Training and development to help you excel in your career
- Coaching and support from experts in your team
- A culture of continuous learning to aid progression
- A range of flexible benefits that you can tailor to suit your needs