Sr. MLOps Engineer
Carrier
Job Description
Primary responsibilities:
- Develop and manage deployment processes for machine learning models, ensuring seamless integration into production environments
- Design and implement automated CI/CD pipelines for ML workflows, adhering to company standards and best practices
- Create and maintain monitoring tools to track model performance, reliability, and accuracy in production
- Optimize infrastructure for model training, testing, and deployment, including the development of template scripts and automation to accelerate the development process
- Collaborate with data scientists, data engineers, and platform engineers to streamline ML operations and integrate new AI technologies into the platform ecosystem
- Ensure security and compliance of ML models and workflows with industry standards, regulations, and company governance frameworks
- Research and integrate best practices and new technologies in MLOps to improve efficiency and effectiveness
- Assist in the creation and implementation of rigorous evaluation and validation processes for ML models, focusing on automation of validation scripts for deployment
- Contribute to the development and maintenance of training materials and user guides for the AI platform
Experience and Skills Required:
- Bachelor's or Master's degree in Computer Science, Software Engineering, or a related field
- 6+ years of experience in software engineering or DevOps, including at least 2-3 years of hands-on experience with machine learning operations or AI platform engineering
- Demonstrated experience in deploying and maintaining machine learning models in production environments
- Strong programming skills in Python and proficiency with shell scripting
- Extensive experience with CI/CD tools (e.g., Jenkins, GitLab CI, or Azure DevOps)
- In-depth knowledge of containerization technologies (e.g., Docker) and orchestration platforms (e.g., Kubernetes)
- Familiarity with cloud platforms (e.g., AWS, Azure, or GCP) and their ML-specific services
- Practical experience with ML frameworks such as TensorFlow, PyTorch, or scikit-learn
- Strong understanding of data pipelines, ETL processes, and data storage solutions
- Experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK stack)
- Excellent problem-solving skills and ability to optimize complex systems
- Strong communication skills and ability to work effectively in a collaborative environment
- Knowledge of data governance, security best practices, and compliance regulations related to AI/ML
- Experience with version control systems (e.g., Git) and ML model versioning tools (e.g., MLflow, DVC)