Senior Software Engineer (AI/ML Platform)
Job Description
Responsibilities
- Design and Implement Scalable AI/ML Serving Systems: Develop scalable and efficient systems for serving AI/ML models, ensuring that these systems can handle varying loads and perform with low latency across diverse environments
- Hybrid Cloud Architecture Management: Architect and manage a hybrid cloud environment that uses both on-premises resources and multiple cloud platforms (e.g., AWS, Azure, GCP) to optimize performance, cost, and scalability
- Model Deployment and Versioning: Oversee the deployment of AI/ML models into production, including the setup of CI/CD pipelines for model deployment and versioning, ensuring smooth and reliable model updates and rollbacks
- Performance Monitoring and Optimization: Implement monitoring tools and practices to track the performance of AI/ML models in production, identifying bottlenecks and optimizing system and model performance for better efficiency and reduced costs
- Security and Compliance: Ensure that the AI/ML serving systems follow industry standards and regulatory requirements for data security and privacy, including the management of data encryption, access controls, and audit trails
- Collaboration and Leadership: Work closely with AI/ML researchers, data engineers, and other partners to translate complex AI/ML models into production-ready systems, providing technical guidance throughout the project lifecycle
- Research and Innovation: Stay informed about the latest developments in AI/ML technologies, cloud computing, and software engineering practices, exploring and integrating solutions that can enhance the capabilities and efficiency of the AI/ML serving platform
Minimum Qualifications
- Educational Background: BS or MS in Computer Science, or equivalent practical experience
- Experience: 5+ years of experience in software development and engineering, with a solid record of delivering production systems and services
- Expertise in AI/ML Technologies: Hands-on experience with AI/ML frameworks (such as TensorFlow, PyTorch) and familiarity with the lifecycle of AI/ML model development, from training to deployment
- Proficiency in Programming Languages: Strong coding skills in languages commonly used in AI/ML and system development, such as Python
- Experience with Cloud Technologies: Experience with designing and managing systems on hybrid cloud architectures, including working knowledge of cloud service providers like Azure
- Knowledge of Containerization and Orchestration Tools: Familiarity with containerization technologies (e.g., Docker) and orchestration systems (e.g., Kubernetes), crucial for deploying and scaling applications in a cloud environment
- Understanding of DevOps Practices: Knowledge of CI/CD pipelines, infrastructure as code, and other DevOps practices to ensure smooth deployment and operation of AI/ML systems
- System Performance Optimization: Deep understanding of performance metrics and latency optimization techniques, with the ability to diagnose, tune, and enhance the efficiency of serving systems
Preferred Qualifications
- Cloud Certifications: Certifications in cloud technologies from major providers (AWS Certified Solutions Architect, Google Cloud Professional Cloud Architect, Microsoft Certified: Azure Solutions Architect Expert), indicating a high level of expertise in cloud services and architecture
- Experience with Big Data Technologies: Experience with big data technologies and ecosystems (Hadoop, Spark, Kafka) for processing and analyzing large datasets in a distributed computing environment
- AI/ML Model Monitoring Tools: Familiarity with tools and frameworks for monitoring and managing the performance of AI/ML models in production (e.g., MLflow, Kubeflow, TensorBoard)
- Prior experience in an on-call rotation for tier-1 services with 24x7 support