Getting MLOps right takes a lot of work. There are no shortcuts. According to Forrester, your AI Transformation is Doomed without MLOps.
There are many challenges on the road to MLOps, which we’ve outlined previously. We’ve also described an approach to get started in MLOps.
At LatentView, we have been building and deploying Machine Learning models since 2006. In the last 14 years, we’ve worked with 95 companies across 7 industry verticals, helping them build and deploy ML infrastructure. During the pre-cloud era (sounds ancient, but it was real), we worked with on-premises structured data warehouses, did the modeling on large dedicated servers, parallelized training using MPI (message passing interface), and orchestrated serving and retraining using IT process automation software such as BMC. Today, we use the cloud, containers, and a sophisticated alphabet soup of technologies that provide easy-to-use abstractions for managing all aspects of the ML model life cycle.
Over the years, we’ve found that companies go through five stages as they progress from their first ML program towards becoming a sophisticated algorithmic colossus. Here are the five stages; hopefully, you’ll find yourself in one of them and find the advice helpful for progressing to the next.
Stage 1: Successful ML Pilot
Description of the Stage
- This is the stage when everyone in your company or function or department is excited about the first few use cases of ML.
- The leader (company-wide, functional, or departmental) has defined a mandate to improve customer service, reduce costs, or automate tasks using ML.
- There are pilots underway, and many of them provide an early promise of significant positive impact.
Key Risks
- Initial ML models may be underwhelming due to the use of limited data.
- Even with good ML models (i.e., accurate predictions), it’s not easy to translate accurate predictions into great business value.
Disciplines to address the Risks
- Committed ownership of the data strategy needs to be complemented by iterative execution to avoid failures and overcome the initial challenges in implementation.
- You need to think hard about a framework for translating ML predictions into business decisions. An ML Canvas can help you get started; a minimal sketch of one such decision rule follows this list.
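To make this concrete, here is a minimal, hypothetical sketch of one way to turn a prediction into a decision: a churn model’s predicted probability is converted into a retention action using a simple expected-value rule. The function name, thresholds, and dollar values are illustrative assumptions, not a prescription; the point is that the business logic (costs, benefits, cutoffs) is explicit and reviewable rather than buried in the model.

```python
# A minimal sketch of turning a prediction into a decision.
# The threshold and dollar values below are hypothetical placeholders.

def retention_decision(churn_probability: float,
                       customer_value: float,
                       offer_cost: float = 50.0,
                       offer_acceptance_rate: float = 0.3) -> str:
    """Decide whether to send a retention offer to a customer."""
    # Expected benefit of intervening: probability the customer would have
    # churned, times the chance the offer works, times the value retained.
    expected_benefit = churn_probability * offer_acceptance_rate * customer_value
    if expected_benefit > offer_cost:
        return "send_retention_offer"
    return "no_action"

# Example: a high-risk, high-value customer gets the offer.
print(retention_decision(churn_probability=0.8, customer_value=1200.0))
```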
Stage 2: First Deployment
Description of the Stage
- You have successfully completed proofs-of-concept in ML. You can make accurate predictions with the data that is available (internal and external data). You have done the hard work of translating the models into decisions.
- The models are deployed and managed by the data science team. For example, they may wrap the model in a Flask app to provide real-time predictions through an API (a minimal sketch follows this list), or they may periodically run batch scoring programs that write model scores to a database.
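As a rough illustration of the hand-rolled serving this stage typically involves, here is a minimal Flask sketch. The model file, payload shape, and use of predict_proba are assumptions for the example; a real service would add input validation, logging, and authentication.

```python
# A minimal sketch of the "wrap the model in Flask" pattern.
import joblib
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("model.pkl")  # hypothetical serialized classifier

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()                  # e.g. {"features": {...}}
    features = pd.DataFrame([payload["features"]])
    score = float(model.predict_proba(features)[0, 1])
    return jsonify({"score": score})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```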
Key Risks
- The data science team is swamped with the operational responsibilities of managing model serving and integration. Occasionally they deliver an update to the models based on new data or algorithms. As the number of models increases, these tasks take up more and more of your data science team’s capacity.
- The data science team works heroically but inefficiently, focusing on the wrong priorities. They don’t have a way of organizing separate environments for development, testing, and production, so everything is in a bit of a mess. They are wondering if there is a better way than babysitting the data pipelines and manually refreshing models. They do not have the bandwidth to work on many other problems that can benefit from ML models.
Disciplines to address the Risks
- IT steps in to take ownership of the model serving and deployment, while data science manages the model development and refresh. Now the data science team has the bandwidth to innovate and experiment with new approaches.
- IT also makes things easier for the data science team by providing the right infrastructure, such as dev/test/prod environments and data.
Stage 3: IT does the serving
Description of the stage
- You have a few models, and you have successfully involved IT in managing deployment of the ML models in a dedicated production environment.
- IT maintains the data pipelines that provide the input data and transform them into features needed for the models. IT and data science collaboratively manage the versioning of the model objects.
- The data science team simply provides the updated model and the feature-transform logic, using Jupyter notebooks as both the main workbench and the vehicle for delivering source code, while they continue to work on new models.
Key Risks
- There is a lot of repetitive and routine work for the data science team. As the number of models increases, the data science team needs to provide updated models to the IT team regularly. Not all of these are major updates that involve a change in how the problem is defined; most are incremental updates that take advantage of more recent data and feedback.
- The current process of collaboration leads to rework and errors. The IT team translates the code in the Jupyter notebook into production-quality pipelines. Apart from the rework involved, this may introduce subtle differences in the code, resulting in training-serving skew and unexpected predictions (see the toy example after this list).
- There is very little modularity or reuse in the approach. Moreover, there is no straightforward way to scale the infrastructure during training.
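To make the skew risk concrete, here is a toy example (with made-up numbers) of how a hand translation of notebook code can quietly change a feature transform: the notebook imputes a missing value with a statistic learned at training time, while the production re-implementation silently substitutes zero.

```python
# A toy illustration (hypothetical numbers) of training-serving skew
# introduced by re-implementing notebook code by hand.
import numpy as np

train_income = np.array([42_000.0, 58_000.0, np.nan, 71_000.0, 65_000.0])
train_median = np.nanmedian(train_income)  # statistic learned at training time

def transform_training(income):
    # Notebook version: impute missing income with the training median.
    return train_median if np.isnan(income) else income

def transform_serving(income):
    # Re-implemented production version: silently imputes with 0.
    return 0.0 if np.isnan(income) else income

x = float("nan")  # a request with missing income
print(transform_training(x))  # 61500.0 -- what the model saw in training
print(transform_serving(x))   # 0.0     -- what the model sees in production
```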
Disciplines to address the Risks
- Package and deploy the Training Pipeline, not just the ML Models. IT creates the ML stack that enables data scientists to deliver orchestrated experimentation pipelines, then takes the experimentation code, packages it, and deploys the training pipeline into the production environment, where the model is generated by running the pipeline (a minimal sketch follows this list).
- Create infrastructure for using containers to train and deploy models. This could be in the form of tools such as KubeFlow or AWS SageMaker. This enables data science to use arbitrary frameworks to train their models, thus providing a clear separation between building and scaling the models.
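Here is a minimal sketch of what “package the training pipeline, not just the model” can look like with scikit-learn. The feature names, estimator choice, and file layout are illustrative assumptions; the point is that a single artifact reproduces both the feature transforms and the fit in whatever environment runs it, and that same artifact can then be built into a container image and run by KubeFlow, SageMaker, or a similar orchestrator.

```python
# A minimal sketch of packaging the training pipeline rather than a
# trained model. Column names and the estimator are hypothetical.
import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_training_pipeline() -> Pipeline:
    numeric = ["income", "tenure_months"]          # hypothetical features
    categorical = ["plan_type"]
    features = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])
    return Pipeline([("features", features),
                     ("model", LogisticRegression(max_iter=1000))])

def train_and_package(training_data: pd.DataFrame, target: str, out_path: str):
    pipeline = build_training_pipeline()
    pipeline.fit(training_data.drop(columns=[target]), training_data[target])
    joblib.dump(pipeline, out_path)   # one artifact: transforms + model
```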
Stage 4: Continuous Integration
Description of the stage
- Based on learnings from DevOps, you deploy an environment where data science teams can create an orchestrated training pipeline for the ML models, leveraging unit and integration tests to ensure that the model converges, that the feature-transform logic gives expected results, and so on (a sketch of such tests follows this list).
- Once the data science team completes the model development and the tests, IT simply takes this orchestrated training pipeline, packages and deploys it into production.
- Based on collaboration with data scientists, IT engineers have defined business logic that triggers retraining of models based on predefined schedules or simple changes in model input/output.
- Now the data science team, freed up from the need to frequently update the models, has significantly expanded the scope of ML. They want to embed ML models into every business process where a large number of decisions are made through judgment.
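To make the testing idea concrete, here is a sketch of two pytest-style checks: one that the feature transforms handle missing values, and one that the trained model reaches a sane level of performance on synthetic data. It reuses the hypothetical build_training_pipeline() from the Stage 3 sketch; the module name, thresholds, and synthetic data are all assumptions.

```python
# A sketch of unit/integration tests for an orchestrated training pipeline.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical module containing the Stage 3 pipeline sketch.
from training_pipeline import build_training_pipeline

def make_synthetic_training_data(n: int = 500) -> pd.DataFrame:
    rng = np.random.default_rng(42)
    income = rng.normal(60_000, 15_000, n)
    tenure = rng.integers(1, 60, n)
    plan = rng.choice(["basic", "pro"], n)
    # A target with a simple, known relationship to the features, so a
    # correctly wired pipeline should comfortably beat random.
    churn = (income < 55_000).astype(int)
    return pd.DataFrame({"income": income, "tenure_months": tenure,
                         "plan_type": plan, "churn": churn})

def test_feature_transform_handles_missing_values():
    df = make_synthetic_training_data()
    df.loc[0, "income"] = np.nan
    pipeline = build_training_pipeline()
    pipeline.fit(df.drop(columns=["churn"]), df["churn"])
    scores = pipeline.predict_proba(df.drop(columns=["churn"]))[:, 1]
    assert np.isfinite(scores).all()   # missing values imputed, not propagated

def test_model_converges_on_synthetic_data():
    df = make_synthetic_training_data()
    pipeline = build_training_pipeline()
    pipeline.fit(df.drop(columns=["churn"]), df["churn"])
    scores = pipeline.predict_proba(df.drop(columns=["churn"]))[:, 1]
    assert roc_auc_score(df["churn"], scores) > 0.8   # sanity threshold
```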
Key Risks
- Manual deployment becomes a bottleneck to scaling this process, especially as the number of models increases significantly. Ensembles become the norm, and model chaining becomes widespread (i.e., the output of one model is used as a feature in another).
- Now that the training is automated, there is a need for more monitoring of the models to identify early warning signals and potential risks.
- Reuse is limited to the processing level, i.e., the pipelines. There is still a lot of duplication at the level of features.
Disciplines to address the Risks
- Continuous Deployment to automate the packaging and deployment of models into the production environment. In the previous stage, the training pipeline was deployed in production, but the deployment was still manual. In this stage, the deployment of the models and the pipeline is fully automated, with no need for IT involvement in manual deployment. This frees up IT to focus on implementing artifacts such as a metadata store, feature stores, etc.
- More sophisticated monitoring to detect drift, changes in the prediction distribution, etc. There are many types of drift, and they are monitored by logging the input data at serving time, the model scores, and operational metrics such as latency. These metrics can be aggregated over time to identify patterns and anomalies in the data and the output (see the sketch after this list).
- Feature stores to reuse features, reduce rework, and avoid duplicates. IT implements a feature store covering all the features built by data scientists, whether or not they are currently used in a model. This provides a single source of truth for features across all environments – dev, test, and production.
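As a minimal sketch of one common building block for this kind of monitoring, the snippet below checks for drift with a two-sample Kolmogorov-Smirnov test over a recent window of logged serving data. The column names, window sizes, and p-value threshold are illustrative assumptions; production systems typically also track PSI, latency percentiles, and similar metrics.

```python
# A sketch of simple drift monitoring over logged serving data.
import pandas as pd
from scipy.stats import ks_2samp

def detect_drift(training_df: pd.DataFrame,
                 serving_window_df: pd.DataFrame,
                 columns: list[str],
                 p_value_threshold: float = 0.05) -> dict[str, bool]:
    """Compare a recent window of logged serving data against the
    training baseline, column by column."""
    alerts = {}
    for col in columns:
        result = ks_2samp(training_df[col].dropna(),
                          serving_window_df[col].dropna())
        alerts[col] = result.pvalue < p_value_threshold  # True = shifted
    return alerts

# Example (hypothetical columns; 'score' is the logged model output):
# alerts = detect_drift(training_baseline, last_24h_logs,
#                       columns=["income", "tenure_months", "score"])
```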
Stage 5: Full Automation and Monitoring
At this stage, the MLOps stack contains capabilities for automatically integrating new models and enhancements into the production environment, irrespective of which tools data scientists use. Feature stores provide a single source of truth for features. Monitoring is sophisticated. Training is continuous and automatic.
All this leads to a very productive and happy data science team: productive because they can get more things done quickly, and happy because they make a significant positive impact on the business. The IT team is likewise productive, because they can focus on improving the infrastructure rather than managing operational tasks, and happy because they have translated the promise of data science into significant benefits for the business.
Key Risks
- The key risk is the loss of edge that may arise due to changing business needs and/or technological breakthroughs.
Disciplines to address the Risks
- The ability to manage the performance of the models as a portfolio, through a continuous process of invention and experimentation by data scientists and business teams.
- Sensing risks to the business by constantly scanning the environment for problems and opportunities and driving constant improvements to the MLOps stack.
Conclusion
We hope that this 5-stage framework helps you understand where you are in your MLOps maturity and the challenges you can expect to face in the near future. It can serve as a guide to infrastructure decisions that enable you to leapfrog ahead and avoid getting stuck at any one stage.
It used to be the case that MLOps was needed only for internet-scale businesses that rely on a large number of models to drive automated decisions. Today, however, every business is becoming digital, and ML models are playing a key role in improving business performance in a variety of ways. Consequently, MLOps capabilities are critical to surviving and thriving in this world.