Machine Learning Model Serving Overview (Seldon Core, KFServing, BentoML, MLFlow)
TL;DR: I’m looking for a way to provide Data Scientists with tools to deploy a growing number of models independently, with minimal Engineering and DevOps effort for each deployment. After considering several model serving solutions, I found Seldon Core to be the most suitable for this project’s needs.
I’m currently building an ML based system for my client.
To give you a simplified context without getting too much into the details — the goal of the ML system is to help the main business system by providing real time predictions based on trained NLP models:
A deeper look inside the ML System will show multiple predictive models — each of them knows how to answer a specific question. The business system needs the ability to query any number of them in different permutations:
Orientation within the ML Space
The 2015 article Hidden Technical Debt in Machine Learning Systems featured the following figure:
In this post we’ll be focusing on the “Serving Infrastructure” part of it.
What is Model Serving?
To understand what model serving is, we’ll examine it from two perspectives: code and workflow.
Let’s start with the code. The following is a basic example along the lines of the SKLearn tutorial:
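A sketch of such an example, in the spirit of scikit-learn’s digits tutorial (the exact hyperparameters and file name here are my own choices):

```python
from sklearn import datasets, svm
import joblib

# Training phase: fit a classifier on the digits dataset
digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001, C=100.0)
clf.fit(digits.data[:-1], digits.target[:-1])

# The training phase ends when we dump the model to a file
joblib.dump(clf, "model.joblib")

# The prediction phase starts when we load it back
clf2 = joblib.load("model.joblib")
print(clf2.predict(digits.data[-1:]))
```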
We could conceptually divide the above code into two fragments: the training phase and the prediction phase.
The training phase ends when we dump the model to a file.
The prediction phase starts when we load it.
We can use the same phases when examining the ML development workflow.
The phases are characterized by which role in the data team is responsible for them, as well as which considerations are taken into account.
The training phase is in the realm of Data Scientists, where the considerations run along the lines of which algorithm will produce the best recall and precision rates (and probably many more). The prediction phase is in the domain of Data Engineers and DevOps, where the considerations run along the lines of:
- How to wrap the prediction code as a production-ready service?
- How to ship and load the dumped model file?
- Which API / Protocol to use?
- Scalability, Throughput, Latency.
- Deployments — How to deploy new model versions? How to rollback? Can we test it using Canary Deployments or Shadow Deployments?
- Which ML frameworks can we support? (e.g. SKLearn, TensorFlow, XGBoost, PyTorch, etc.)
- How to wire custom pre and post-processing?
- How to make the deployment process easy and accessible for Data Scientists?
Thankfully there are several frameworks that provide solutions to some of the above considerations. We’ll present them in a high-level overview, compare them, and conclude with the one I chose for my project.
- Just a REST API wrapper
- “The K8s Model Serving Projects”: KFServing and Seldon Core
- BentoML
- MLFlow
Just a REST API wrapper
As simple as it sounds. For an example, see the tutorial “Serving Machine Learning Models with FastAPI in Python” by Jan Forster on Medium.
The K8s Model Serving Projects
There are two popular model serving projects, both built on Kubernetes: KFServing and Seldon Core. From KFServing’s docs:
KFServing provides a Kubernetes Custom Resource Definition for serving machine learning (ML) models on arbitrary frameworks. It aims to solve production model serving use cases by providing performant, high abstraction interfaces for common ML frameworks like Tensorflow, XGBoost, ScikitLearn, PyTorch, and ONNX.
It encapsulates the complexity of autoscaling, networking, health checking, and server configuration to bring cutting edge serving features like GPU Autoscaling, Scale to Zero, and Canary Rollouts to your ML deployments. It enables a simple, pluggable, and complete story for Production ML Serving including prediction, pre-processing, post-processing and explainability. KFServing is being used across various organizations.
And from Seldon Core’s docs:
Seldon Core converts your ML models (TensorFlow, PyTorch, H2O, etc.) or language wrappers (Python, Java, etc.) into production REST/gRPC microservices.
Seldon handles scaling to thousands of production machine learning models and provides advanced machine learning capabilities out of the box including Advanced Metrics, Request Logging, Explainers, Outlier Detectors, A/B Tests, Canaries and more.
KFServing is a collaboration between several companies that are active in the ML Space (namely Seldon, Google, Bloomberg, NVIDIA, Microsoft, and IBM), to create a standardized solution for common ML Serving problems.
Hence it’s no surprise that the two share similar mechanisms and even code components.
Seldon seems more mature as a project, with more comprehensive documentation, more frequent releases, and a community with an active Slack channel, as well as bi-weekly working group calls. It has proved to be extremely useful for me (now you know which one I chose ;)
Here you can find a detailed comparison between the two.
Next we’ll cover some of their main features:
Seldon introduces the notion of Reusable Inference Servers (a single server image that loads model binaries from storage, so many models can share one image) vs. Non-Reusable Inference Servers (a custom image built per model).
It provides out-of-the-box Prepackaged Model Servers for standard inference using SKLearn, XGBoost, TensorFlow, and MLFlow.
Seldon Core and KFServing take a similar approach to deploying model prediction services, based on a Kubernetes CRD (Custom Resource Definition).
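For example, a minimal SeldonDeployment manifest (say, my_ml_deployment.yaml) using the prepackaged SKLearn server might look roughly like this; the model URI points at Seldon’s public examples bucket:

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: iris-model
spec:
  predictors:
    - name: default
      replicas: 1
      graph:
        name: classifier
        implementation: SKLEARN_SERVER
        modelUri: gs://seldon-models/sklearn/iris
```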
And then you can deploy it with:
kubectl apply -f my_ml_deployment.yaml
Supported API Protocols
As both of these projects are still in active development, the protocols keep changing too.
KFServing has the Data Plane (V1) Protocol, while Seldon Core has its own Seldon Protocol. Both support the Tensorflow Protocol.
The latest effort regarding protocols is KFServing’s Data Plane (V2) proposal. It is still a work in progress, as is the pre-packaged server implementing it, developed under Seldon (SeldonIO/MLServer).
A notable difference between the two is that while KFServing is focused on the “simple” use case of serving a single model, Seldon Core allows more complex inference graphs, which may include multiple models chained together with Routers, Combiners, and Transformers.
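A sketch of what such a graph looks like inside a SeldonDeployment spec (the component names below are hypothetical; each node’s type tells Seldon how to route requests through it):

```yaml
# Hypothetical two-stage graph: a transformer pre-processes the
# request, then hands it to the model.
spec:
  predictors:
    - name: default
      graph:
        name: my-transformer
        type: TRANSFORMER
        children:
          - name: classifier
            type: MODEL
            children: []
```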
BentoML
From BentoML’s docs:
A simple yet flexible workflow empowering Data Science teams to continuously ship prediction services.
Unified model packaging format enabling both online and offline serving on any platform.
100x the throughput of your regular flask based model server, thanks to our advanced micro-batching mechanism. Read about the benchmarks [here](https://github.com/bentoml/benchmark).
Deliver high quality prediction services that speak the DevOps language and integrate perfectly with common infrastructure tools.
BentoML’s approach to creating a prediction service is similar to Seldon Core and KFServing’s approach to creating wrappers for custom models.
The gist of it is subclassing a base class and implementing your prediction code there.
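For illustration, Seldon’s Python language wrapper follows a similar pattern, although it is duck-typed rather than subclassed: any class exposing a predict method can be turned into a service by the seldon-core-microservice runtime. A toy sketch (the threshold “model” is a stand-in for real loaded artifacts):

```python
import numpy as np


class MyModel:
    """Seldon-style Python wrapper: the runtime serves any class
    that exposes predict(X, feature_names)."""

    def __init__(self):
        # A real wrapper would load trained artifacts here, e.g.
        # self.model = joblib.load("model.joblib")
        self.threshold = 0.5  # toy stand-in for a trained model

    def predict(self, X, feature_names=None):
        # Toy logic: a real wrapper would return self.model.predict(X)
        X = np.asarray(X, dtype=float)
        return (X[:, 0] > self.threshold).astype(int)
```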
The main notable difference between “The K8s Projects” and BentoML is the absence of the reusable inference servers notion. Although you could probably implement it on top of BentoML’s framework, its approach is generally oriented towards creating non-reusable servers, while “The K8s Projects” explicitly offer both options and have built-in framework support for reusable servers (for example, a container initializer that loads the model file from storage on boot).
Deployment
This is another major difference between BentoML and “The K8s Projects”. While the latter are obviously built upon Kubernetes and have the streamlined deployment mechanism described above, BentoML is deployment-platform-agnostic and offers a wide variety of options. An interesting point of comparison is BentoML’s guide to deploying to a Kubernetes cluster.
All in all, it’s not that different, except that the absence of reusable inference servers forces you to build and push a Docker image for every model you want to deploy. This could become an issue with a growing number of models, and it adds to the complexity of the process.
Another interesting synergy between the discussed solutions is Deploying a BentoML to KFServing.
MLFlow
From MLflow’s docs:
An MLflow Model is a standard format for packaging machine learning models that can be used in a variety of downstream tools — for example, real-time serving through a REST API or batch inference on Apache Spark. The format defines a convention that lets you save a model in different “flavors” that can be understood by different downstream tools.
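Concretely, the flavors live in an MLmodel metadata file saved alongside the model artifacts; a rough sketch (the paths and versions below are hypothetical):

```yaml
artifact_path: model
flavors:
  python_function:
    loader_module: mlflow.sklearn
    model_path: model.pkl
  sklearn:
    pickled_model: model.pkl
    sklearn_version: 0.24.1
```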
Similar to BentoML, we end up with a deployable unit (either a containerized REST API server or a Python function), and then we have to figure out how to deploy it.
BentoML’s docs go as far as comparing it to MLFlow.
Here too there is a synergy between solutions, as Seldon Core has a pre-packaged inference server for MLFlow Models.
As I’ve stated before, I chose Seldon Core for this project.
The reasons are:
- Frequent releases.
- Extensive documentation.
- Active community.
- Pre-packaged inference servers.
- Simple deployment process.
In my use case, ideally, I would like to provide Data Scientists with tools to deploy a large number of models independently, with minimal Engineering and DevOps efforts for each deployment. The approach of re-usable inference servers along with CRD based deployment seems to be most suitable for that.