Is Kubeflow designed for Data Scientists? Yes, but…
Disclaimer: I am a software engineer. I love Machine Learning, though. The last five big projects I've been involved in all include Machine Learning and data components. This story is an engineer's perspective on Kubeflow.
As a software engineer, I’m currently part of a Data Science team of a big mobile software company. As such, I’m involved in the infrastructure side of the ML projects. That includes building systems and infrastructure to run ML from the research phase through data collection and analysis, model development and testing, all the way to integration with the company’s product.
I also love DevOps. For me, creating a Kubernetes cluster, deploying a pod, and exposing it via ingress, is fun! So I guess I can say that I’m also involved in the MLOps side of those projects.
Part of my duty is to find better infrastructure for the ML projects. This is how I got to Kubeflow. Indeed, Kubeflow is a great tool. It allows its users to take advantage of the power of Kubernetes to run ML projects. It includes tools for research, in the form of Jupyter Notebook servers, and for training and testing ML models, via its great Kubeflow Pipelines tool. It also includes model-serving tools. There is even a hyperparameter-tuning and neural-architecture-search tool called Katib. No doubt, it is an impressive suite. It is apparent right from the start that Kubeflow has some significant benefits:
- It is based on Kubernetes. As a user, you have under your belt the best orchestrating tool, which includes hardware provisioning, auto-scaling, scheduling, and monitoring. Considering most of the products today are using K8s and its vast ecosystem and community — it is by far the most critical trait of Kubeflow.
- It includes most of the ML lifecycle tools. Jupyter Notebook for research, Pipelines for training and experimenting, Serving for deployment.
- The pipeline tool includes experiment management — one of the pain points of model development. The pipeline tool contains a great UI to compare runs, parameters, and the performance of the model. You can understand more easily which parameter influences your model for the better and which one hinders the accuracy.
So, it sounds like Kubeflow is an excellent system, and it is! So I started investigating it, keeping in mind that the target crowd is the Data Scientists in our team. I examined the Jupyter Notebooks servers, Kubeflow Pipelines, and model Serving modules of Kubeflow. Here is my verdict.
Jupyter Notebook Servers
Users can easily create a notebook server. They can choose the server’s hardware requirements (CPU, Memory, and Disk), which image to base it on (which determines the initial suite of python version and packages installed), and can even provide a customized image. You can even request a GPU to run the notebooks on (if your K8s cluster has GPU nodes to spare). Also, starting and stopping each server is simple. Multiple servers can be created for different purposes. It is an excellent tool for Data Scientists, allowing them to easily manage their notebook with no help from DevOps.
Kubeflow Pipelines
No doubt: the Kubeflow Pipelines module is the main ingredient in the dish. It is a tool for creating experiments in the form of a DAG of tasks and scheduling them. The tasks can include data gathering and preparation, training the model, evaluating it, and persisting it for serving (see next section).
Each step can produce artifacts that detail its outcome. The artifacts include visualizations, model accuracy measurements, and other goodies that make the task of comparing experiments much simpler.
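To make this concrete, here is a minimal sketch of how a pipeline step can emit a metrics artifact for the Pipelines UI. It follows the kfp v1 convention of the step writing a `/mlpipeline-metrics.json` file inside its container; the metric name, value, and file path used below are illustrative.

```python
import json

def write_metrics(accuracy: float, path: str = "mlpipeline-metrics.json") -> dict:
    """Write a metrics artifact in the shape the Kubeflow Pipelines UI expects.

    In a real pipeline step this file is conventionally written to
    /mlpipeline-metrics.json in the step's container (kfp v1 convention).
    """
    metrics = {
        "metrics": [
            # "PERCENTAGE" tells the UI to render the value as a percentage.
            {"name": "accuracy-score", "numberValue": accuracy, "format": "PERCENTAGE"},
        ]
    }
    with open(path, "w") as f:
        json.dump(metrics, f)
    return metrics

# Example: a training/evaluation step reporting its accuracy.
result = write_metrics(0.91)
```

Once a run finishes, these values show up in the experiment-comparison UI described above, which is what makes comparing runs side by side so convenient.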
To create a Kubeflow Pipeline, one needs to perform the following steps:
- Write the code for each step. In most cases, this is Python code that performs the step, such as model training, testing, or evaluation. Sometimes it involves scripts that run commands, e.g. bash scripts.
- Wrap the code in a Docker image. A Kubeflow Pipeline step is expressed as an Operator, much like Airflow operators. The most common and flexible Operator is the ContainerOp, which allows the user to run a Docker container. That way, the user can decide in which language/technology to code the step. There is no special requirement for this image other than that it finishes its task without errors.
- Define the Pipeline. In this step, the user defines the flow of the DAG, the steps it includes, and the dependencies between them. The pipeline is a Kubernetes custom resource, defined using a YAML file. Here, Kubeflow provides a great Python SDK called kfp to define the pipeline using a simple Python script, avoiding the need to write the YAML or even understand what a CRD is. Such a Python script is definitely within the capabilities of the average Data Scientist.
As you can see from the steps above, an average Data Scientist will need some help from a DevOps engineer or a software engineer to deploy a pipeline. There are many articles about what it takes to become a data scientist, but almost none of them lists Docker knowledge as a required skill.
It is possible to deploy a pipeline from a Jupyter Notebook. Still, in most cases, using Jupyter Notebooks for the production part of the system is not recommended, as managing source control for a Jupyter Notebook is not trivial.
Model Serving
Finally, what is a model worth if you cannot use it to predict? One of the most common approaches is to create a server that receives prediction requests, uses the model to predict, and returns an answer.
Kubeflow includes many model-serving options, all of which involve defining the server using YAML and deploying it to K8s. This is, again, a step outside the capabilities of the average Data Scientist. A DevOps person is required to help define the service and take care of the cross-cutting concerns involved in such components, such as security, monitoring, and continuous deployment.
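To give a feel for what "defining the server using YAML" means, here is a sketch of a KFServing InferenceService, one of Kubeflow's serving options. The resource name, model location, and predictor type are illustrative, and the exact `apiVersion` depends on your KFServing/KServe version.

```yaml
# Illustrative KFServing InferenceService sketch -- names and URIs are hypothetical.
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: my-model            # hypothetical service name
spec:
  predictor:
    sklearn:
      storageUri: gs://my-bucket/models/my-model   # hypothetical model location
```

Even for a short manifest like this, someone still has to wire up ingress, authentication, and monitoring around it, which is exactly where the DevOps help comes in.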
I would expect Kubeflow to add some UI capabilities to deploy a model serving service and take care automatically of at least some of the concerns mentioned above.
As AI and ML projects become ever more popular, great tools are emerging to manage the lifecycle of an ML project. New tools are appearing like mushrooms after rain. These tools aim to reduce some of the labor mentioned above and perform those tasks in more straightforward ways. Some of them are even built upon Kubeflow.
One such project is MLRun.
Kubeflow is an essential step in the right direction of making production-grade ML-based systems achievable for Data Scientists. However, an average Data Scientist will find it hard to operate such a system without the help of a DevOps engineer or a software engineer.
I believe that as time goes by, the chore of building an ML-based system will become simpler and simpler for data scientists, without the need to practice DevOps.
Example code for a Kubeflow Pipeline: https://github.com/hcloli/kubeflow-natality-example
I’m going to present my experience with Kubeflow in a meetup on July 7th, together with Yaron Haviv from Iguazio. I’d love to see you there — https://www.meetup.com/full-stack-developer-il/events/271182548/
Is Kubeflow designed for Data Scientists? Yes, but… was originally published in Everything Full Stack on Medium.