Written by Łukasz Murawski; and peer reviewed by Mulugheta Solomon
Machine Learning (ML) is capable of solving a variety of real-world problems. But an ML model can only prove its value when it is exposed to the public. That is why production deployment is a crucial step of the ML lifecycle.
And just as ML model development is an iterative process, so is deployment. The sooner feedback from end-users is collected, the faster the model can be improved, and the better the ML solution that can be delivered. This, in turn, results in a model having greater impact and higher business value.
Shortening the time to market is important, but so is the reliability and maintainability of the whole solution. That turns out to be challenging, especially in ML systems, since there is not only code and infrastructure that need to be controlled but also the data.
In recent years, the community of ML practitioners identified the biggest pain points and proposed best practices and frameworks to be used for efficient continuous delivery and automation of machine learning workflows, so-called Machine Learning Operations (MLOps). It derives from already existing concepts of DevOps (from software engineering) and DataOps (data engineering) but additionally deals with issues related solely to ML.
ML-enabled system challenges
The ML model is the heart of every ML-based solution. Although crucial, it is just one of many pieces that need to be in place before release. The paper “Hidden Technical Debt in Machine Learning Systems” ¹ was one of the first to point out that ML code makes up only a small fraction of a real-world ML system.
Technical debt has become a common metaphor for the accumulation of software design and implementation choices that seek fast initial gains but are suboptimal and counterproductive in the long run. The authors claim that ML systems are especially susceptible to technical debt because they have all the problems of traditional software engineering plus an additional set of ML-specific challenges. Let’s try to understand some of them.
In production systems, data should be treated as a first-class citizen. If you do not control the data, you can only hope that the system will produce the results it is supposed to. A reliable data collection process is a must because data dependencies cost more than code dependencies. Data cleaning and verification should also contain tight schema control. And it’s not enough just to handle missing values, anomalies, and data types.
Differences in distribution between training and online data may cause so-called training/serving skew. So, for example, if one group of customers is, in reality, significantly more frequent than in the training dataset, the overall performance will lean towards the performance achieved by that particular group in the training process. And it will be different than we expect.
The other problem is so-called data drift. It’s a natural situation in which the live data distribution gradually changes over time, for example when a younger population enters the market or traffic is progressively switched from one source to another.
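One way to make skew and drift measurable is to compare the distribution of a feature in the training data with its distribution in live data. The sketch below uses the Population Stability Index (PSI), which is a common heuristic but not something the text above prescribes; the function name, the number of bins, and any alerting threshold you attach to the result are assumptions to tune for your own data.

```python
import numpy as np


def population_stability_index(reference, live, bins=10):
    """Compare two samples of one numeric feature; 0.0 means identical bin shares."""
    reference = np.asarray(reference, dtype=float)
    live = np.asarray(live, dtype=float)

    # Interior bin edges come from quantiles of the reference (training) sample,
    # so each bin holds roughly the same share of the training data.
    inner_edges = np.quantile(reference, np.linspace(0.0, 1.0, bins + 1)[1:-1])

    def bin_shares(sample):
        idx = np.searchsorted(inner_edges, sample, side="right")
        counts = np.bincount(idx, minlength=bins)
        # Clip to avoid log(0) and division by zero for empty bins.
        return np.clip(counts / counts.sum(), 1e-6, None)

    ref_share = bin_shares(reference)
    live_share = bin_shares(live)
    return float(np.sum((live_share - ref_share) * np.log(live_share / ref_share)))


# Hypothetical example: customer age at training time vs. a younger live population.
rng = np.random.default_rng(seed=0)
training_ages = rng.normal(loc=40, scale=10, size=5_000)
live_ages = training_ages - 10
drift_score = population_stability_index(training_ages, live_ages)
```

The score grows as the live distribution moves away from the training one, so it can be computed on a schedule per feature and fed into alerting.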
It’s also important to monitor provenance (where the data come from) and lineage (the sequence of steps needed to get to the end of the data pipeline). Not just for the sake of reproducibility and debugging but also for legal compliance.
Managing hardware configurations in different environments is challenging: runtime versions, computation resources, security, access rights, load balancing, and so on all have to be handled. Usually, training is resource-intensive and requires high-memory, high-CPU/GPU machines, while inference requires smaller but reliable and scalable instances serving prediction requests with low latency.
Each model version also comes with its own configuration to track: the features it used, how data was extracted, a wide variety of ML algorithm-specific settings, potential pre- or post-processing steps, and verification methods, as well as packaging, software versions, and dependencies.
Monitoring and alerting
ML models are prone to performance degradation almost by design. Why? Because they are at their best at release time. Over time, models get worse, which we call model staleness, model drift, or model decay. Most often, this is due to gradually changing data (data drift) or the fact that the “World” that the model learned to predict has changed. The latter is called concept drift. You’ll notice it when nothing but model performance deteriorates. It means that the relationship between model input and output no longer holds because of external factors. For example, customer behavior changed suddenly due to COVID-19.
Additionally, ML systems consist of many building blocks. It’s important to monitor every piece in order to identify the roots of problems as quickly as possible. Not only model performance metrics (quality of produced predictions) but also data quality and distributions, business metrics (like ROI or CTR), and hardware (availability, latency, failure ratio, resource usage).
The biggest post-deployment challenge is rolling out new or retrained model versions. It is easy if there’s just one live model deployed in a single place, scoring offline batches of data on a regular basis. But if the model is deployed to multiple instances, serving online traffic with strict availability requirements, updating is not trivial. There are different deployment patterns:
1. Shadow mode: the new model runs in parallel but does not make any decisions yet. Its output is logged for analysis.
2. Blue/Green: Blue is the old version of the system and Green is the new one. Traffic can be shifted from blue to green endpoints using different strategies (all at once, canary, linear/gradual ramp-up)².
In case something goes wrong, we need to be able to quickly roll back to the old version. Without a proper system in place (load balancers, endpoints, model repositories, dashboards), it’s a lot of complicated, manual work.
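To make these patterns more concrete, here is a minimal sketch of the routing logic only. It assumes a hypothetical request ID string and models that are plain callables; in a real system this logic usually lives in a load balancer or a managed endpoint, not in application code.

```python
import hashlib


def assign_variant(request_id: str, green_share: float) -> str:
    """Send a stable fraction of traffic to the new (green) version.

    green_share=0.0 keeps everything on blue (this is also the rollback),
    a small value such as 0.1 is a canary, and 1.0 completes the switch.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "green" if bucket < green_share else "blue"


def predict_with_shadow(live_model, shadow_model, features, shadow_log):
    """Shadow mode: only the live model's answer is returned to the caller."""
    prediction = live_model(features)
    # The shadow model's output is stored for offline comparison.
    shadow_log.append((features, shadow_model(features)))
    return prediction
```

Hashing the request ID (rather than drawing a random number) keeps the assignment stable, so the same ID always reaches the same version while the share is unchanged. A linear ramp-up is then just a schedule that raises green_share step by step.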
What is MLOps?
MLOps is a concept of facilitating fast and reliable transitions between stages of the ML project lifecycle. It’s a methodology for ML engineering that unifies ML system development (the ML element) with ML system operations (the Ops element)³.
It introduces sets of standards and methodologies for building, deploying, and operationalizing ML systems rapidly, reliably, and at scale. It handles typical challenges related to application deployment, like resilience, queries per second, and load balancing, but also the ML-specific ones mentioned in the section above.
“When you deploy an ML model, you also need to worry about changes in data, changes in the model, users trying to game the system, and so on. This is what MLOps is about”. ⁴
CI/CD is at the core of software development automation. It’s a concept of frequent (continuous) integration (CI) of new code with the existing codebase and automatic delivery or even deployment (CD) to the production-ready environment. In most cases, the “D” in CD stands for continuous Delivery, not continuous Deployment. It means that artifacts are built and delivered to the production environment but do not automatically replace working pieces. For that to happen, humans usually still need to sign off.
CI starts with the developer pushing a fresh and shiny piece of functionality to the code repository (e.g. GitHub or GitLab). CI consists of automatic code quality evaluation, testing, integration with the current repository, and packaging (e.g. a standalone Python package or a container with the code package).
Created artifacts are then delivered to the target environment and secured, provisioned, approved for serving live traffic, and deployed manually or automatically.
It’s a common practice that Continuous Delivery is implemented solely on major branches like DEV, UAT, and PROD. It is because it makes sense to create new artifacts only after successfully integrating new functionality with the existing codebase (successful pull request).
However, ML systems present unique challenges to core DevOps principles like Continuous Integration and Continuous Delivery (CI/CD). For example, artifacts mentioned before are not only a single software package or a service but a system (usually an ML training pipeline) that should automatically deploy another service (model prediction service).
A new property unique to ML systems is Continuous Training (CT), which is dedicated to automatically retraining models when their performance degrades. It’s a viable option when data drift occurs. For more on that topic, refer to “Continuous Training of Machine Learning Models”.
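As a rough illustration of a CT trigger, the sketch below decides whether to retrain by comparing recent values of a quality metric with the value measured at release time. The function name, the tolerance, and the minimum number of samples are all hypothetical choices, and a real trigger may also look at data drift or simply run on a schedule.

```python
from statistics import mean


def should_retrain(recent_scores, baseline_score, tolerance=0.05, min_samples=5):
    """Return True when the average recent score drops too far below the baseline.

    recent_scores: latest values of a "higher is better" metric (e.g. accuracy).
    baseline_score: the value measured when the model was released.
    """
    if len(recent_scores) < min_samples:
        # Too little evidence to justify a retraining run.
        return False
    return mean(recent_scores) < baseline_score - tolerance
```

A scheduler or pipeline orchestrator could call this check after each evaluation batch and start the training pipeline when it returns True.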
An ML pipeline is a set of steps related to the machine learning workflow. At a high level, it usually consists of the following:
To construct ML pipelines, components need to be reusable, composable, and potentially shareable across ML pipelines. Therefore, while the EDA code can still live in notebooks, the source code for components must be modularized. In reality, ML pipelines have a lot of interdependent steps, so they are often represented as DAGs (Directed Acyclic Graphs). A DAG is a collection of tasks organized in a way that reflects their relationships and dependencies.
One reason why it’s better to schedule and run ML workflows as pipelines rather than separate function calls is consistency. It means having the same ordered steps for each and every run. That prevents differences between data processing during the training and prediction phases. It provides reproducibility while moving between environments (DEV->UAT->PROD) and facilitates easy hyperparameter optimization of the whole system (not just a model).
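The idea of a DAG can be sketched with nothing but the Python standard library (graphlib, available from Python 3.9). The step names below are hypothetical placeholders, not the steps of any particular framework; each step lists the steps it depends on, and the sorter derives an order in which every dependency runs first.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key is a step, each value the steps it depends on.
PIPELINE = {
    "extract_data": set(),
    "validate_data": {"extract_data"},
    "engineer_features": {"validate_data"},
    "train_model": {"engineer_features"},
    "evaluate_model": {"train_model", "engineer_features"},
    "deploy_model": {"evaluate_model"},
}


def run_pipeline(dag, tasks):
    """Run every task once, always after all of its dependencies."""
    executed = []
    for step in TopologicalSorter(dag).static_order():
        tasks[step]()  # a real orchestrator would also retry, log, and cache here
        executed.append(step)
    return executed


# Placeholder tasks that do nothing; real ones would call the modularized code.
order = run_pipeline(PIPELINE, {step: (lambda: None) for step in PIPELINE})
```

Because the order is derived from the declared dependencies rather than from the order of function calls, every run executes the same steps in a valid sequence, which is the consistency argument made above. A cycle in the graph makes TopologicalSorter raise a CycleError instead of running anything.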
In some scenarios, the data and feature processing pipeline (DataOps) is separated from the MLOps pipeline:
In that case, the DataOps part ends with saving features to a database or a Feature Store (a specialized version of a database designed for storing and serving ML features with low latency) like Feast. In this scenario, the MLOps part starts with extracting already curated features:
Artifact repositories and ML metadata tracking
MLOps means automating and controlling ML systems at every stage of the ML development lifecycle in order to bring efficiency and shorten development to deployment time. To achieve that, modern ML Platforms introduce new building blocks. For example, Artifact Repositories for storing data schemas, data statistics, and trained models. ML Metadata Stores that keep track of metadata related to each pipeline step along with associated input and output artifacts. Those components facilitate debugging, audit, reproducibility, and collaboration.
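What a metadata store records can be illustrated with a toy in-memory version. This is only a sketch under strong assumptions: the class and method names are made up, artifacts are identified by a content hash, and steps are assumed to be logged in execution order. Real metadata stores persist this information and track much more.

```python
import hashlib
from dataclasses import dataclass


def fingerprint(payload: bytes) -> str:
    """Identify an artifact by the hash of its content."""
    return hashlib.sha256(payload).hexdigest()


@dataclass
class StepRecord:
    step: str
    params: dict
    inputs: dict   # artifact name -> fingerprint
    outputs: dict  # artifact name -> fingerprint


class MetadataStore:
    """Toy store; assumes steps are logged in the order they were executed."""

    def __init__(self):
        self.records = []

    def log(self, step, params, inputs, outputs):
        self.records.append(StepRecord(step, params, inputs, outputs))

    def lineage(self, artifact_hash):
        """Walk back from an artifact to every step that contributed to it."""
        steps, wanted = [], {artifact_hash}
        for record in reversed(self.records):
            if wanted & set(record.outputs.values()):
                steps.append(record.step)
                wanted |= set(record.inputs.values())
        return list(reversed(steps))
```

With records like these, questions such as "which data and parameters produced this model?" can be answered by a lookup, which is what makes debugging and audits practical.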
There’s a huge variety of ML tools in the MLOps world. Some of them aim to provide only selected functionalities while others are designed as complete end-to-end MLOps platforms. Many of those tools greatly simplify MLOps work, taking off some heavy lifting, although sometimes at the cost of lower flexibility. Here’s a great article if you want to find out more about the whole ecosystem.
My goal here is only to present a few popular open-source MLOps tools: from the very simple Unix Make, which out of the box offers only execution of dependent scripts, through the popular general-purpose orchestrator Airflow, up to dedicated e2e platforms like Kubeflow and TensorFlow Extended that support many MLOps-specific concepts.
One of the simplest ways to manage and run dependent scripts is to use Make. Originally, the tool was created on Unix-based platforms, although Windows implementation is now also available. It utilizes a simple file called Makefile, where you define rules with a series of commands necessary to accomplish a step. Those steps can be grouped, and basic dependencies can be created between them. For example, ML jobs could be represented in the following way:
In this example, feature engineering can be executed with a simple make feature_engineering command. Similarly, triggering the end-to-end pipeline is as simple as running make e2e_run from the CLI. Make is the method of choice in Cookiecutter Data Science, a popular repository template for Data Science. It is a convenient tool for running small and simple jobs on a single machine, but it doesn’t offer any functionality beyond what you explicitly code in your scripts. As the project grows, we might need a more scalable solution, capable of executing jobs across clusters of machines, with a richer user interface and monitoring. This is where Airflow comes in handy.
Airflow is an orchestration framework. It’s an open-source tool used for developing, scheduling, and tracking multi-stage workflows. Airflow has great community support and is easy to use. Even complicated workflows are created via relatively simple Python scripts. DAGs are at the heart of Airflow. They are used to manage workflow orchestration. Developers define tasks and dependencies in a Python script, and then Airflow manages the scheduling and execution. It offers an interactive GUI with robust monitoring.
You can install Airflow via pip and create DAGs locally before moving them to the remote cluster. It is also offered as a managed service by major cloud providers.
It’s popular among Data Engineers, who use it for scheduling ETL jobs. Because that skill set already exists in many organizations, Airflow often becomes the tool of choice for scheduling ML pipelines as well.
Airflow is a general-purpose tool. It does not specifically address challenges related to ML systems. That’s why there are other, more specialized platforms.
Kubeflow is an open-source end-to-end platform for developing, deploying, and managing portable and scalable enterprise-grade ML apps. It contains a set of specialized ML tools and frameworks. It addresses ML-specific needs like resource-intensive training phase and lightweight, low latency scalable inference.
It’s built on top of Kubernetes, an open-source platform for running and orchestrating containers.
Kubeflow is built upon three principles:
1. Composability – each ML stage is an independent system, meaning that at each stage you can choose only the pieces that are necessary at that point.
2. Portability – run the pipeline in the same way in different environments (laptop with kind, on-premise, in the cloud).
3. Scalability – resources on demand, utilized only when necessary.
The Kubeflow platform consists of the following components:
- Kubeflow Notebooks
- Kubeflow Pipelines
- KFServing/KServe (standalone deployment module)
- Katib (AutoML for hyperparameter tuning and neural architecture search)
- Central Dashboard
ML workflows are developed using Kubeflow Pipelines. A pipeline definition includes the inputs (parameters) required to run it, the components it is made of, and the dependencies between them, which form a graph. Inputs and outputs of each component are strictly defined and controlled.
When you run a pipeline, the system launches one or more Kubernetes Pods corresponding to the steps (components) in your workflow (pipeline). Pods start Docker containers that, in turn, start your programs. Each experiment runs in a separate Kubernetes environment. Experiments’ results are presented in a dashboard that facilitates tracking artifacts and executions.
TensorFlow Extended (TFX) is also a platform for building and managing production-grade ML workflows. Developed by Google, it derives from lessons they learned building large-scale systems on Kubeflow. TFX provides a set of specialized libraries with components designed for scalable, high-performance machine learning tasks.
TFX provides ML-specific components. For example, Transform that performs feature engineering on the dataset or Pusher that deploys the model on a serving infrastructure.
It consists of libraries like TensorFlow Data Validation (TFDV) dedicated to data validation, data distribution comparison, and schema control that helps to keep consistency between training and serving tasks. ML Metadata (MLMD) keeps track of metadata at each stage of an ML workflow.
TFX uses TensorBoard dashboards for training results visualization and model performance monitoring.
At the infrastructure level, all data processing and feature engineering jobs are executed by Apache Beam in order to implement data-parallel pipelines. For orchestration, it uses Airflow or Kubeflow.
I encourage you to study the concepts of this framework in order to better understand production ML challenges.
Who is an MLOps engineer?
An MLOps engineer, by definition, is responsible for the deployment and maintenance of ML systems. From my personal perspective, though, it’s much more. MLOps engineers have the potential to help DS teams deliver faster by optimizing the whole ML development process. Thus they should be present at each step of the ML lifecycle in order to automate, develop pipelines, and introduce the best software and system engineering practices as early as possible.
In reality, at the time of writing, the understanding of MLOps varies between companies. The MLOps definition and responsibilities depend on the team’s structure, legacy relationships between departments, infrastructure, available skill sets, etc. In many organizations, MLOps Engineer does not function as a separate role. Since MLOps is actually a set of ML engineering practices, it’s ML Engineers who are often responsible for the deployment. In some organizations, data engineers do this job due to their experience with automation tools like Airflow. In others, where deployment requires more infrastructure-related skills, IT/cloud architects deploy and maintain ML systems on-premise or in the cloud.
However, at some point, along with the increase in the number and sophistication of ML applications, companies create dedicated MLOps roles.
Core MLOps skills revolve around production machine learning and automation. Production machine learning is a combination of best software engineering practices applied to the ML code.
Let’s be honest. When we have to come up with a Proof of Concept for an ML solution in a month or so, there’s too much hassle with data wrangling, feature engineering, model tuning, and results analysis to worry about code quality. And as great as Jupyter Notebooks are, they do not enforce best practices (thank God!). After a phase of experimentation, technical debt should be paid down by refactoring code, reducing dependencies, deleting dead code, securing APIs, improving documentation, and more. I think that software engineering skills are undervalued in the ML world.
So if you’re like me, meaning your background is not in computer science, you might find useful articles devoted to the topic, like “Software engineering fundamentals for Data Scientists”, or try to upskill on platforms like LeetCode.
My general software engineering advice:
- Write Clean and Modular code: Remove duplication, reuse, increase readability and modularity.
- Use code linters (e.g. flake8) – a linter is a tool that inspects your code and gives feedback. It can point out issues in your program and often suggest a way to resolve them. You can run it anytime to ensure that your code meets the expected quality standard.
- Use code formatters (e.g. black)
- Learn the art of writing tests. Tests are highly underestimated. Unit and integration tests will help you at the development stage, and end-to-end tests will give you confidence that new software versions do not break the whole e2e pipeline (e.g. familiarize yourself with pytest).
- Document your functions properly following Sphinx markup syntax in order to generate your docs automatically.
- Learn how to manage your Python environment and package the code (venv, setuptools, and pyenv).
- A unified project structure will help you to set up your next project. Take a look at cookiecutter-data-science or kedro for a ready-to-use structure. You can also search for cookiecutter-data-science with integrated MLflow for easy, embedded experiment tracking.
- Learn and use a code IDE like Visual Studio Code or PyCharm – they have many plugins to help you produce better code.
- Use Git – it’s a must these days. Embrace pull requests for cross-checks and reviews. Read “The anatomy of perfect pull request” by Hugo Dias.
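Two of these points, Sphinx-style documentation and unit tests, can be shown on one small function. The function itself is a made-up example; the test functions follow the naming convention that pytest discovers automatically (names starting with test_) and use plain assert statements.

```python
def clip_outliers(values, lower, upper):
    """Clip every value to the closed interval [lower, upper].

    :param values: numbers to clip
    :type values: list[float]
    :param lower: smallest allowed value
    :type lower: float
    :param upper: largest allowed value
    :type upper: float
    :return: a new list with out-of-range values replaced by the nearest bound
    :rtype: list[float]
    :raises ValueError: if lower is greater than upper
    """
    if lower > upper:
        raise ValueError("lower must not be greater than upper")
    return [min(max(value, lower), upper) for value in values]


def test_clip_outliers_replaces_extreme_values():
    assert clip_outliers([-5, 0, 5, 50], lower=0, upper=10) == [0, 0, 5, 10]


def test_clip_outliers_keeps_values_in_range():
    assert clip_outliers([1, 2, 3], lower=0, upper=10) == [1, 2, 3]
```

Saved in a file such as test_cleaning.py (a hypothetical name), these tests run with the pytest command, and the docstring fields can be picked up by Sphinx when generating documentation.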
Production ML code development advice:
- Notebooks are great for EDA and experimentation! For production – they are not.
- Use pipelines instead of separate scripts for reproducibility, integrity, hyperparameter search.
- Check if your code is efficient
  - with memory:
    - use proper (lowest footprint) data types (beware of the “Object” type)
    - avoid unnecessary copying of DataFrames
  - with execution speed:
    - use vectorization
    - numpy instead of pandas
    - use multiprocessing for parallelization wherever possible
Google the principles of effective pandas.
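A small sketch of two of the points above, data types and vectorization, assuming a made-up orders table (all column names and values are hypothetical):

```python
import numpy as np
import pandas as pd


def make_orders(n_rows: int = 9_000) -> pd.DataFrame:
    """Build a small, deterministic example table (n_rows must be divisible by 3)."""
    return pd.DataFrame(
        {
            "customer_id": np.arange(n_rows, dtype=np.int64),
            "segment": np.tile(["retail", "business", "premium"], n_rows // 3),
            "price": np.arange(n_rows, dtype=np.int64) % 50 + 1,
            "quantity": np.arange(n_rows, dtype=np.int64) % 7 + 1,
        }
    )


def shrink(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a smaller memory footprint."""
    out = df.copy()
    # Pick the smallest integer type that can hold all the IDs.
    out["customer_id"] = pd.to_numeric(out["customer_id"], downcast="integer")
    # A few repeated strings are much cheaper to store as a category.
    out["segment"] = out["segment"].astype("category")
    # Columns used in arithmetic are left as int64 here, because narrow
    # integer types can overflow silently when multiplied.
    return out


def revenue_loop(df: pd.DataFrame) -> int:
    """Row-by-row Python loop: easy to write, slow on large tables."""
    return int(sum(row.price * row.quantity for row in df.itertuples()))


def revenue_vectorized(df: pd.DataFrame) -> int:
    """Same result computed on whole columns at once."""
    return int((df["price"] * df["quantity"]).sum())
```

Comparing df.memory_usage(deep=True).sum() before and after shrink shows the saving, and timing the two revenue functions on a larger table shows why vectorization is worth the habit.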
Although data does not always come from relational databases, I’d say that SQL is a must. Sooner or later, you’ll face the need to write queries. While at the ML experimentation phase it’s often convenient and feasible to load raw chunks of data into a pandas DataFrame, in production it’s often cheaper and more effective to leverage specialized DB engines. Sometimes you just have to refactor existing queries; other times, you’ll use SQL just to make sure that everything is correct. Regardless of the case, you need a solid SQL base:
- Understand joins.
- Learn and understand different aggregation functions to be used in GROUP BY
- Understand windowing/analytical functions
- Use the “with” statement to organize complex code
- Read about writing performant queries. Use the EXPLAIN statement to inspect and optimize query plans.
You’ll be amazed at how powerful SQL is.
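Several items from this list fit into one short query. The sketch below runs it from Python against an in-memory SQLite database so that it is self-contained; the tables and values are made up, and window functions require a reasonably recent SQLite (3.25 or later).

```python
import sqlite3

QUERY = """
WITH customer_totals AS (              -- WITH names an intermediate result
    SELECT c.name AS customer, c.region AS region, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id    -- inner join
    GROUP BY c.id, c.name, c.region             -- aggregation per customer
)
SELECT customer, region, total,
       RANK() OVER (PARTITION BY region ORDER BY total DESC) AS rank_in_region
FROM customer_totals
ORDER BY region, rank_in_region, customer;
"""


def run_example():
    """Create two tiny tables, run the query, and return the result rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
        INSERT INTO customers VALUES (1, 'Ada', 'EU'), (2, 'Bob', 'EU'), (3, 'Cy', 'US');
        INSERT INTO orders VALUES (1, 1, 100), (2, 1, 50), (3, 2, 70), (4, 3, 30);
        """
    )
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows
```

The window function ranks customers within their region without collapsing the rows, which is the key difference from a plain GROUP BY.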
Another important skill is containerizing your code in order to gain independence from hardware. Fortunately, you need just the basics. Learn Docker or the open-source Podman. For best practices, read this article.
Companies are moving to the cloud. You need the basics at least from one provider. Choose one and learn what it has to offer in terms of ML and MLOps (most of them offer free tiers). Then, you’ll be easily able to transfer your experience to a different one.
MLOps engineers work in close cooperation with Data Scientists, Data Engineers, and IT. Good communication skills are crucial here. Whether it’s about clarifying ML code nuances during refactoring, integrating with data engineering pipelines, or provisioning secure resources in the IT-managed infrastructure ecosystem. Setting up an ML system requires coordination and management skills built on a solid base of clear communication.
MLOps is an ML engineering culture and practice that aims at unifying ML system development (Dev) and ML system operation (Ops). MLOps Engineer is an emerging role with great potential and huge demand. Companies want to get value out of their investments in DS product development. Regardless of the role title, MLOps skill set is crucial in order to shorten the time to market and build high-quality and reliable ML-enabled systems.
My personal view is that MLOps competencies, consisting of solid software development, ML modeling understanding, and automation, will be highly sought after in the next couple of years. So regardless of whether you plan to take up this role or just want to raise your market value as an ML engineer or Data Scientist, MLOps provides a set of skills worth acquiring. Happy learning.
¹ “Hidden Technical Debt in Machine Learning Systems – NeurIPS ….” http://papers.neurips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems.pdf. Accessed 17 Apr. 2022.
² “Blue/Green Deployments – Amazon SageMaker.” https://docs.aws.amazon.com/sagemaker/latest/dg/deployment-guardrails-blue-green.html. Accessed 17 Apr. 2022.
³ “Practitioners Guide to MLOps | Google Cloud.” https://cloud.google.com/resources/mlops-whitepaper. Accessed 17 Apr. 2022.
⁴ “Practitioners Guide to MLOps | Google Cloud.” https://cloud.google.com/resources/mlops-whitepaper. Accessed 17 Apr. 2022.