forked from IBM/MLOps-CPD

This repo presents IBM's narrative of MLOps. It uses the services in IBM's Cloud Pak for Data stack to show what an MLOps flow looks like in practice.


iIias/MLOps-CPD

 
 


MLOps in Cloud Pak for Data

Welcome 👋 to our MLOps repository!

This documentation describes IBM's MLOps flow implemented using services in IBM's Cloud Pak for Data stack. This asset is specifically created to enable the rapid and effortless creation of end-to-end machine learning workflows, while simultaneously exhibiting the strengths of the products found in our Cloud Pak for Data stack.

We understand that organizations may come with unique ML use cases, which is why, alongside simplicity, we've placed a strong emphasis on modularity. Our solution documents the full fundamental workflow and comes with plug-and-play integration for custom-built models (PyTorch, TensorFlow, Keras...) and CI tests. Additionally, our project is built using Watson Pipelines, which offers extensive drag-and-drop modularity, giving you the flexibility to create tailored ML solutions that fit your organization's requirements.

Fig. 1.: Architecture of the MLOps flow

Overview

Dataset, Model and Data Science problem

Throughout the demo, we use a biased version of the German Credit Risk dataset. To predict credit risk, the setup instructions below use a scikit-learn Pipeline in which we place, train and test a LightGBM model. (Alternatively, you may provide a custom-built model or choose one from /custom_models.) The code is written in Python 3.9 and requires access to IBM Watson Studio, Watson Machine Learning, Watson Knowledge Catalog, and Watson OpenScale. The architecture consists of three stages: development, pre-prod, and prod. The process includes receiving code updates, training, deploying, and monitoring models.

Beyond common metrics (e.g. accuracy), it is crucial to account for fairness and ethical considerations in the decision-making process. Monitoring and testing must therefore be conducted regularly to identify and mitigate potential biases in the model.

Prerequisites on IBM Cloud

To use this asset, you need access to an authenticated IBM environment: an IBM Cloud account with the following services:

  1. IBM Watson Studio
  2. IBM Watson Machine Learning
  3. IBM Watson Knowledge Catalog with Factsheets and Model Inventory
  4. IBM Watson OpenScale

Please make sure you have appropriate access to all of these services.

The runs are also governed by the amount of capacity unit hours (CUH) you have access to. If you are running on the free plan please refer to the following links:

  1. https://cloud.ibm.com/catalog/services/watson-studio
  2. https://cloud.ibm.com/catalog/services/watson-machine-learning
  3. https://cloud.ibm.com/catalog/services/watson-openscale
  4. https://cloud.ibm.com/catalog/services/watson-knowledge-catalog

Branch Management

This repo has two branches, master and pre-prod. The master branch serves as the dev branch and receives direct commits from the linked CP4D project. When a pull request is created to merge the changes into the pre-prod branch, Jenkins automatically starts the CI tests.

Process Overview

In this repo we demonstrate three steps in the MLOps process:

  1. Development: runs orchestrated experiments and generates source code for pipelines.
  2. Pre-prod: receives code updates from the dev stage and contains CI tests to make sure the new code/model integrates well. It trains, deploys and monitors the model in the pre-prod deployment space to validate the model. The validated model can be deployed to prod once approved by the model validator.
  3. Prod: deploys the model in the prod environment, monitors it, and triggers retraining jobs (e.g. restarting the pre-prod pipeline or offline modeling).

1. Getting Started

1.1. Creating a Project Space in Watson Studio

You create a project to work with data and other resources to achieve a particular goal, such as building a model or integrating data.

(⚠️ We plan on offering this asset as a fully pre-built project space demo within the "Create a project from a sample or file" Option. For now, you will have to construct it manually.)

  1. Click New project on the home page or on your Projects page.
  2. Create an empty project.
  3. On the New project screen, add a name. Make it short but descriptive.
  4. If appropriate for your use case, mark the project as sensitive. The project has a sensitive tag and project collaborators can't move data assets out of the project. You cannot change this setting after the project is created.
  5. Choose an existing object storage service instance or create a new one. Click Create. You can start adding resources to your project.

Along with the creation of a project, a bucket is created in your object storage instance. This bucket will look like [PROJECT_NAME]-donotdelete.... You can use this bucket throughout this project; however, we recommend creating a separate bucket in which to store the dataset, the train/test split, et cetera.


🪣 See how you can setup your own Bucket in COS
  1. Navigate to your COS as explained in Step 3 above.
  2. Click on Buckets, then create a bucket.
  3. Click "Customise Bucket".
  4. Name the bucket.
  5. Click Create.

Now download the dataset (german_credit_data_biased_training.csv) and place it in the bucket you chose to use for the rest of this tutorial.

1.2. Creating Deployment Spaces

For IBM Watson Machine Learning, we will need three spaces:

  1. MLOps_dev : Dev space to deploy and test your models before they are pushed to pre-prod.
  2. MLOps_preprod : Pre-prod space to deploy, test and validate your models. The validator uses this environment before giving the go-ahead to push the models to production.
  3. MLOps_prod : Production space to deploy your validated models and monitor them.

1.3. Preparing the Notebooks

In this section, we will first set up the custom Python environments, collect the necessary credentials, upload the notebooks, and modify them. The pre-defined environments (henceforth called software configurations) do not contain all the Python packages we require. Therefore we will create custom software configurations prior to adding the notebooks.

Python environment customisations

Some of the notebooks require quite a few dependencies, which should not be manually installed via pip in each notebook every time. To avoid doing that, we will create software configurations.


⚠️ Click here if you do not know how to customize environments in Watson Studio
  1. Navigate to your Project overview, select the "Manage" tab and select "Environments" in the left-hand menu. Here, check that no runtime is active for the environment template that you want to change. If a runtime is active, you must stop it before you can change the template.
software_config-create-button
  2. Under Templates click New template and give it a name (for the pipeline preferably one of those described below), specify a hardware configuration (we recommend 2 vCPU and 8GB RAM for this project, but you can scale up or down depending on your task). When you are done click Create.
software_config-create-window
  3. You can now create a software customization and specify the libraries to add to the standard packages that are available by default.

(For more details, check out Adding a customization in the Documentation)


  • Use Python 3.9
  • Modify the pip part of the Python environment customisation script below:
# Modify the following content to add a software customization to an environment.
# To remove an existing customization, delete the entire content and click Apply.
# The customizations must follow the format of a conda environment yml file.

# Add conda channels below defaults, indented by two spaces and a hyphen.
channels:
  - defaults

# To add packages through conda or pip, remove the # on the following line.
dependencies:

# Add conda packages here, indented by two spaces and a hyphen.
# Remove the # on the following line and replace sample package name with your package name:
#  - a_conda_package=1.0

# Add pip packages here, indented by four spaces and a hyphen.
# Remove the # on the following lines and replace sample package name with your package name.
  - pip:
    [ADD CUSTOMISATION PACKAGES HERE]

Environments used in this asset:

Custom_python environment

  - pip:
    - tensorflow-data-validation
    - ibm_watson_studio_pipelines

pipeline_custom environment

  - pip:
    - ibm_watson_studio_pipelines
    - ibm-aigov-facts-client

openscale environment

  - pip:
    - ibm_cloud_sdk_core
    - ibm_watson_openscale
    - ibm_watson_studio_pipelines
    - ibm-aigov-facts-client
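
To avoid retyping the three package lists by hand, the customisation text can be generated. The following is a minimal sketch, assuming only the package names listed above; the helper `render_customization` and the `ENVIRONMENTS` dictionary are hypothetical names that are not part of this repository, and the output still has to be pasted into the customization editor manually.

```python
# Package lists copied from the three environments described in this README.
ENVIRONMENTS = {
    "Custom_python": [
        "tensorflow-data-validation",
        "ibm_watson_studio_pipelines",
    ],
    "pipeline_custom": [
        "ibm_watson_studio_pipelines",
        "ibm-aigov-facts-client",
    ],
    "openscale": [
        "ibm_cloud_sdk_core",
        "ibm_watson_openscale",
        "ibm_watson_studio_pipelines",
        "ibm-aigov-facts-client",
    ],
}


def render_customization(pip_packages):
    """Return customization text in conda environment yml format."""
    lines = ["channels:", "  - defaults", "dependencies:", "  - pip:"]
    # pip packages are indented by four spaces and a hyphen.
    lines += [f"    - {name}" for name in pip_packages]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    for env_name, packages in ENVIRONMENTS.items():
        print(f"# {env_name}")
        print(render_customization(packages))
```

Keeping the lists in one place makes it easier to spot that all pipeline environments share `ibm_watson_studio_pipelines`.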

Required Credentials

Before you run a notebook you need to obtain the following credentials and add the COS credentials to the beginning of each notebook. The Cloud API key must not be added to the notebooks since it is passed through the pipeline later.

a) The basic requirement is to get your IBM Cloud API Key (CLOUD_API_KEY) for all the pipelines.


❓ Where can I create/generate an API Key?
  1. Navigate to https://cloud.ibm.com
  2. (On the top right) Select Manage > Access (IAM).
  3. Click on API keys and create a new API key.
  4. Name the API key and copy or download it.

b) Secondly you will need the following IBM Cloud Object Storage (COS) related variables, which will allow the notebooks to interact with your COS Instance.

The variables are:

Universal

Project Bucket (auto-generated e.g. "mlops-donotdelete-pr-qxxcecxi1d")

MLOps Bucket (e.g. "mlops-asset")


❓ Where can I find these credentials for Cloud Object Storage?
  1. Go to cloud.ibm.com and select the account from the drop down.
  2. Go to Resource list by either clicking on the left hand side button or https://cloud.ibm.com/resources.
  3. Go to Storage and select the Cloud Object Storage instance that you want to use.
  4. Select "Service Credentials" and click "New Credential".
  5. Name the credential and hit Add.
  6. Go to the saved credential and click to reveal it. You can use these values to fill the variables.

You will need to define those variables at the top level of each notebook. Here is an example:

## PROJECT COS 
AUTH_ENDPOINT = "https://iam.cloud.ibm.com/oidc/token"
ENDPOINT_URL = "https://s3.private.us.cloud-object-storage.appdomain.cloud"
API_KEY_COS = "xyz"
BUCKET_PROJECT_COS = "mlops-donotdelete-pr-qxxcecxi1dtw94"


##MLOPS COS
ENDPOINT_URL_MLOPS = "https://s3.jp-tok.cloud-object-storage.appdomain.cloud"
API_KEY_MLOPS = "xyz"
CRN_MLOPS = "xyz"
BUCKET_MLOPS  = "mlops-asset"

Alternatively, to make things easier, you may set them as Global Pipeline Parameters. This allows you to e.g. switch the COS Bucket you are using without having to edit multiple notebooks. Instead, you only have to edit the parameter. This feature becomes especially useful when working with multiple pipelines later on.

The parameter strings should look like the example below in order for the notebooks to extract the correct values. Prepare one for your manually created Bucket and one for the Bucket attached to the project space.

{"API_KEY": "abc", "CRN": null, "AUTH_ENDPOINT": "https://iam.cloud.ibm.com/oidc/token", "ENDPOINT_URL": "https://s3.private.us.cloud-object-storage.appdomain.cloud", "BUCKET": "mlopsshowcaseautoai-donotdelete-pr-diasjjegeind"}

Now you are ready to start!

1.4. Adding the Notebooks to the Project Space

This section describes how you can add the notebooks that take care of data connection, validation and preparation, as well as model training and deployment.

When this asset was created from scratch, it was laid out for our CPDaaS solution. However, there are slight, but for this project relevant, differences between CPDaaS and the On-Prem solution, including the absence of a file system and a less refined Git integration in CPDaaS. We are currently weighing the pros and cons of two approaches: highlighting the points of this documentation where CPDaaS is limited (including a work-around), or offering a separate repository.

Adding the Notebooks (CPDaaS)

The Git integration within CPDaaS is not as advanced as that found in our On-Prem solution. As long as that is the case, the notebooks found in the repository must be manually added to the project space.

💻 Manually adding a notebook to the project space

Download the repository to your local machine and navigate to your project space. On the Assets tab, click New Asset.

mlops-new_notebook

In the tool selection, select Jupyter notebook editor. Upload the desired notebook. A name will automatically be assigned based on the filename. Make sure to select our previously added Software Configuration Custom_python as the environment to be used for the notebook. mlops-new_notebook_env

Repeat this procedure for all notebooks.


Note: As previously mentioned, CPDaaS does not come with a filesystem. The only efficient way to include utility scripts (see utility scripts) to e.g. handle catalog operations is to clone the repository manually from the notebook. This has been documented in each notebook. The corresponding cells are commented out at the top level of each notebook and must only be uncommented when operating on CPDaaS.

load_utils

Adding the Notebooks (On-Prem)

tbd

1.5. Creating Notebook Jobs from Notebooks

In order to move a Notebook from a project space to a deployment space, you have to create a Notebook Job. Notebook Jobs represent non-interactive executables of a snapshot of your notebook. When creating a Notebook Job you are offered several options, such as the choice of a Software Configuration (virtual-env), Notifications, and Scheduling. Most importantly, you can either pin a Notebook Job to a fixed version of the Notebook or have it always use the "Latest Version". With the latter, the Notebook Job is updated automatically each time a new version of the Notebook is saved.


⚠️ How to create a WS Notebook Job

In an earlier version of Watson Studio Pipelines, you were able to drag a Run notebook block into the canvas to use as pipeline node. This functionality has been replaced with the Run notebook job block.

Prior to selecting a Notebook within the Settings of the Run notebook job block, you have to create a notebook job from the Project Space View under the Assets tab.

notebook-job_create

For the MLOps workflow to work as intended, it is important that you select Latest as the notebook Version for your notebook job. Otherwise, the notebook job block in your pipeline will be set to a specific previous version of the notebook, so changes in your code would not affect your pipeline.

notebook-job_versioning

However, even when having selected Latest as the notebook version to use for your notebook job, you will have to select File > Save Version after performing code changes in your notebook. Only then will the notebook register the changes.

To check the log and debug a pipeline: while the pipeline is running, double click on the node that is currently running to open the Node Inspector, as shown in the image below. The log contains the notebook run status, the printed output, and the errors raised where the notebook fails.

Screenshot 2022-11-28 at 7 45 43 pm


2. Pipeline Setup

For this section you need to know how to create a WS Pipeline and how to correctly set up Notebook Jobs, which you will need in order to add Notebooks to a Pipeline. Check out the following toggleable sections to learn how to do that.


⚠️ How to create a WS Pipeline

In your CP4D project, click the blue button New Asset. Then find Pipelines.

Screenshot 2022-11-25 at 2 05 04 pm

Select Pipelines and give the pipeline a name.

Once the pipeline is created, you will see the pipeline edit menu and the palette on the left.

Screenshot 2022-11-25 at 2 10 16 pm

Expand the Run section and drag and drop the Run notebook block.

Double click the block to edit the node.

---

2.1. Development

Offline modeling

Offline modeling includes the usual data exploration and data science experiments. In this step, you can try different data manipulations, feature engineering approaches and machine learning models.

The output of this stage is code assets, for example Python scripts or Jupyter notebooks that can be used as blocks in the pipelines.

In this example, the output scripts are Python scripts in Jupyter notebooks. They are version-controlled with Git, as shown in this repository, and serve as components in the pre-prod pipeline.

You can experiment with an orchestrated dev pipeline, which would include

  • Data connection and validation

  • Data preparation

  • Model training and evaluation

  • Model deployment

  • Model validation (optional)

Below is a dev pipeline in Watson Studio Pipeline:

Screenshot 2022-11-25 at 3 18 33 am

Notebook 1: Connect and validate data

This notebook source code can be found in connect_and_validate_data.ipynb.

It does the following:

  • Load the training data german_credit_risk.csv from cloud object storage (COS)
  • Data Validation. It comprises the following steps:
    • Split the Data
    • Generate Training Stats on both Splits
    • Infer Schema on both Splits
    • Check for data anomalies
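
The validation steps above can be illustrated with a dependency-free sketch. This is not the notebook's implementation: the Custom_python environment lists tensorflow-data-validation for this purpose, whereas the code below swaps in a deliberately simple hand-written schema check. The column names `LoanAmount` and `Sex` and all helper names are illustrative assumptions.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into train and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def infer_schema(rows):
    """Map each column to 'number' or to the set of categories observed."""
    schema = {}
    for column in rows[0]:
        values = [row[column] for row in rows if row[column] is not None]
        if all(isinstance(value, (int, float)) for value in values):
            schema[column] = "number"
        else:
            schema[column] = set(values)
    return schema


def find_anomalies(rows, schema):
    """Return (row index, column, reason) for every value that breaks the schema."""
    anomalies = []
    for index, row in enumerate(rows):
        for column, expected in schema.items():
            value = row.get(column)
            if value is None:
                anomalies.append((index, column, "missing value"))
            elif expected == "number":
                if not isinstance(value, (int, float)):
                    anomalies.append((index, column, "expected a number"))
            elif value not in expected:
                anomalies.append((index, column, "unseen category"))
    return anomalies


# Infer the schema on the training split, then validate other data against it.
train = [{"LoanAmount": 250, "Sex": "male"}, {"LoanAmount": 4000, "Sex": "female"}]
schema = infer_schema(train)
suspicious = [{"LoanAmount": "n/a", "Sex": "unknown"}, {"LoanAmount": 100, "Sex": None}]
print(find_anomalies(suspicious, schema))
```

In a pipeline, a non-empty anomaly list would be the natural input for an output flag such as `anomaly_status`.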

Notebook 2: Data preparation

This notebook source code can be found in data_preparation.ipynb, which does the following:

  • Load train and test data from COS and split the X and y columns
  • Encode features
  • Save processed train and test data to COS

Notebook 3: Model training and evaluation

This notebook source code can be found in train_models.ipynb, which does the following:

  • Load train and test data from COS, split train to train and validation data
  • Load the pre-processing pipeline
  • Train the model
  • Save train and val loss to COS
  • Calculate AUC-ROC
  • Store the model in the project
  • Track the model runs and stages with AI Factsheets

Notebook 4: Model deployment

This notebook source code can be found in deploy_model.ipynb, which does the following:

  • Load the trained model from the model registry
  • Promote the model to a deployment space and deploy the model
  • Test the endpoint

By changing the input to this notebook, the model can be deployed to dev, pre-prod and prod spaces.

2.2. Pre-prod

Continuous integration

When the Jupyter notebooks have a change committed and a pull request is made, Jenkins will start the CI tests.

The source code is stored in the jenkins directory and the documentation can be viewed here

CI Test Notebooks

As with any other MLOps pipeline, you should rigorously check whether or not your current model meets all the requirements you defined. In order to test this, we added a folder containing a small repertoire of CI tests which you can find here.

It is the overarching idea that a Data Scientist works primarily with the Notebooks themselves and manually invokes the development pipeline in order to initially test their work. The updated Notebooks should only be committed and pushed to the repository if the development pipeline completes successfully.

Therefore we suggest that you use the CI test repertoire to the extent that you can. Add the tests you would like to have to the end of your development pipeline in a plug-and-play manner. You may of course want to edit those CI test notebooks to set certain thresholds or even write your own tests.

Examples:

  • Pipeline component integration test: run the pipeline in dev environment to check if it successfully runs.
  • deserialize_artifact.ipynb will download the model stored in your COS Bucket. It will be deserialized and loaded into memory which is tested by scoring a few rows of your test data. This test is thus ensuring successful serialization. You may extend this test by checking the size of the model in memory or the size of the serialized model in storage and set a threshold, in order for the pipeline to fail when your model exceeds a certain size.
  • model_convergence.ipynb will download the pickled training and validation loss data from your COS Bucket. It ensures that the training loss is continuously decreasing. You may extend this test by analysing training and validation loss to e.g. avoid serious underfitting or overfitting of the model.
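
The core of a convergence test can be sketched in a few lines. The following is a minimal illustration, not the content of model_convergence.ipynb: it assumes the pickled object is a dictionary with the keys `train` and `val` holding one loss value per iteration, and it replaces the COS download with an in-memory payload. The function names and the `max_gap` threshold are hypothetical.

```python
import pickle


def is_non_increasing(losses, tolerance=0.0):
    """True if no loss value exceeds its predecessor by more than the tolerance."""
    return all(later <= earlier + tolerance
               for earlier, later in zip(losses, losses[1:]))


def check_convergence(payload, max_gap=0.1):
    """Unpickle the loss history and fail on divergence or a large train/val gap."""
    history = pickle.loads(payload)
    if not is_non_increasing(history["train"]):
        raise ValueError("training loss is not decreasing")
    # A final validation loss far above the training loss hints at overfitting.
    gap = history["val"][-1] - history["train"][-1]
    if gap > max_gap:
        raise ValueError(f"train/validation gap {gap:.3f} exceeds {max_gap}")
    return gap


# Stand-in for the pickled loss data that the notebook downloads from COS.
payload = pickle.dumps({
    "train": [0.90, 0.60, 0.40, 0.25],
    "val": [0.95, 0.70, 0.50, 0.30],
})
gap = check_convergence(payload)
print(f"converged, final gap: {gap:.2f}")
```

Raising an exception makes the notebook job fail, which is what lets a pipeline stop at this test.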

Further recommended CI tests

  • Behaviour Tests

    • Invariance
    • Directionality
    • Minimum functionality
  • Adversarial Tests

    • Check to see if the model can be affected by direct adversarial attacks
  • Regression Tests

    • Check specific groups within the test set to ensure performance is retained in this group after retraining
  • Miscellaneous Tests

    • Test input data scheme
    • Test with unexpected input types (null / NaN)
    • Test output scheme is as expected
    • Test output errors are handled correctly
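
To make the behaviour tests more concrete, here is a minimal sketch of an invariance and a directionality test. It is written against a toy scoring function rather than the model trained in this repository; `toy_risk_score`, the feature names and the expected directions are all assumptions chosen for illustration.

```python
def toy_risk_score(applicant):
    """Stand-in model: larger and longer loans raise the risk score."""
    return 0.0001 * applicant["LoanAmount"] + 0.01 * applicant["LoanDuration"]


def invariance_test(predict, applicant, feature, alternatives, tolerance=1e-9):
    """The prediction should not change when only `feature` is altered."""
    baseline = predict(applicant)
    for value in alternatives:
        changed = {**applicant, feature: value}
        if abs(predict(changed) - baseline) > tolerance:
            return False
    return True


def directionality_test(predict, applicant, feature, increase):
    """Raising `feature` by `increase` should not lower the prediction."""
    changed = {**applicant, feature: applicant[feature] + increase}
    return predict(changed) >= predict(applicant)


applicant = {"LoanAmount": 2500, "LoanDuration": 24, "Sex": "female"}

# Invariance: a protected attribute should have no influence on the score.
print(invariance_test(toy_risk_score, applicant, "Sex", ["male", "female"]))

# Directionality: a larger loan should not look less risky.
print(directionality_test(toy_risk_score, applicant, "LoanAmount", 1000))
```

For a real model, `predict` would wrap the deployed endpoint or the deserialized pipeline, and the tests would run over a sample of rows instead of a single applicant.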

Continuous delivery - pipeline

After the CI tests have passed, the admin/data science lead merges the changes and Jenkins triggers the following pre-prod pipeline:

Screenshot 2022-11-25 at 3 22 00 am

  • Data Extraction and Data Validation

It runs the notebook connect_and_validate_data.ipynb with:

Environment

pipeline_custom

Input params

cloud_api_key, Select from pipeline parameter
training_file_name, String

Output params

anomaly_status, Bool
files_copied_in_cos, Bool

We define a pipeline parameter cloud_api_key to avoid having the API key hardcoded in the pipeline:

[TO DO: insert picture here]

This block is followed by a Val check condition:

Screenshot 2022-11-28 at 5 19 26 am

  • Data preparation

It runs the notebook data_preparation.ipynb

Environment

pipeline_custom

Input params

cloud_api_key, Select from pipeline parameter

Output params

data_prep_done, Bool

This block is followed by a prep check condition:

Screenshot 2022-11-28 at 5 20 54 am

  • Model Training and Model Evaluation

It runs the notebook train_models.ipynb

In WS Pipeline you can assign input to be the output from another node. To do this, select the folder icon next to environment variables:

Screenshot 2022-11-28 at 4 45 27 am

Then select the node and the output you need

Screenshot 2022-11-28 at 4 46 13 am

Environment

pipeline_custom

Input params

feature_pickle, String
apikey, Select from pipeline parameter
model_name, String
deployment_name, String

Output params

training_done, Bool
auc_roc, Float
model_name, String
deployment_name, String
model_id, String
project_id, String

This block is followed by a Train check condition:

Screenshot 2022-11-28 at 5 21 33 am

  • Model Deployment (pre-prod space)

It runs the notebook deploy_model.ipynb

By changing the input parameter space_id, we can set the model to deploy to the pre-prod deployment space in CP4D.

Environment

pipeline_custom

Input params

model_name, Select model_name from Train the Model node
deployment_name, Select deployment_name from Train the Model node
cloud_api_key, Select from pipeline parameter
model_id, Select model_id from Train the Model node
project_id, Select project_id from Train the Model node
space_id, String

Output params

deployment_status, Bool
deployment_id, String
model_id, String
space_id, String

This block is followed by a deployed? condition:

Screenshot 2022-11-28 at 5 22 25 am
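
The check conditions described so far (Val check, prep check, Train check and deployed?) act as gates between the notebook nodes. The following plain-Python sketch mirrors that gating logic outside of Watson Pipelines. The expected values are assumptions: in particular, it assumes that `anomaly_status` is False when no anomalies were found, which should be verified against the notebook. `first_failed_gate` is a hypothetical helper.

```python
# (condition name, notebook output, value required to continue)
GATES = [
    ("Val check", "anomaly_status", False),  # assumption: False means no anomalies
    ("prep check", "data_prep_done", True),
    ("Train check", "training_done", True),
    ("deployed?", "deployment_status", True),
]


def first_failed_gate(outputs, gates=GATES):
    """Return the name of the first condition that blocks the run, or None."""
    for condition, output_name, expected in gates:
        # A missing output counts as a failure of that gate.
        if outputs.get(output_name) is not expected:
            return condition
    return None


run_outputs = {
    "anomaly_status": False,
    "data_prep_done": True,
    "training_done": False,
}
print(first_failed_gate(run_outputs))
```

Here the run would stop at the Train check, so the deployment node would never start.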

  • Model Monitoring and Model Validation

It runs the notebook monitor_models.ipynb:

Notebook 5: Model monitoring

This notebook source code can be found in monitor_models.ipynb, which does the following:

  • Create a subscription for the model deployment in OpenScale
  • Enable quality, fairness, drift, explainability, and MRM in OpenScale
  • Evaluate the model in OpenScale

The trained model is saved to the model registry.

After the pipeline and the model prediction service have been verified in the pre-prod space, we can manually deploy the model to the production environment.

This node has:

Environment

openscale

Input params

data_mart_id, String
model_name, Select model_name from Train the Model node
deployment_name, Select deployment_name from Train the Model node
cloud_api_key, Select from pipeline parameter
deployment_id, Select deployment_id from Deploy Model - Preprod node
model_id, Select model_id from Deploy Model - Preprod node
space_id, Select space_id from Deploy Model - Preprod node
service_provider_id, String
training_data_reference_file, String

Output params

None

Once the model is validated, and approved by the model validator, the model can be deployed to the prod environment.

2.3. Prod

In this example we reuse the deploy_model.ipynb and monitor_models.ipynb to create the deployment job.

Screenshot 2022-12-06 at 3 35 18 am

  • Deployment Checks

This step checks whether the model is approved for production in OpenScale.

It runs the notebook Checks for Model Production.ipynb

  • Model deployment (prod space)

In this step, the validated model from the pre-prod is deployed in the production deployment space.

It runs the notebook deploy_model.ipynb

The input parameter space_id is the prod deployment space ID.

  • Model monitoring

It runs the notebook monitor_models.ipynb

The OpenScale API does not allow test data upload for Production subscriptions (as you would upload for Pre-Production subscriptions). Therefore we save the data into the Payload Logging and Feedback tables in OpenScale and trigger on-demand runs for the Fairness, Quality and Drift monitors and MRM.

  • Model retraining

Model retraining is governed by the underlying use case. The following list is by no means exhaustive. Some of the common retraining methods are:

  1. Event based: When a business-defined event occurs that explicitly impacts the objective of the model, a retraining is triggered.
  2. Schedule based: Some models, such as forecasting models, always rely on the latest data. This kind of retraining is schedule driven.
  3. Metric based: When a defined metric like model quality, bias or even data drift falls below a threshold, a retraining is triggered.

We have implemented the 3rd type of retraining as we are monitoring the data drift and augmenting the training data with the drifted records.
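
The metric-based trigger boils down to comparing the latest monitor values against thresholds. Here is a minimal sketch of that decision, independent of the OpenScale API; the metric names, the values and the thresholds are made up for illustration, and every metric is treated as a score where higher is better.

```python
def breached_metrics(latest, minimums):
    """Return the names of all metrics that fell below their minimum value."""
    return sorted(
        name
        for name, minimum in minimums.items()
        if name in latest and latest[name] < minimum
    )


def should_retrain(latest, minimums):
    """Retraining is triggered as soon as any monitored metric is breached."""
    return bool(breached_metrics(latest, minimums))


# Hypothetical thresholds and monitor results.
minimums = {"quality": 0.80, "fairness": 0.90}
latest = {"quality": 0.72, "fairness": 0.95}

print(breached_metrics(latest, minimums))
print(should_retrain(latest, minimums))
```

Returning the breached metric names, and not only a boolean, helps when deciding between restarting the pre-prod pipeline and going back to offline modeling.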

In OpenScale, the flow of the retraining looks like:

  • OpenScale model monitoring alerts are triggered and an email is received: manually trigger the retrain job to update the data and restart the pre-prod pipeline.
  • OpenScale model monitoring alerts are triggered and an email is received: after some investigation, you decide that you want to try different models or features, and therefore restart from the offline modeling stage.

2.4. AI Factsheets

In this project we also demonstrate how to put the model into a model registry and track it with AI Factsheets.

You can instantiate a Factsheets client (as shown in the train_models notebook) with

facts_client = AIGovFactsClient(api_key=apikey, experiment_name="CreditRiskModel", container_type="project", container_id=project_id, set_as_current_experiment=True)

and log the models in Factsheets with the save_log_facts() function in the notebook.

After the model has been deployed to the pre-prod and prod environments and evaluated by OpenScale, the deployments can be seen in the model entry (in Watson Knowledge Catalog):

Screenshot 2022-11-28 at 4 17 15 am
