Integration of Kedro with MLflow via Kedro DataSets and Hooks (callbacks)

How to use MLflow from Kedro projects

A Kedro DataSet and Hooks (callbacks) are provided so that you can use MLflow without adding any MLflow-related code to your node (task) functions.

  • Kedro DataSet

    • pipelinex.MLflowDataSet (https://github.com/Minyus/pipelinex/blob/master/src/pipelinex/extras/datasets/mlflow/mlflow_dataset.py)

      A Kedro DataSet that saves data to and loads data from MLflow. Set the dataset argument as follows.

      • If dataset is set to a Kedro DataSet object or a dictionary, it will be saved/loaded as an MLflow artifact.

      • If dataset is set to one of the strings {“json”, “csv”, “xls”, “parquet”, “png”, “jpg”, “jpeg”, “img”, “pkl”, “txt”, “yml”, “yaml”}, a Kedro DataSet object will be created with the string as the file extension, and the data will be saved/loaded as an MLflow artifact. Under the hood, the following Kedro DataSet classes are used (inspired by Kedro Wings).

        dataset_dicts = {
          "json": {"type": "json.JSONDataSet"},
          "csv": {"type": "pandas.CSVDataSet"},
          "xls": {"type": "pandas.ExcelDataSet"},
          "parquet": {"type": "pandas.ParquetDataSet"},
          "pkl": {"type": "pickle.PickleDataSet"},
          "png": {"type": "pillow.ImageDataSet"},
          "jpg": {"type": "pillow.ImageDataSet"},
          "jpeg": {"type": "pillow.ImageDataSet"},
          "img": {"type": "pillow.ImageDataSet"},
          "txt": {"type": "text.TextDataSet"},
          "yaml": {"type": "yaml.YAMLDataSet"},
          "yml": {"type": "yaml.YAMLDataSet"},
        }
        
      • If dataset is set to the string “p”, the value will be saved/loaded as an MLflow parameter (string).

      • If dataset is set to the string “m”, the value will be saved/loaded as an MLflow metric (numeric).

      • If dataset is set to None (default), MLflow will not be used.

      For all the available options, see the API documentation.
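
      A minimal sketch of constructing MLflowDataSet objects for each of the modes described above. These objects are normally declared in catalog.yml or catalog.py as shown later; the variable names here are illustrative, not part of the API.

        from pipelinex import MLflowDataSet

        artifact_ds = MLflowDataSet(dataset="csv")  # saved/loaded as an MLflow artifact via pandas.CSVDataSet
        param_ds = MLflowDataSet(dataset="p")  # saved/loaded as an MLflow parameter (string)
        metric_ds = MLflowDataSet(dataset="m")  # saved/loaded as an MLflow metric (numeric)
        default_ds = MLflowDataSet(dataset=None)  # default: MLflow is not used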

  • Kedro Hooks

    For all the available options, see the API documentation.

    MLflow-ready Kedro projects can be generated using the Kedro starters (Cookiecutter templates), which include the following example config:

    # catalog.yml
    
    # Write a pickle file & upload to MLflow
    model:
      type: pipelinex.MLflowDataSet
      dataset: pkl
    
    # Write a csv file & upload to MLflow
    pred_df:
      type: pipelinex.MLflowDataSet
      dataset: csv
    
    # Write an MLflow metric
    score:
      type: pipelinex.MLflowDataSet
      dataset: m
    
    # catalog.py (alternative to catalog.yml)

    from pipelinex import MLflowDataSet

    catalog_dict = {
      "model": MLflowDataSet(dataset="pkl"),  # Write a pickle file & upload to MLflow
      "pred_df": MLflowDataSet(dataset="csv"),  # Write a csv file & upload to MLflow
      "score": MLflowDataSet(dataset="m"),  # Write an MLflow metric
    }
    
    # mlflow_config.py 
    
    import pipelinex
    
    mlflow_hooks = (
      pipelinex.MLflowBasicLoggerHook(
          enable_mlflow=True,  # Enable configuring and logging to MLflow
          uri="sqlite:///mlruns/sqlite.db",
          experiment_name="experiment_001",
          artifact_location="./mlruns/experiment_001",
          offset_hours=0,  # Specify the offset hour (e.g. 0 for UTC/GMT +00:00) to log in MLflow
      ),  # Configure and log duration time for the pipeline
      pipelinex.MLflowArtifactsLoggerHook(
          enable_mlflow=True,  # Enable logging to MLflow
          filepaths_before_pipeline_run=[
              "conf/base/parameters.yml"
          ],  # Optionally specify the file paths to log before pipeline is run
          filepaths_after_pipeline_run=[
              "data/06_models/model.pkl"
          ],  # Optionally specify the file paths to log after pipeline is run
      ),  # Log artifacts of specified file paths and dataset names
      pipelinex.MLflowDataSetsLoggerHook(
          enable_mlflow=True,  # Enable logging to MLflow
      ),  # Log output datasets of (list of) float, int, and str classes
      pipelinex.MLflowTimeLoggerHook(
          enable_mlflow=True,  # Enable logging to MLflow
      ),  # Log duration time to run each node (task)
      pipelinex.AddTransformersHook(
          transformers=[
              pipelinex.MLflowIOTimeLoggerTransformer(
                  enable_mlflow=True
              )  # Log duration time to load and save each dataset
          ],
      ),  # Add transformers
    )
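
    The hook instances above still need to be registered with Kedro. A minimal sketch, assuming Kedro 0.17.0 or later (which registers hooks via the HOOKS variable in settings.py); the package name my_project is a placeholder.

    # src/my_project/settings.py

    from my_project.mlflow_config import mlflow_hooks  # "my_project" is a placeholder package name

    HOOKS = mlflow_hooks  # Kedro registers these hook instances at run time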
    

Logged metrics shown in MLflow's UI

Gantt chart for execution time, generated using Plotly, shown in MLflow's UI

Comparison with the kedro-mlflow package

Both PipelineX and kedro-mlflow integrate MLflow with Kedro. Here is a comparison.

  • Features supported by both PipelineX and kedro-mlflow

    • Kedro DataSets and Hooks to log (save/upload) artifacts, parameters, and metrics to MLflow.

    • Truncation of MLflow parameter values to 250 characters to avoid errors due to MLflow's parameter length limit.

  • Features supported only by PipelineX

    • [Time logging] Option to log execution time for each task (Kedro node) as MLflow metrics

    • [Gantt logging] Option to log a Gantt chart HTML file that visualizes execution time using Plotly as an MLflow artifact (inspired by Apache Airflow)

    • [Automatic backend Kedro DataSets for common artifacts] Option to specify a common file extension ({“json”, “csv”, “xls”, “parquet”, “png”, “jpg”, “jpeg”, “img”, “pkl”, “txt”, “yml”, “yaml”}) so that the Kedro DataSet object is created behind the scenes, instead of manually specifying a Kedro DataSet including a filepath in the catalog (inspired by Kedro Wings).

    • [Automatic logging for MLflow parameters and metrics] Option to log MemoryDataSet objects (each input/output of a Kedro node's Python function that is not listed in the catalog) as MLflow parameters or metrics, instead of manually specifying a Kedro DataSet in the catalog. If the data type is float or int, the MemoryDataSet object is logged as an MLflow metric; if it is str or a data structure (list, tuple, dict, or set), it is logged as an MLflow parameter (after being stringified). See the sketch at the end of this section.

    • [Direct artifact logging] Option to specify the paths of any data to log as MLflow artifacts after the Kedro pipeline runs, without using a Kedro DataSet. This is useful for saving local files such as info/warning/error log files and intermediate model weights saved by machine learning packages such as PyTorch and TensorFlow.

    • [Environment variable logging] Option to log environment variables

    • [Downloading] Option to download MLflow artifacts, parameters, and metrics from an existing MLflow experiment run using the Kedro DataSet

    • [Up to date] Support for Kedro 0.17.0 (released in Dec 2020) or later

  • Features provided only by kedro-mlflow

    • A wrapper for MLflow’s log_model

    • Configure MLflow logging in a YAML file

    • Option to use an MLflow tag or raise an error if MLflow parameter values exceed 250 characters.
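
As an illustration of PipelineX's automatic parameter/metric logging described above, a hypothetical sketch of a node function whose outputs are not listed in the catalog, so Kedro wraps them in MemoryDataSet objects; the function and variable names are illustrative:

    # Hypothetical node function; outputs not listed in the catalog become MemoryDataSet objects.
    def evaluate_model(pred_df):
        score = 0.95  # float -> logged as an MLflow metric by MLflowDataSetsLoggerHook
        note = "baseline run"  # str -> logged as an MLflow parameter (stringified)
        return score, note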