Integration of Kedro with MLflow via Kedro DataSets and Hooks (callbacks)

How to use MLflow from Kedro projects

A Kedro DataSet and Hooks (callbacks) are provided so that you can use MLflow without adding any MLflow-related code to your node (task) functions.

  • Kedro DataSet

    • pipelinex.MLflowDataSet (https://github.com/Minyus/pipelinex/blob/master/src/pipelinex/extras/datasets/mlflow/mlflow_dataset.py)

      A Kedro DataSet that saves data to and loads data from MLflow. Set the dataset argument as follows.

      • If dataset is set to a Kedro DataSet object or a dictionary, it will be saved/loaded as an MLflow artifact.

      • If dataset is set to one of the strings {“json”, “csv”, “xls”, “parquet”, “png”, “jpg”, “jpeg”, “img”, “pkl”, “txt”, “yml”, “yaml”}, a Kedro DataSet object will be created with the string as the file extension, and the data will be saved/loaded as an MLflow artifact. Under the hood, the following Kedro DataSet classes are used (inspired by Kedro Wings).

        dataset_dicts = {
          "json": {"type": "json.JSONDataSet"},
          "csv": {"type": "pandas.CSVDataSet"},
          "xls": {"type": "pandas.ExcelDataSet"},
          "parquet": {"type": "pandas.ParquetDataSet"},
          "pkl": {"type": "pickle.PickleDataSet"},
          "png": {"type": "pillow.ImageDataSet"},
          "jpg": {"type": "pillow.ImageDataSet"},
          "jpeg": {"type": "pillow.ImageDataSet"},
          "img": {"type": "pillow.ImageDataSet"},
          "txt": {"type": "text.TextDataSet"},
          "yaml": {"type": "yaml.YAMLDataSet"},
          "yml": {"type": "yaml.YAMLDataSet"},
        }
        
      • If dataset is set to the string “p”, the value will be saved/loaded as an MLflow parameter (string).

      • If dataset is set to the string “m”, the value will be saved/loaded as an MLflow metric (numeric).

      • If dataset is set to None (default), MLflow will not be used.

      For all the available options, see the API documentation.
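
      A minimal sketch of constructing MLflowDataSet objects for each of the modes described above. These objects are normally declared in catalog.yml or catalog.py as shown later; the variable names here are illustrative, not part of the API.

        from pipelinex import MLflowDataSet

        artifact_ds = MLflowDataSet(dataset="csv")  # saved/loaded as an MLflow artifact via pandas.CSVDataSet
        param_ds = MLflowDataSet(dataset="p")  # saved/loaded as an MLflow parameter (string)
        metric_ds = MLflowDataSet(dataset="m")  # saved/loaded as an MLflow metric (numeric)
        default_ds = MLflowDataSet(dataset=None)  # default: MLflow is not used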

  • Kedro Hooks

    For all the available options, see the API documentation.

    MLflow-ready Kedro projects can be generated using the Kedro starters (Cookiecutter templates), which include the following example config:

    # catalog.yml
    
    # Write a pickle file & upload to MLflow
    model:
      type: pipelinex.MLflowDataSet
      dataset: pkl
    
    # Write a csv file & upload to MLflow
    pred_df:
      type: pipelinex.MLflowDataSet
      dataset: csv
    
    # Write an MLflow metric
    score:
      type: pipelinex.MLflowDataSet
      dataset: m
    
    # catalog.py (alternative to catalog.yml)

    from pipelinex import MLflowDataSet

    catalog_dict = {
      "model": MLflowDataSet(dataset="pkl"),  # Write a pickle file & upload to MLflow
      "pred_df": MLflowDataSet(dataset="csv"),  # Write a csv file & upload to MLflow
      "score": MLflowDataSet(dataset="m"),  # Write an MLflow metric
    }
    
    # mlflow_config.py 
    
    import pipelinex
    
    mlflow_hooks = (
      pipelinex.MLflowBasicLoggerHook(
          enable_mlflow=True,  # Enable configuring and logging to MLflow
          uri="sqlite:///mlruns/sqlite.db",
          experiment_name="experiment_001",
          artifact_location="./mlruns/experiment_001",
          offset_hours=0,  # Specify the offset hour (e.g. 0 for UTC/GMT +00:00) to log in MLflow
      ),  # Configure and log duration time for the pipeline
      pipelinex.MLflowArtifactsLoggerHook(
          enable_mlflow=True,  # Enable logging to MLflow
          filepaths_before_pipeline_run=[
              "conf/base/parameters.yml"
          ],  # Optionally specify the file paths to log before pipeline is run
          filepaths_after_pipeline_run=[
              "data/06_models/model.pkl"
          ],  # Optionally specify the file paths to log after pipeline is run
      ),  # Log artifacts of specified file paths and dataset names
      pipelinex.MLflowDataSetsLoggerHook(
          enable_mlflow=True,  # Enable logging to MLflow
      ),  # Log output datasets of (list of) float, int, and str classes
      pipelinex.MLflowTimeLoggerHook(
          enable_mlflow=True,  # Enable logging to MLflow
      ),  # Log duration time to run each node (task)
      pipelinex.AddTransformersHook(
          transformers=[
              pipelinex.MLflowIOTimeLoggerTransformer(
                  enable_mlflow=True
              )  # Log duration time to load and save each dataset
          ],
      ),  # Add transformers
    )
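
    The hook instances above still need to be registered with Kedro. A minimal sketch, assuming Kedro 0.17.0 or later (which registers hooks via the HOOKS variable in settings.py); the package name my_project is a placeholder.

    # src/my_project/settings.py

    from my_project.mlflow_config import mlflow_hooks  # "my_project" is a placeholder package name

    HOOKS = mlflow_hooks  # Kedro registers these hook instances at run time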
    

Logged metrics shown in MLflow's UI

Gantt chart for execution time, generated using Plotly, shown in MLflow's UI

Comparison with the kedro-mlflow package

Both PipelineX and kedro-mlflow integrate MLflow with Kedro. Here is a comparison.

  • Features supported by both PipelineX and kedro-mlflow

    • Kedro DataSets and Hooks to log (save/upload) artifacts, parameters, and metrics to MLflow.

    • Truncation of MLflow parameter values to 250 characters to avoid errors due to MLflow's parameter length limit.

  • Features supported only by PipelineX

    • [Time logging] Option to log execution time for each task (Kedro node) as MLflow metrics

    • [Gantt logging] Option to log a Gantt chart HTML file that visualizes execution time using Plotly as an MLflow artifact (inspired by Apache Airflow)

    • [Automatic backend Kedro DataSets for common artifacts] Option to specify a common file extension ({“json”, “csv”, “xls”, “parquet”, “png”, “jpg”, “jpeg”, “img”, “pkl”, “txt”, “yml”, “yaml”}) so that the Kedro DataSet object is created behind the scenes, instead of manually specifying a Kedro DataSet including a filepath in the catalog (inspired by Kedro Wings).

    • [Automatic logging for MLflow parameters and metrics] Option to log MemoryDataSet objects (each input/output of a Kedro node's Python function that is not listed in the catalog) as MLflow parameters or metrics, instead of manually specifying a Kedro DataSet in the catalog. If the data type is float or int, the MemoryDataSet object is logged as an MLflow metric; if it is str or a data structure (list, tuple, dict, or set), it is logged as an MLflow parameter (after being stringified). See the sketch at the end of this section.

    • [Direct artifact logging] Option to specify the paths of any data to log as MLflow artifacts after the Kedro pipeline runs, without using a Kedro DataSet. This is useful for saving local files such as info/warning/error log files and intermediate model weights saved by machine learning packages such as PyTorch and TensorFlow.

    • [Environment variable logging] Option to log environment variables

    • [Downloading] Option to download MLflow artifacts, parameters, and metrics from an existing MLflow experiment run using the Kedro DataSet

    • [Up to date] Support for Kedro 0.17.0 (released in Dec 2020) or later

  • Features provided only by kedro-mlflow

    • A wrapper for MLflow’s log_model

    • Configure MLflow logging in a YAML file

    • Option to use an MLflow tag or raise an error if MLflow parameter values exceed 250 characters.
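
As an illustration of PipelineX's automatic parameter/metric logging described above, a hypothetical sketch of a node function whose outputs are not listed in the catalog, so Kedro wraps them in MemoryDataSet objects; the function and variable names are illustrative:

    # Hypothetical node function; outputs not listed in the catalog become MemoryDataSet objects.
    def evaluate_model(pred_df):
        score = 0.95  # float -> logged as an MLflow metric by MLflowDataSetsLoggerHook
        note = "baseline run"  # str -> logged as an MLflow parameter (stringified)
        return score, note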