MLflow-on-Kedro: Kedro plugin for MLflow users

pipelinex.mlflow_on_kedro API document

How to use MLflow from Kedro projects

Kedro DataSet and Hooks (callbacks) are provided so that you can use MLflow without adding any MLflow-related code to your node (task) functions.

  • Kedro DataSet

    • pipelinex.MLflowDataSet (https://github.com/Minyus/pipelinex/blob/master/src/pipelinex/extras/datasets/mlflow/mlflow_dataset.py)

      A Kedro DataSet that saves data to or loads data from MLflow depending on the dataset argument, as follows.

      • If set to “p”, the value will be saved/loaded as an MLflow parameter (string).

      • If set to “m”, the value will be saved/loaded as an MLflow metric (numeric).

      • If set to “a”, the value will be saved/loaded based on the data type.

        • If the data type is either {float, int}, the value will be saved/loaded as an MLflow metric.

        • If the data type is either {str, list, tuple, set}, the value will be saved/loaded as an MLflow parameter.

        • If the data type is dict, the value will be flattened with dot (“.”) as the separator and then saved/loaded as either an MLflow metric or parameter based on each leaf value's data type as explained above (see the sketch after this list).

      • If set to one of {“json”, “csv”, “xls”, “parquet”, “png”, “jpg”, “jpeg”, “img”, “pkl”, “txt”, “yml”, “yaml”}, a backend dataset instance will be created accordingly to save/load the value as an MLflow artifact.

      • If set to a Kedro DataSet object or a dictionary, it will be used as the backend dataset to save/load the value as an MLflow artifact.

      • If set to None (default), MLflow logging will be skipped.
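
      The following is a minimal sketch (illustrative only, not PipelineX's internal implementation) of the dot-separated dict flattening described above:

      # Minimal sketch of the flattening behavior (not PipelineX's internal code)
      def flatten(d, parent_key=""):
          items = {}
          for k, v in d.items():
              key = f"{parent_key}.{k}" if parent_key else k
              if isinstance(v, dict):
                  items.update(flatten(v, key))  # recurse into nested dicts
              else:
                  items[key] = v
          return items

      flatten({"train": {"batch_size": 32}, "optimizer": "adam"})
      # => {"train.batch_size": 32, "optimizer": "adam"}
      # 32 (int) would be logged as an MLflow metric,
      # "adam" (str) as an MLflow parameter.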

      For all the available options, see the API document

  • Kedro Hooks

    For all the available options, see the API document

    MLflow-ready Kedro projects can be generated using the Kedro starters (Cookiecutter templates), which include the following example config:

    # catalog.yml
    
    # Write a pickle file & upload to MLflow
    model:
      type: pipelinex.MLflowDataSet
      dataset: pkl
    
    # Write a csv file & upload to MLflow
    pred_df: 
      type: pipelinex.MLflowDataSet
      dataset: csv
    
    # Write an MLflow metric
    score:
      type: pipelinex.MLflowDataSet
      dataset: m  
    
    # catalog.py (alternative to catalog.yml)
    
    from pipelinex import MLflowDataSet
    
    catalog_dict = {
        "model": MLflowDataSet(dataset="pkl"),  # Write a pickle file & upload to MLflow
        "pred_df": MLflowDataSet(dataset="csv"),  # Write a csv file & upload to MLflow
        "score": MLflowDataSet(dataset="m"),  # Write an MLflow metric
    }
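
    # pipeline.py (illustrative sketch; not part of the starter config)
    #
    # Shows that node functions need no MLflow-related code: outputs named
    # "model", "pred_df", and "score" are saved through the MLflowDataSet
    # catalog entries above, which upload to MLflow on save.
    # The node body and the "train_df" input are hypothetical stand-ins.
    
    from kedro.pipeline import Pipeline, node
    
    def train_model(train_df):
        model = {"weights": [0.1, 0.2]}  # stand-in for a real model -> pkl artifact
        pred_df = train_df               # stand-in for predictions -> csv artifact
        score = 0.95                     # float -> logged as MLflow metric
        return model, pred_df, score
    
    pipeline = Pipeline(
        [node(train_model, inputs="train_df", outputs=["model", "pred_df", "score"])]
    )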
    
    # mlflow_config.py
    
    import pipelinex
    
    mlflow_hooks = (
        # Initialize MLflow logging with the specified tracking URI,
        # experiment name, and artifact location
        pipelinex.MLflowBasicLoggerHook(
            uri="sqlite:///mlruns/sqlite.db",
            experiment_name="experiment_001",
            artifact_location="./mlruns/experiment_001",
            offset_hours=0,
        ),
        # Log datasets via the catalog; auto=True also logs each dataset
        # not listed in the catalog as an MLflow parameter or metric
        # based on its data type
        pipelinex.MLflowCatalogLoggerHook(
            auto=True,
        ),
        # Log the specified local files as MLflow artifacts before/after
        # the pipeline run
        pipelinex.MLflowArtifactsLoggerHook(
            filepaths_before_pipeline_run=["conf/base/parameters.yml"],
            filepaths_after_pipeline_run=[
                "logs/info.log",
                "logs/errors.log",
            ],
        ),
        # Log environment variables as MLflow parameters and/or metrics
        pipelinex.MLflowEnvVarsLoggerHook(
            param_env_vars=["HOSTNAME"],
            metric_env_vars=[],
        ),
        # Log execution time of each node as MLflow metrics and a Plotly
        # Gantt chart HTML file as an MLflow artifact
        pipelinex.MLflowTimeLoggerHook(),
        # Log data loading/saving (IO) time for each dataset
        pipelinex.AddTransformersHook(
            transformers=[pipelinex.MLflowIOTimeLoggerTransformer()],
        ),
    )
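
    To activate these hooks, register them with Kedro. A minimal sketch, assuming a Kedro 0.17+ project layout (the import path is an assumption; adjust it to wherever mlflow_config.py lives in your package):

    # src/my_project/settings.py
    
    from my_project.mlflow_config import mlflow_hooks  # replace "my_project" with your package name
    
    HOOKS = mlflow_hooks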
    

Logged metrics shown in MLflow's UI

Gantt chart for execution time, generated using Plotly, shown in MLflow's UI

Comparison with kedro-mlflow package

Both PipelineX’s MLflow-on-Kedro and kedro-mlflow integrate MLflow with Kedro. Here is a comparison of the two.

  • Features supported by both PipelineX and kedro-mlflow

    • Kedro DataSets and Hooks to log (save/upload) artifacts, parameters, and metrics to MLflow.

    • Truncate MLflow parameter values to 250 characters to avoid errors due to MLflow's parameter length limit.

    • Dict values can be flattened using dot (“.”) as the separator to log each value inside the dict separately.

  • Features supported by only PipelineX

    • [Time logging] Option to log execution time for each task (Kedro node) as MLflow metrics

    • [Gantt logging] Option to log Gantt chart HTML file that visualizes execution time using Plotly as an MLflow artifact (inspired by Apache Airflow)

    • [Automatic backend Kedro DataSets for common artifacts] Option to specify a common file extension ({“json”, “csv”, “xls”, “parquet”, “png”, “jpg”, “jpeg”, “img”, “pkl”, “txt”, “yml”, “yaml”}) so that the Kedro DataSet object will be created behind the scenes, instead of manually specifying a Kedro DataSet including the filepath in the catalog (inspired by Kedro Wings).

    • [Automatic logging for MLflow parameters and metrics] Option to log each dataset not listed in the catalog as MLflow parameter or metric, instead of manually specifying a Kedro DataSet in the catalog.

      • If the data type is either {float, int}, the value will be saved/loaded as an MLflow metric.

      • If the data type is either {str, list, tuple, set}, the value will be saved/loaded as an MLflow parameter.

      • If the data type is dict, the value will be flattened with dot (“.”) as the separator and then saved/loaded as either an MLflow metric or parameter based on each data type as explained above.

      • For example, "data_loading_config": {"train": {"batch_size": 32}} will be logged as the MLflow metric "data_loading_config.train.batch_size": 32.

    • [Flexible config per DataSet] Each Kedro DataSet can be configured differently. For example, one dict value can be logged as an MLflow parameter (string) as is, while another can be flattened and logged as MLflow metrics.

    • [Direct artifact logging] Option to specify the paths of any files to log as MLflow artifacts after the Kedro pipeline runs, without using a Kedro DataSet. This is useful for saving local files such as info/warning/error log files and intermediate model weights saved by machine learning packages such as PyTorch and TensorFlow.

    • [Environment Variable logging] Option to log environment variables as MLflow parameters and/or metrics

    • [Downloading] Option to download MLflow artifacts, parameters, and metrics from an existing MLflow experiment run using the Kedro DataSet

    • [Up to date] Support for Kedro 0.17.0 (released in Dec 2020) or later

  • Features provided by only kedro-mlflow

    • A wrapper for MLflow’s log_model

    • Configure MLflow logging in a YAML file

    • Option to use an MLflow tag or raise an error if MLflow parameter values exceed 250 characters