## MLflow-on-Kedro: Kedro plugin for MLflow users [API document](https://pipelinex.readthedocs.io/en/latest/pipelinex.mlflow_on_kedro.html) ### How to use MLflow from Kedro projects Kedro DataSet and Hooks (callbacks) are provided to use MLflow without adding any MLflow-related code in the node (task) functions. - [`pipelinex.MLflowDataSet`](https://pipelinex.readthedocs.io/en/latest/pipelinex.mlflow_on_kedro.datasets.mlflow.html) Kedro Dataset that saves data to or loads data from MLflow depending on `dataset` argument as follows. - If set to "p", the value will be saved/loaded as an MLflow parameter (string). - If set to "m", the value will be saved/loaded as an MLflow metric (numeric). - If set to "a", the value will be saved/loaded based on the data type. - If the data type is either {float, int}, the value will be saved/loaded as an MLflow metric. - If the data type is either {str, list, tuple, set}, the value will be saved/load as an MLflow parameter. - If the data type is dict, the value will be flattened with dot (".") as the separator and then saved/loaded as either an MLflow metric or parameter based on each data type as explained above. - If set to either {"json", "csv", "xls", "parquet", "png", "jpg", "jpeg", "img", "pkl", "txt", "yml", "yaml"}, the backend dataset instance will be created accordingly to save/load as an MLflow artifact. - If set to a Kedro DataSet object or a dictionary, it will be used as the backend dataset to save/load as an MLflow artifact. - If set to None (default), MLflow logging will be skipped. Regarding all the options, please see the [API document](https://pipelinex.readthedocs.io/en/latest/pipelinex.mlflow_on_kedro.datasets.mlflow.html) - Kedro Hooks - [`pipelinex.MLflowBasicLoggerHook`](https://pipelinex.readthedocs.io/en/latest/pipelinex.mlflow_on_kedro.hooks.mlflow.html#module-pipelinex.mlflow_on_kedro.hooks.mlflow.mlflow_basic_logger): Configures MLflow logging and logs duration time for the pipeline to MLflow. - [`pipelinex.MLflowArtifactsLoggerHook`](https://pipelinex.readthedocs.io/en/latest/pipelinex.mlflow_on_kedro.hooks.mlflow.html#module-pipelinex.mlflow_on_kedro.hooks.mlflow.mlflow_artifacts_logger): Logs artifacts of specified file paths and dataset names to MLflow. - [`pipelinex.MLflowDataSetsLoggerHook`](https://pipelinex.readthedocs.io/en/latest/pipelinex.mlflow_on_kedro.hooks.mlflow.html#pipelinex.mlflow_on_kedro.hooks.mlflow.mlflow_datasets_logger.MLflowDataSetsLoggerHook): Logs datasets of (list of) float/int and str classes to MLflow. - [`pipelinex.MLflowTimeLoggerHook`](https://pipelinex.readthedocs.io/en/latest/pipelinex.mlflow_on_kedro.hooks.mlflow.html#pipelinex.mlflow_on_kedro.hooks.mlflow.mlflow_time_logger.MLflowTimeLoggerHook): Logs duration time for each node (task) to MLflow and optionally visualizes the execution logs as a Gantt chart by [`plotly.figure_factory.create_gantt`](https://plotly.github.io/plotly.py-docs/generated/plotly.figure_factory.create_gantt.html) if `plotly` is installed. - [`pipelinex.AddTransformersHook`](https://pipelinex.readthedocs.io/en/latest/pipelinex.extras.hooks.html#pipelinex.extras.hooks.add_transformers.AddTransformersHook): Adds Kedro transformers such as: - [`pipelinex.MLflowIOTimeLoggerTransformer`](https://pipelinex.readthedocs.io/en/latest/pipelinex.mlflow_on_kedro.transformers.mlflow.html#pipelinex.mlflow_on_kedro.transformers.mlflow.mlflow_io_time_logger.MLflowIOTimeLoggerTransformer): Logs duration time to load and save each dataset with args: Regarding all the options, please see the [API document](https://pipelinex.readthedocs.io/en/latest/pipelinex.mlflow_on_kedro.hooks.mlflow.html) MLflow-ready Kedro projects can be generated by the [Kedro starters](https://github.com/Minyus/kedro-starters-sklearn) (Cookiecutter template) which include the following example config: ```yaml # catalog.yml # Write a pickle file & upload to MLflow model: type: pipelinex.MLflowDataSet dataset: pkl # Write a csv file & upload to MLflow pred_df: type: pipelinex.MLflowDataSet dataset: csv # Write an MLflow metric score: type: pipelinex.MLflowDataSet dataset: m ``` ```python # catalog.py (alternative to catalog.yml) catalog_dict = { "model": MLflowDataSet(dataset="pkl"), # Write a pickle file & upload to MLflow "pred_df": MLflowDataSet(dataset="csv"), # Write a csv file & upload to MLflow "score": MLflowDataSet(dataset="m"), # Write an MLflow metric } ``` ```python # mlflow_config.py import pipelinex mlflow_hooks = ( pipelinex.MLflowBasicLoggerHook( uri="sqlite:///mlruns/sqlite.db", experiment_name="experiment_001", artifact_location="./mlruns/experiment_001", offset_hours=0, ), pipelinex.MLflowCatalogLoggerHook( auto=True, ), pipelinex.MLflowArtifactsLoggerHook( filepaths_before_pipeline_run=["conf/base/parameters.yml"], filepaths_after_pipeline_run=[ "logs/info.log", "logs/errors.log", ], ), pipelinex.MLflowEnvVarsLoggerHook( param_env_vars=["HOSTNAME"], metric_env_vars=[], ), pipelinex.MLflowTimeLoggerHook(), pipelinex.AddTransformersHook( transformers=[pipelinex.MLflowIOTimeLoggerTransformer()], ), ) ```
Logged metrics shown in MLflow's UI
Gantt chart for execution time, generated using Plotly, shown in MLflow's UI