## HatchDict: Python in YAML/JSON

[API document](https://pipelinex.readthedocs.io/en/latest/pipelinex.hatch_dict.html)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Minyus/pipelinex/blob/master/notebooks/HatchDict_demo.ipynb)

### Python objects in YAML/JSON

#### Introduction to YAML

YAML is a common text format used for application config files.

YAML's most notable advantage is that it lets users mix 2 styles, block style and flow style.

Example:

```python
import yaml
from pprint import pprint  # pretty-print for clearer look

# Read parameters dict from a YAML file in actual use
params_yaml = """
block_style_demo:
  key1: value1
  key2: value2
flow_style_demo: {key1: value1, key2: value2}
"""
parameters = yaml.safe_load(params_yaml)

print("### 2 styles in YAML ###")
pprint(parameters)
```

```
### 2 styles in YAML ###
{'block_style_demo': {'key1': 'value1', 'key2': 'value2'},
 'flow_style_demo': {'key1': 'value1', 'key2': 'value2'}}
```

For storing a highly nested (hierarchical) dict or list, YAML is more convenient than hard-coding it in Python.

- YAML's block style, which uses indentation, allows users to omit the opening and closing symbols of a Python dict or list (`{}` or `[]`).
- YAML's flow style, which uses opening and closing symbols, allows users to specify a Python dict or list within a single line.

So is simply using YAML with Python the best way to run Machine Learning experiments? Let's check out the next example.

Example:

```python
import yaml
from pprint import pprint  # pretty-print for clearer look

# Read parameters dict from a YAML file in actual use
params_yaml = """
model_kind: LogisticRegression
model_params:
  C: 1.23456
  max_iter: 987
  random_state: 42
"""
parameters = yaml.safe_load(params_yaml)

print("### Before ###")
pprint(parameters)

model_kind = parameters.get("model_kind")
model_params_dict = parameters.get("model_params")

# Each supported option needs its own branch and import
if model_kind == "LogisticRegression":
    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression(**model_params_dict)

elif model_kind == "DecisionTree":
    from sklearn.tree import DecisionTreeClassifier
    model = DecisionTreeClassifier(**model_params_dict)

elif model_kind == "RandomForest":
    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier(**model_params_dict)

else:
    raise ValueError("Unsupported model_kind.")

print("\n### After ###")
print(model)
```

```
### Before ###
{'model_kind': 'LogisticRegression',
 'model_params': {'C': 1.23456, 'max_iter': 987, 'random_state': 42}}

### After ###
LogisticRegression(C=1.23456, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=987,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=42, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
```

This approach is inefficient: every new option requires adding `import` and `if` statements to the Python code in addition to modifying the YAML config file.

Any better way?

#### Python tags in YAML

PyYAML provides `UnsafeLoader`, which can load Python objects without `import`.

Example usage of `!!python/object`:

```python
import yaml

# You do not need `import sklearn.linear_model` using PyYAML's UnsafeLoader

# Read parameters dict from a YAML file in actual use
params_yaml = """
model: !!python/object:sklearn.linear_model.LogisticRegression
  C: 1.23456
  max_iter: 987
  random_state: 42
"""
parameters = yaml.unsafe_load(params_yaml)  # unsafe_load required

model = parameters.get("model")

print("### model object by PyYAML's UnsafeLoader ###")
print(model)
```

```
### model object by PyYAML's UnsafeLoader ###
LogisticRegression(C=1.23456, class_weight=None, dual=None, fit_intercept=None,
                   intercept_scaling=None, l1_ratio=None, max_iter=987,
                   multi_class=None, n_jobs=None, penalty=None, random_state=42,
                   solver=None, tol=None, verbose=None, warm_start=None)
```

Example usage of `!!python/name`:

```python
import yaml

# Read parameters dict from a YAML file in actual use
params_yaml = """
numpy_array_func: !!python/name:numpy.array
"""
try:
    parameters = yaml.unsafe_load(params_yaml)  # unsafe_load required for PyYAML 5.1 or later
except AttributeError:
    # Older PyYAML has no unsafe_load; plain load handles Python tags there
    parameters = yaml.load(params_yaml)

numpy_array_func = parameters.get("numpy_array_func")

import numpy

assert numpy_array_func == numpy.array
```

[PyYAML's `!!python/object` and `!!python/name`](https://pyyaml.org/wiki/PyYAMLDocumentation), however, have the following problems.

- `!!python/object` and `!!python/name` are too long to write.
- Positional (unnamed) arguments are apparently not supported.

Any better way?

PipelineX provides the solution.

#### Alternative to Python tags in YAML

PipelineX's HatchDict provides an easier syntax, as follows, to convert Python dictionaries read from YAML or JSON files to Python objects without `import`.

- Use the `=` key to specify the package, module, and class/function with the `.` separator, in `foo_package.bar_module.baz_class` format.
- [Optional] Use the `_` key to specify a positional (unnamed) argument or a list of them, if any.
- [Optional] Add keyword arguments (kwargs) if any.

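
The key rules above can be illustrated with a short sketch built on `importlib`. This is not PipelineX's actual implementation: `load_obj_sketch` and `hatch_sketch` are hypothetical names, and the handling of edge cases (for example, wrapping a single non-list `_` value into a list) is an assumption made for illustration. Only standard-library objects are used so that the sketch runs without extra packages.

```python
import importlib


def load_obj_sketch(obj_path):
    """Import the module part of a dotted path and return the named attribute."""
    module_path, _, obj_name = obj_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, obj_name)


def hatch_sketch(node):
    """Depth-first conversion of dicts that have a "=" key (hypothetical helper)."""
    if isinstance(node, list):
        return [hatch_sketch(item) for item in node]
    if not isinstance(node, dict):
        return node
    # Convert children first so nested objects can be passed as arguments
    hatched = {key: hatch_sketch(value) for key, value in node.items()}
    if "=" not in hatched:
        return hatched
    obj = load_obj_sketch(hatched.pop("="))
    if not hatched:
        return obj  # only "=" was given: return the raw, uninstantiated object
    args = hatched.pop("_", None)
    if args is None:
        args = []  # "_: null" means "call with no positional arguments"
    elif not isinstance(args, list):
        args = [args]  # assumption: a single value is one positional argument
    return obj(*args, **hatched)


config = {
    "raw": {"=": "collections.OrderedDict"},
    "empty": {"=": "builtins.list", "_": None},
    "instance": {"=": "fractions.Fraction", "_": [3, 4]},
    "binary_to_int": {"=": "functools.partial", "_": {"=": "builtins.int"}, "base": 2},
}
hatched = hatch_sketch(config)
print(hatched["instance"])
print(hatched["binary_to_int"]("101"))
```

In this sketch the children of a dict are converted before the dict itself, which is what allows `functools.partial` to receive the already-loaded `int` as its positional argument.
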
To return an object instance like PyYAML's `!!python/object`, feed positional and/or keyword arguments. If the object takes no arguments, just feed null (known as `None` in Python) to the `_` key.

To return an uninstantiated (raw) object like PyYAML's `!!python/name`, just feed the `=` key without any arguments.

Example alternative to `!!python/object` specifying keyword arguments:

```python
from pipelinex import HatchDict
import yaml
from pprint import pprint  # pretty-print for clearer look

# You do not need `import sklearn.linear_model` using PipelineX's HatchDict

# Read parameters dict from a YAML file in actual use
params_yaml = """
model:
  =: sklearn.linear_model.LogisticRegression
  C: 1.23456
  max_iter: 987
  random_state: 42
"""
parameters = yaml.safe_load(params_yaml)

model_dict = parameters.get("model")

print("### Before ###")
pprint(model_dict)

model = HatchDict(parameters).get("model")

print("\n### After ###")
print(model)
```

```
### Before ###
{'=': 'sklearn.linear_model.LogisticRegression',
 'C': 1.23456,
 'max_iter': 987,
 'random_state': 42}

### After ###
LogisticRegression(C=1.23456, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=987,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=42, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
```

Example alternative to `!!python/object` specifying both positional and keyword arguments:

```python
from pipelinex import HatchDict
import yaml
from pprint import pprint  # pretty-print for clearer look

params_yaml = """
metrics:
  - =: functools.partial
    _:
      =: sklearn.metrics.roc_auc_score
    multiclass: ovr
"""
parameters = yaml.safe_load(params_yaml)

metrics_dict = parameters.get("metrics")

print("### Before ###")
pprint(metrics_dict)

metrics = HatchDict(parameters).get("metrics")

print("\n### After ###")
print(metrics)
```

```
### Before ###
[{'=': 'functools.partial',
  '_': {'=': 'sklearn.metrics.roc_auc_score'},
  'multiclass': 'ovr'}]

### After ###
[functools.partial(<function roc_auc_score at 0x...>, multiclass='ovr')]
```

Example alternative to `!!python/name`:

```python
from pipelinex import HatchDict
import yaml

# Read parameters dict from a YAML file in actual use
params_yaml = """
numpy_array_func:
  =: numpy.array
"""
parameters = yaml.safe_load(params_yaml)

numpy_array_func = HatchDict(parameters).get("numpy_array_func")

import numpy

assert numpy_array_func == numpy.array
```

This import-less Python object feature supports nested objects (objects that receive other objects as arguments) by recursive depth-first search.

For more examples, please see [Use with PyTorch](https://pipelinex.readthedocs.io/en/latest/section08.html#use-with-pytorch).

The feature was inspired by the fact that Kedro uses `load_obj` for file I/O (`DataSet`). It uses `load_obj` copied from [kedro.utils](https://github.com/quantumblacklabs/kedro/blob/0.15.4/kedro/utils.py), which dynamically imports Python objects using [`importlib`](https://docs.python.org/3.6/library/importlib.html), a Python standard library.

### Anchor-less aliasing in YAML/JSON

#### Aliasing in YAML

To avoid repetition, YAML natively provides the [Anchor&Alias](https://confluence.atlassian.com/bitbucket/yaml-anchors-960154027.html) feature, and [Jsonnet](https://github.com/google/jsonnet) provides the [Variable](https://github.com/google/jsonnet/blob/master/examples/variables.jsonnet) feature for JSON.

Example:

```python
import yaml
from pprint import pprint  # pretty-print for clearer look

# Read parameters dict from a YAML file in actual use
params_yaml = """
train_params:
  train_batch_size: &batch_size 32
  val_batch_size: *batch_size
"""
parameters = yaml.safe_load(params_yaml)

train_params_dict = parameters.get("train_params")

print("### Conversion by YAML's Anchor&Alias feature ###")
pprint(train_params_dict)
```

```
### Conversion by YAML's Anchor&Alias feature ###
{'train_batch_size': 32, 'val_batch_size': 32}
```

Unfortunately, YAML and Jsonnet require an intermediary (an anchor or a variable) to share the same value.

This is why PipelineX provides an anchor-less aliasing feature.

#### Alternative to aliasing in YAML

You can directly look up another value in the same YAML/JSON file using the "$" key, with neither an anchor nor a variable.

To specify a nested key (a key in a dict of dicts), use "." as the separator.

Example:

```python
from pipelinex import HatchDict
import yaml
from pprint import pprint  # pretty-print for clearer look

# Read parameters dict from a YAML file in actual use
params_yaml = """
train_params:
  train_batch_size: 32
  val_batch_size: {$: train_params.train_batch_size}
"""
parameters = yaml.safe_load(params_yaml)

train_params_dict = parameters.get("train_params")

print("### Before ###")
pprint(train_params_dict)

train_params = HatchDict(parameters).get("train_params")

print("\n### After ###")
pprint(train_params)
```

```
### Before ###
{'train_batch_size': 32,
 'val_batch_size': {'$': 'train_params.train_batch_size'}}

### After ###
{'train_batch_size': 32, 'val_batch_size': 32}
```

### Python expression in YAML/JSON

Strings wrapped in parentheses are evaluated as Python expressions.

```python
from pipelinex import HatchDict
import yaml
from pprint import pprint  # pretty-print for clearer look

# Read parameters dict from a YAML file in actual use
params_yaml = """
train_params:
  param1_tuple_python: (1, 2, 3)
  param1_tuple_yaml: !!python/tuple [1, 2, 3]
  param2_formula_python: (2 + 3)
  param3_neg_inf_python: (float("-Inf"))
  param3_neg_inf_yaml: -.Inf
  param4_float_1e9_python: (1e9)
  param4_float_1e9_yaml: 1.0e+09
  param5_int_1e9_python: (int(1e9))
"""
# `!!python/tuple` is rejected by safe_load, and PyYAML 6 requires an explicit Loader
parameters = yaml.load(params_yaml, Loader=yaml.FullLoader)

train_params_raw = parameters.get("train_params")

print("### Before ###")
pprint(train_params_raw)

train_params_converted = HatchDict(parameters).get("train_params")

print("\n### After ###")
pprint(train_params_converted)
```

```
### Before ###
{'param1_tuple_python': '(1, 2, 3)',
 'param1_tuple_yaml': (1, 2, 3),
 'param2_formula_python': '(2 + 3)',
 'param3_neg_inf_python': '(float("-Inf"))',
 'param3_neg_inf_yaml': -inf,
 'param4_float_1e9_python': '(1e9)',
 'param4_float_1e9_yaml': 1000000000.0,
 'param5_int_1e9_python': '(int(1e9))'}

### After ###
{'param1_tuple_python': (1, 2, 3),
 'param1_tuple_yaml': (1, 2, 3),
 'param2_formula_python': 5,
 'param3_neg_inf_python': -inf,
 'param3_neg_inf_yaml': -inf,
 'param4_float_1e9_python': 1000000000.0,
 'param4_float_1e9_yaml': 1000000000.0,
 'param5_int_1e9_python': 1000000000}
```
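
As a rough illustration of the rule above, the conversion can be approximated by walking the parameters and passing parenthesized strings to Python's built-in `eval`. This is a sketch under assumptions, not PipelineX's implementation: `eval_parentheses_sketch` is a hypothetical name, and the test for "wrapped in parentheses" is simply a check of the first and last characters.

```python
def eval_parentheses_sketch(node):
    """Recursively evaluate strings of the form "(...)" as Python expressions."""
    if isinstance(node, dict):
        return {key: eval_parentheses_sketch(value) for key, value in node.items()}
    if isinstance(node, list):
        return [eval_parentheses_sketch(item) for item in node]
    if isinstance(node, str) and node.startswith("(") and node.endswith(")"):
        return eval(node)  # executes arbitrary code: use with trusted config only
    return node  # every other value is left unchanged


raw_params = {
    "param1_tuple": "(1, 2, 3)",
    "param2_formula": "(2 + 3)",
    "param3_neg_inf": '(float("-Inf"))',
    "param5_int_1e9": "(int(1e9))",
    "plain_string": "left as is",
}
converted = eval_parentheses_sketch(raw_params)
print(converted)
```

Because `eval` runs whatever expression the string contains, this style of config is best reserved for files whose contents you control.
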