HatchDict
: YAML/JSON enhancement for Python developers¶
Import-less Python object (class and function)¶
YAML is a common text format used for application config files.
YAML’s most notable advantage is that it allows users to mix two styles: block style and flow style.
Example:
import yaml
from pprint import pprint # pretty-print for clearer look
# Read parameters dict from a YAML file in actual use
params_yaml = """
block_style_demo:
  key1: value1
  key2: value2
flow_style_demo: {key1: value1, key2: value2}
"""
parameters = yaml.safe_load(params_yaml)
print("### 2 styles in YAML ###")
pprint(parameters)
### 2 styles in YAML ###
{'block_style_demo': {'key1': 'value1', 'key2': 'value2'},
'flow_style_demo': {'key1': 'value1', 'key2': 'value2'}}
To store a highly nested (hierarchical) dict or list, YAML is more convenient than hard-coding it in Python code.

- YAML’s block style, which uses indentation, allows users to omit the opening and closing symbols of a Python dict or list ({} or []).
- YAML’s flow style, which uses opening and closing symbols, allows users to specify a Python dict or list within a single line.
So simply using YAML with Python will be the best way for Machine Learning experimentation?
Let’s check out the next example.
Example:
import yaml
from pprint import pprint # pretty-print for clearer look
# Read parameters dict from a YAML file in actual use
params_yaml = """
model_kind: LogisticRegression
model_params:
  C: 1.23456
  max_iter: 987
  random_state: 42
"""
parameters = yaml.safe_load(params_yaml)
print("### Before ###")
pprint(parameters)
model_kind = parameters.get("model_kind")
model_params_dict = parameters.get("model_params")
if model_kind == "LogisticRegression":
    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression(**model_params_dict)
elif model_kind == "DecisionTree":
    from sklearn.tree import DecisionTreeClassifier
    model = DecisionTreeClassifier(**model_params_dict)
elif model_kind == "RandomForest":
    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier(**model_params_dict)
else:
    raise ValueError("Unsupported model_kind.")
print("\n### After ###")
print(model)
### Before ###
{'model_kind': 'LogisticRegression',
'model_params': {'C': 1.23456, 'max_iter': 987, 'random_state': 42}}
### After ###
LogisticRegression(C=1.23456, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=987,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=42, solver='warn', tol=0.0001, verbose=0,
warm_start=False)
This way is inefficient, as we need to add import and if statements for the options in the Python code in addition to modifying the YAML config file.
Any better way?
PyYAML provides UnsafeLoader, which can load Python objects without import.
import yaml
# You do not need `import sklearn.linear_model` using PyYAML's UnsafeLoader
# Read parameters dict from a YAML file in actual use
params_yaml = """
model:
  !!python/object:sklearn.linear_model.LogisticRegression
  C: 1.23456
  max_iter: 987
  random_state: 42
"""
parameters = yaml.unsafe_load(params_yaml) # unsafe_load required
model = parameters.get("model")
print("### model object by PyYAML's UnsafeLoader ###")
print(model)
### model object by PyYAML's UnsafeLoader ###
LogisticRegression(C=1.23456, class_weight=None, dual=None, fit_intercept=None,
intercept_scaling=None, l1_ratio=None, max_iter=987,
multi_class=None, n_jobs=None, penalty=None, random_state=42,
solver=None, tol=None, verbose=None, warm_start=None)
PyYAML’s !!python/object and !!python/name, however, have the following problems.

- !!python/object and !!python/name are too long to write.
- Positional (non-keyword) arguments are apparently not supported.
Any better way?
PipelineX provides the solution.
PipelineX’s HatchDict provides an easier syntax, as follows, to convert Python dictionaries read from YAML or JSON files to Python objects without import.

- Use the "=" key to specify the package, module, and class/function, with "." as the separator, in foo_package.bar_module.baz_class format.
- [Optional] Use the "_" key to specify a positional argument or a list of positional arguments (args), if any.
- [Optional] Add keyword arguments (kwargs), if any.

To return an object instance like PyYAML’s !!python/object, feed positional and/or keyword arguments. If there are no arguments, just feed null (known as None in Python) to the "_" key.

To return an uninstantiated (raw) object like PyYAML’s !!python/name, just feed the "=" key without positional or keyword arguments.
Example:
from pipelinex import HatchDict
import yaml
from pprint import pprint # pretty-print for clearer look
# You do not need `import sklearn.linear_model` using PipelineX's HatchDict
# Read parameters dict from a YAML file in actual use
params_yaml = """
model:
  =: sklearn.linear_model.LogisticRegression
  C: 1.23456
  max_iter: 987
  random_state: 42
"""
parameters = yaml.safe_load(params_yaml)
model_dict = parameters.get("model")
print("### Before ###")
pprint(model_dict)
model = HatchDict(parameters).get("model")
print("\n### After ###")
print(model)
### Before ###
{'=': 'sklearn.linear_model.LogisticRegression',
'C': 1.23456,
'max_iter': 987,
'random_state': 42}
### After ###
LogisticRegression(C=1.23456, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=987,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=42, solver='warn', tol=0.0001, verbose=0,
warm_start=False)
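The three rules above can be sketched with only the Python standard library. The resolve function below is a hypothetical, simplified stand-in for illustration (not HatchDict’s actual implementation, which also recurses into nested values):

```python
from importlib import import_module

_MISSING = object()  # sentinel to tell "no '_' key" apart from "_: null"


def resolve(d):
    """Resolve a dict using the '=' / '_' / kwargs convention described above."""
    if not (isinstance(d, dict) and "=" in d):
        return d  # plain values pass through unchanged
    d = dict(d)  # work on a copy
    module_path, _, name = d.pop("=").rpartition(".")
    obj = getattr(import_module(module_path), name)  # dynamic import
    args = d.pop("_", _MISSING)
    if args is _MISSING and not d:
        return obj  # '=' alone: return the uninstantiated object
    if args is _MISSING or args is None:
        args = []
    elif not isinstance(args, list):
        args = [args]
    return obj(*args, **d)


# '=' alone behaves like !!python/name (raw, uninstantiated object):
print(resolve({"=": "collections.OrderedDict"}))  # <class 'collections.OrderedDict'>

# '=' with kwargs behaves like !!python/object (an instance):
print(resolve({"=": "datetime.date", "year": 2000, "month": 1, "day": 2}))  # 2000-01-02

# '=' with '_': null returns a no-argument instance:
print(resolve({"=": "builtins.dict", "_": None}))  # {}
```

The examples use standard-library classes so the sketch runs without scikit-learn installed; the same mechanism applies to any importable dotted path.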
This import-less Python object feature supports nested objects (objects that receive object arguments) by recursive depth-first search.
For more examples, please see Use with PyTorch and parameters.yml in example/demo projects.
This import-less Python object feature, inspired by the fact that Kedro uses load_obj for file I/O (DataSet), uses load_obj copied from kedro.utils, which dynamically imports Python objects using importlib, a Python standard library.
Anchor-less aliasing (self-lookup)¶
To avoid repetition, YAML natively provides the Anchor&Alias feature, and Jsonnet provides the Variable feature for JSON.
Example:
import yaml
from pprint import pprint # pretty-print for clearer look
# Read parameters dict from a YAML file in actual use
params_yaml = """
train_params:
  train_batch_size: &batch_size 32
  val_batch_size: *batch_size
"""
parameters = yaml.safe_load(params_yaml)
train_params_dict = parameters.get("train_params")
print("### Conversion by YAML's Anchor&Alias feature ###")
pprint(train_params_dict)
### Conversion by YAML's Anchor&Alias feature ###
{'train_batch_size': 32, 'val_batch_size': 32}
Unfortunately, YAML and Jsonnet require an extra medium (an anchor or a variable) to share the same value.
This is why PipelineX provides the anchor-less aliasing feature.
You can directly look up another value in the same YAML/JSON file using the "$" key, without an anchor or a variable.
To specify a nested key (a key in a dict of dicts), use "." as the separator.
Example:
from pipelinex import HatchDict
import yaml
from pprint import pprint # pretty-print for clearer look
# Read parameters dict from a YAML file in actual use
params_yaml = """
train_params:
  train_batch_size: 32
  val_batch_size: {$: train_params.train_batch_size}
"""
parameters = yaml.safe_load(params_yaml)
train_params_dict = parameters.get("train_params")
print("### Before ###")
pprint(train_params_dict)
train_params = HatchDict(parameters).get("train_params")
print("\n### After ###")
pprint(train_params)
### Before ###
{'train_batch_size': 32,
'val_batch_size': {'$': 'train_params.train_batch_size'}}
### After ###
{'train_batch_size': 32, 'val_batch_size': 32}
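A hypothetical sketch of how such self-lookup can be resolved (again, not PipelineX’s actual implementation): walk the dict recursively and, whenever a value is exactly {"$": "dotted.key"}, replace it by traversing the top-level dict along the dotted path.

```python
from functools import reduce


def self_lookup(parameters):
    """Replace each {"$": "a.b.c"} value with parameters["a"]["b"]["c"]."""
    def convert(node):
        if isinstance(node, dict):
            if set(node) == {"$"}:  # an alias node, e.g. {"$": "train_params.train_batch_size"}
                return reduce(lambda d, key: d[key], node["$"].split("."), parameters)
            return {k: convert(v) for k, v in node.items()}
        if isinstance(node, list):
            return [convert(v) for v in node]
        return node

    return convert(parameters)


parameters = {
    "train_params": {
        "train_batch_size": 32,
        "val_batch_size": {"$": "train_params.train_batch_size"},
    }
}
print(self_lookup(parameters)["train_params"])
# {'train_batch_size': 32, 'val_batch_size': 32}
```

Note that this simplified sketch looks values up in the unconverted dict, so an alias pointing at another alias would not be fully resolved.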
Python expression¶
Strings wrapped in parentheses are evaluated as a Python expression.
from pipelinex import HatchDict
import yaml
from pprint import pprint # pretty-print for clearer look
# Read parameters dict from a YAML file in actual use
params_yaml = """
train_params:
  param1_tuple_python: (1, 2, 3)
  param1_tuple_yaml: !!python/tuple [1, 2, 3]
  param2_formula_python: (2 + 3)
  param3_neg_inf_python: (float("-Inf"))
  param3_neg_inf_yaml: -.Inf
  param4_float_1e9_python: (1e9)
  param4_float_1e9_yaml: 1.0e+09
  param5_int_1e9_python: (int(1e9))
"""
parameters = yaml.load(params_yaml, Loader=yaml.FullLoader)  # FullLoader supports !!python/tuple
train_params_raw = parameters.get("train_params")
print("### Before ###")
pprint(train_params_raw)
train_params_converted = HatchDict(parameters).get("train_params")
print("\n### After ###")
pprint(train_params_converted)
### Before ###
{'param1_tuple_python': '(1, 2, 3)',
'param1_tuple_yaml': (1, 2, 3),
'param2_formula_python': '(2 + 3)',
'param3_neg_inf_python': '(float("-Inf"))',
'param3_neg_inf_yaml': -inf,
'param4_float_1e9_python': '(1e9)',
'param4_float_1e9_yaml': 1000000000.0,
'param5_int_1e9_python': '(int(1e9))'}
### After ###
{'param1_tuple_python': (1, 2, 3),
'param1_tuple_yaml': (1, 2, 3),
'param2_formula_python': 5,
'param3_neg_inf_python': -inf,
'param3_neg_inf_yaml': -inf,
'param4_float_1e9_python': 1000000000.0,
'param4_float_1e9_yaml': 1000000000.0,
'param5_int_1e9_python': 1000000000}
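This behavior can be sketched as follows (a hypothetical stand-in, not HatchDict’s actual implementation). Expressions such as int(1e9) are not literals, so ast.literal_eval alone would not suffice; eval() is used here, which means this conversion should only be applied to config files you trust.

```python
def eval_paren_string(value):
    """Evaluate a string wrapped in parentheses as a Python expression;
    return any other value unchanged. Only for trusted input."""
    if isinstance(value, str) and value.startswith("(") and value.endswith(")"):
        # restrict the namespace to the few names the examples above need
        return eval(value, {"__builtins__": {"float": float, "int": int}})
    return value


print(eval_paren_string("(1, 2, 3)"))        # (1, 2, 3)
print(eval_paren_string("(2 + 3)"))          # 5
print(eval_paren_string('(float("-Inf"))'))  # -inf
print(eval_paren_string("(int(1e9))"))       # 1000000000
print(eval_paren_string("plain string"))     # plain string
```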