HatchDict: Python in YAML/JSON
Python objects in YAML/JSON
Introduction to YAML
YAML is a common text format used for application config files.
A notable advantage of YAML is that it lets users mix two styles: block style and flow style.
Example:
import yaml
from pprint import pprint # pretty-print for clearer look
# Read parameters dict from a YAML file in actual use
params_yaml = """
block_style_demo:
  key1: value1
  key2: value2
flow_style_demo: {key1: value1, key2: value2}
"""
parameters = yaml.safe_load(params_yaml)
print("### 2 styles in YAML ###")
pprint(parameters)
### 2 styles in YAML ###
{'block_style_demo': {'key1': 'value1', 'key2': 'value2'},
'flow_style_demo': {'key1': 'value1', 'key2': 'value2'}}
For storing a highly nested (hierarchical) dict or list, YAML is more convenient than hard-coding it in Python code.
YAML’s block style, which uses indentation, allows users to omit the opening and closing symbols ({} or []) when specifying a Python dict or list.
YAML’s flow style, which uses opening and closing symbols, allows users to specify a Python dict or list within a single line.
So is simply using YAML with Python the best approach for machine learning experimentation?
Let’s check out the next example.
Example:
import yaml
from pprint import pprint # pretty-print for clearer look
# Read parameters dict from a YAML file in actual use
params_yaml = """
model_kind: LogisticRegression
model_params:
  C: 1.23456
  max_iter: 987
  random_state: 42
"""
parameters = yaml.safe_load(params_yaml)
print("### Before ###")
pprint(parameters)
model_kind = parameters.get("model_kind")
model_params_dict = parameters.get("model_params")
if model_kind == "LogisticRegression":
    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression(**model_params_dict)
elif model_kind == "DecisionTree":
    from sklearn.tree import DecisionTreeClassifier
    model = DecisionTreeClassifier(**model_params_dict)
elif model_kind == "RandomForest":
    from sklearn.ensemble import RandomForestClassifier
    model = RandomForestClassifier(**model_params_dict)
else:
    raise ValueError("Unsupported model_kind.")
print("\n### After ###")
print(model)
### Before ###
{'model_kind': 'LogisticRegression',
'model_params': {'C': 1.23456, 'max_iter': 987, 'random_state': 42}}
### After ###
LogisticRegression(C=1.23456, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=987,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=42, solver='warn', tol=0.0001, verbose=0,
warm_start=False)
This approach is inefficient: to support each option, we need to add import and if statements to the Python code in addition to modifying the YAML config file.
Is there a better way?
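One generic way to avoid a growing if/elif chain is to store the import path of the class in the config and resolve it dynamically. The sketch below is an illustration only, not PipelineX's implementation: the `class_path` key and the `build_from_config` helper are hypothetical names, and `fractions.Fraction` from the standard library stands in for a scikit-learn estimator so that the snippet runs without extra dependencies.

```python
import importlib


def build_from_config(config):
    """Instantiate the class named by the hypothetical 'class_path' key.

    All remaining keys are passed to the class as keyword arguments.
    """
    kwargs = dict(config)  # copy so the caller's dict is not mutated
    module_name, _, class_name = kwargs.pop("class_path").rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**kwargs)


# A dict as it might be returned by yaml.safe_load (assumed layout).
config = {"class_path": "fractions.Fraction", "numerator": 1, "denominator": 3}
obj = build_from_config(config)
print(obj)
```

With a scheme like this, switching to another class would only require editing the config file, for example by setting `class_path` to `sklearn.tree.DecisionTreeClassifier`.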
Anchor-less aliasing in YAML/JSON
Aliasing in YAML
To avoid repetition, YAML natively provides the Anchor & Alias feature, and Jsonnet provides a Variable feature for JSON.
Example:
import yaml
from pprint import pprint # pretty-print for clearer look
# Read parameters dict from a YAML file in actual use
params_yaml="""
train_params:
  train_batch_size: &batch_size 32
  val_batch_size: *batch_size
"""
parameters = yaml.safe_load(params_yaml)
train_params_dict = parameters.get("train_params")
print("### Conversion by YAML's Anchor&Alias feature ###")
pprint(train_params_dict)
### Conversion by YAML's Anchor&Alias feature ###
{'train_batch_size': 32, 'val_batch_size': 32}
Unfortunately, both YAML and Jsonnet require an intermediary (an anchor or a variable) to share the same value.
This is why PipelineX provides an anchor-less aliasing feature.
Alternative to aliasing in YAML
You can directly look up another value in the same YAML/JSON file using the “$” key, without an anchor or a variable.
To specify a nested key (a key in a dict of dicts), use “.” as the separator.
Example:
from pipelinex import HatchDict
import yaml
from pprint import pprint # pretty-print for clearer look
# Read parameters dict from a YAML file in actual use
params_yaml="""
train_params:
  train_batch_size: 32
  val_batch_size: {$: train_params.train_batch_size}
"""
parameters = yaml.safe_load(params_yaml)
train_params_dict = parameters.get("train_params")
print("### Before ###")
pprint(train_params_dict)
train_params = HatchDict(parameters).get("train_params")
print("\n### After ###")
pprint(train_params)
### Before ###
{'train_batch_size': 32,
'val_batch_size': {'$': 'train_params.train_batch_size'}}
### After ###
{'train_batch_size': 32, 'val_batch_size': 32}
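To make the lookup rule concrete, here is a minimal sketch of how a “$” reference with a “.”-separated path could be resolved against the top-level dict. It is an illustration under assumptions, not HatchDict's actual implementation: `lookup` and `resolve_aliases` are hypothetical helper names, and lists and chained references are not handled.

```python
def lookup(root, dotted_key):
    """Follow a '.'-separated key path through nested dicts."""
    node = root
    for part in dotted_key.split("."):
        node = node[part]
    return node


def resolve_aliases(node, root):
    """Return a copy of node with every {'$': path} dict replaced by its target."""
    if isinstance(node, dict):
        if set(node) == {"$"}:
            return lookup(root, node["$"])
        return {key: resolve_aliases(value, root) for key, value in node.items()}
    return node


# The same structure that yaml.safe_load returns in the example above.
parameters = {
    "train_params": {
        "train_batch_size": 32,
        "val_batch_size": {"$": "train_params.train_batch_size"},
    }
}
resolved = resolve_aliases(parameters, parameters)
print(resolved["train_params"])
```

The reference is always resolved from the top-level dict, which is why the path starts with `train_params` even though the “$” entry sits inside `train_params` itself.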
Python expression in YAML/JSON
Strings wrapped in parentheses are evaluated as Python expressions.
from pipelinex import HatchDict
import yaml
from pprint import pprint # pretty-print for clearer look
# Read parameters dict from a YAML file in actual use
params_yaml = """
train_params:
  param1_tuple_python: (1, 2, 3)
  param1_tuple_yaml: !!python/tuple [1, 2, 3]
  param2_formula_python: (2 + 3)
  param3_neg_inf_python: (float("-Inf"))
  param3_neg_inf_yaml: -.Inf
  param4_float_1e9_python: (1e9)
  param4_float_1e9_yaml: 1.0e+09
  param5_int_1e9_python: (int(1e9))
"""
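# FullLoader is needed for the !!python/tuple tag; recent PyYAML versions require an explicit Loader
parameters = yaml.load(params_yaml, Loader=yaml.FullLoader)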
train_params_raw = parameters.get("train_params")
print("### Before ###")
pprint(train_params_raw)
train_params_converted = HatchDict(parameters).get("train_params")
print("\n### After ###")
pprint(train_params_converted)
### Before ###
{'param1_tuple_python': '(1, 2, 3)',
'param1_tuple_yaml': (1, 2, 3),
'param2_formula_python': '(2 + 3)',
'param3_neg_inf_python': '(float("-Inf"))',
'param3_neg_inf_yaml': -inf,
'param4_float_1e9_python': '(1e9)',
'param4_float_1e9_yaml': 1000000000.0,
'param5_int_1e9_python': '(int(1e9))'}
### After ###
{'param1_tuple_python': (1, 2, 3),
'param1_tuple_yaml': (1, 2, 3),
'param2_formula_python': 5,
'param3_neg_inf_python': -inf,
'param3_neg_inf_yaml': -inf,
'param4_float_1e9_python': 1000000000.0,
'param4_float_1e9_yaml': 1000000000.0,
'param5_int_1e9_python': 1000000000}
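The conversion above can be approximated with a short recursive walk that evaluates any string starting with “(” and ending with “)”. This is a sketch under assumptions rather than HatchDict's actual code: `eval_parenthesized` is a hypothetical name, and because it relies on Python's built-in `eval`, it should only be applied to config files you trust.

```python
def eval_parenthesized(node):
    """Recursively evaluate strings of the form '(...)' as Python expressions."""
    if isinstance(node, dict):
        return {key: eval_parenthesized(value) for key, value in node.items()}
    if isinstance(node, list):
        return [eval_parenthesized(item) for item in node]
    if isinstance(node, str) and node.startswith("(") and node.endswith(")"):
        return eval(node)  # unsafe for untrusted input
    return node


# Raw values as a YAML loader would return them (plain strings).
raw = {
    "param1_tuple_python": "(1, 2, 3)",
    "param2_formula_python": "(2 + 3)",
    "param5_int_1e9_python": "(int(1e9))",
    "plain_string": "unchanged",
}
converted = eval_parenthesized(raw)
print(converted)
```

Strings without the surrounding parentheses, and values that are already numbers or tuples, pass through untouched.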