pipelinex.extras.datasets.pandas package


pipelinex.extras.datasets.pandas.csv_local module

CSVLocalDataSet loads and saves data to a local csv file. The underlying functionality is supported by pandas, so it supports all allowed pandas options for loading and saving csv files.

class pipelinex.extras.datasets.pandas.csv_local.CSVLocalDataSet(filepath, load_args=None, save_args=None, version=None)[source]

Bases: kedro.io.core.AbstractVersionedDataSet

CSVLocalDataSet loads and saves data to a local csv file. The underlying functionality is supported by pandas, so it supports all allowed pandas options for loading and saving csv files.


from pipelinex.extras.datasets.pandas.csv_local import CSVLocalDataSet
import pandas as pd

data = pd.DataFrame({'col1': [1, 2], 'col2': [4, 5],
                     'col3': [5, 6]})
data_set = CSVLocalDataSet(filepath="test.csv",
                           save_args={"index": False})
# Write the DataFrame to test.csv before reading it back.
data_set.save(data)
reloaded = data_set.load()

assert data.equals(reloaded)

DEFAULT_LOAD_ARGS: Dict[str, Any] = {}
DEFAULT_SAVE_ARGS: Dict[str, Any] = {'index': False}
__init__(filepath, load_args=None, save_args=None, version=None)[source]

Creates a new instance of CSVLocalDataSet pointing to a concrete filepath.


Raises: ValueError – If ‘filepath’ looks like a remote path.
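
The DEFAULT_LOAD_ARGS and DEFAULT_SAVE_ARGS attributes suggest that user-supplied options are layered over class-level defaults. The following is a minimal sketch of that pattern, assuming a plain dictionary merge in which user values win. The helper name merge_args is hypothetical and is not part of pipelinex or kedro; the actual implementation may differ.

```python
from copy import deepcopy
from typing import Any, Dict, Optional

# Mirrors the documented class attribute of CSVLocalDataSet.
DEFAULT_SAVE_ARGS: Dict[str, Any] = {"index": False}


def merge_args(defaults: Dict[str, Any],
               overrides: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Return a copy of defaults updated with overrides (hypothetical helper)."""
    merged = deepcopy(defaults)  # never mutate the class-level defaults
    if overrides is not None:
        merged.update(overrides)  # user-supplied values take precedence
    return merged


# No overrides: the defaults are used unchanged.
default_only = merge_args(DEFAULT_SAVE_ARGS, None)
# Overrides replace matching keys and add new ones.
custom = merge_args(DEFAULT_SAVE_ARGS, {"index": True, "sep": ";"})
print(default_only, custom)
```

Under this assumption, passing save_args={"sep": ";"} would still write the file without the index column, because the default "index" entry is kept unless it is explicitly overridden.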

pipelinex.extras.datasets.pandas.efficient_csv_local module

class pipelinex.extras.datasets.pandas.efficient_csv_local.EfficientCSVLocalDataSet(*args, preview_args=None, margin=100.0, verbose=True, **kwargs)[source]

Bases: pipelinex.extras.datasets.pandas.csv_local.CSVLocalDataSet

DEFAULT_LOAD_ARGS: Dict[str, Any] = {'engine': 'c', 'keep_default_na': False, 'na_values': [''], 'skiprows': 0}
DEFAULT_PREVIEW_ARGS: Dict[str, Any] = {'low_memory': False, 'nrows': None}
__init__(*args, preview_args=None, margin=100.0, verbose=True, **kwargs)[source]

Creates a new instance of EfficientCSVLocalDataSet pointing to a concrete filepath.

pipelinex.extras.datasets.pandas.efficient_csv_local.dict_string_val_prefix(d, prefix)[source]
pipelinex.extras.datasets.pandas.efficient_csv_local.dict_val_replace_except(d, to_except, new_value)[source]

pipelinex.extras.datasets.pandas.fixed_width_csv_dataset module

class pipelinex.extras.datasets.pandas.fixed_width_csv_dataset.FixedWidthCSVDataSet(*args, enable_fixed_width=True, num_decimal_places=9, **kwargs)[source]

Bases: kedro.extras.datasets.pandas.csv_dataset.CSVDataSet

FixedWidthCSVDataSet extends CSVDataSet, which loads/saves data from/to a CSV file using an underlying filesystem (e.g. local, S3, GCS). It uses pandas to handle the CSV file.

__init__(*args, enable_fixed_width=True, num_decimal_places=9, **kwargs)[source]

Creates a FixedWidthCSVDataSet pointing to a concrete CSV file on a specific filesystem.

  • filepath – Filepath in POSIX format to a CSV file prefixed with a protocol like s3://. If prefix is not provided, file protocol (local filesystem) will be used. The prefix should be any protocol supported by fsspec. Note: http(s) doesn’t support versioning.

  • load_args – Pandas options for loading CSV files. Here you can find all available arguments: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html All defaults are preserved.

  • save_args – Pandas options for saving CSV files. Here you can find all available arguments: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html All defaults are preserved except “index”, which is set to False.

  • version – If specified, should be an instance of kedro.io.core.Version. If its load attribute is None, the latest version will be loaded. If its save attribute is None, save version will be autogenerated.

  • credentials – Credentials required to get access to the underlying filesystem. E.g. for GCSFileSystem it should look like {“token”: None}.

  • fs_args – Extra arguments to pass into underlying filesystem class constructor (e.g. {“project”: “my-project”} for GCSFileSystem), as well as to pass to the filesystem’s open method through nested keys open_args_load and open_args_save. Here you can find all available arguments for open: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.open All defaults are preserved, except mode, which is set to r when loading and to w when saving.

  • enable_fixed_width (bool) – Save to a CSV file with each column width fixed across all the rows and the header, to improve readability for humans.

  • num_decimal_places (int) – Number of decimal places for float values to save.
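
The fs_args description mixes constructor arguments with the nested open_args_load and open_args_save keys. The sketch below shows one way such a dictionary could be split, assuming that "mode" is only filled in when the caller has not supplied it. The function name split_fs_args is hypothetical and is not part of pipelinex or kedro.

```python
from copy import deepcopy
from typing import Any, Dict, Optional, Tuple


def split_fs_args(
    fs_args: Optional[Dict[str, Any]],
) -> Tuple[Dict[str, Any], Dict[str, Any], Dict[str, Any]]:
    """Split fs_args into constructor, load-open, and save-open arguments."""
    remaining = deepcopy(fs_args) if fs_args else {}  # leave the input untouched
    open_args_load = remaining.pop("open_args_load", {})
    open_args_save = remaining.pop("open_args_save", {})
    # Assumption: defaults apply only when no mode was given explicitly.
    open_args_load.setdefault("mode", "r")
    open_args_save.setdefault("mode", "w")
    return remaining, open_args_load, open_args_save


fs_args = {
    "project": "my-project",                  # goes to the filesystem constructor
    "open_args_load": {"encoding": "utf-8"},  # goes to open() when loading
}
constructor_args, load_open_args, save_open_args = split_fs_args(fs_args)
print(constructor_args, load_open_args, save_open_args)
```

Here "project" is left for the filesystem constructor, while the encoding is only applied when the file is opened for loading.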

pipelinex.extras.datasets.pandas.fixed_width_csv_dataset.fix_width(df, num_decimal_places=9)[source]
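
To illustrate what a fixed column width means for a saved CSV, here is a standard-library sketch that pads every cell to the widest entry in its column, header included, and writes floats with a fixed number of decimal places. It is not the implementation of fix_width. The function name fix_width_rows, the list-based input, and the right-alignment are assumptions made for this example.

```python
from typing import List


def fix_width_rows(header: List[str], rows: List[List[object]],
                   num_decimal_places: int = 9) -> List[str]:
    """Render header and rows as CSV lines with each column padded to one width."""

    def to_text(value: object) -> str:
        # Floats are written with a fixed number of decimal places.
        if isinstance(value, float):
            return f"{value:.{num_decimal_places}f}"
        return str(value)

    table = [header] + [[to_text(v) for v in row] for row in rows]
    # A column's width is its longest cell, counting the header.
    widths = [max(len(cell) for cell in column) for column in zip(*table)]
    return [
        ",".join(cell.rjust(width) for cell, width in zip(line, widths))
        for line in table
    ]


lines = fix_width_rows(["name", "score"],
                       [["a", 1.5], ["long_name", 10.25]],
                       num_decimal_places=2)
print("\n".join(lines))
```

Every output line has the same length, so the columns line up when the file is viewed in a monospaced editor.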

pipelinex.extras.datasets.pandas.histgram module

class pipelinex.extras.datasets.pandas.histgram.HistgramDataSet(filepath, save_args=None, hist_args=None)[source]

Bases: kedro.io.core.AbstractDataSet

__init__(filepath, save_args=None, hist_args=None)[source]

Initialize self. See help(type(self)) for accurate signature.

pipelinex.extras.datasets.pandas.pandas_cat_matrix module

class pipelinex.extras.datasets.pandas.pandas_cat_matrix.PandasCatMatrixDataSet(*args, describe_args={}, **kwargs)[source]

Bases: pipelinex.extras.datasets.pandas.csv_local.CSVLocalDataSet

PandasDescribeDataSet saves output of df.describe.

__init__(*args, describe_args={}, **kwargs)[source]

Creates a new instance of PandasCatMatrixDataSet pointing to a concrete filepath.


pipelinex.extras.datasets.pandas.pandas_describe module

class pipelinex.extras.datasets.pandas.pandas_describe.PandasDescribeDataSet(*args, describe_args={}, **kwargs)[source]

Bases: pipelinex.extras.datasets.pandas.csv_local.CSVLocalDataSet

PandasDescribeDataSet saves the output of df.describe().

__init__(*args, describe_args={}, **kwargs)[source]

Creates a new instance of PandasDescribeDataSet pointing to a concrete filepath.
