pipelinex.extras.datasets.pandas package

Submodules

pipelinex.extras.datasets.pandas.csv_local module

CSVLocalDataSet loads and saves data to a local csv file. The underlying functionality is supported by pandas, so it supports all allowed pandas options for loading and saving csv files.

class pipelinex.extras.datasets.pandas.csv_local.CSVLocalDataSet(filepath, load_args=None, save_args=None, version=None)[source]

Bases: pipelinex.extras.datasets.core.AbstractVersionedDataSet

CSVLocalDataSet loads and saves data to a local csv file. The underlying functionality is supported by pandas, so it supports all allowed pandas options for loading and saving csv files.

Example:

from pipelinex.extras.datasets.pandas.csv_local import CSVLocalDataSet
import pandas as pd

data = pd.DataFrame({'col1': [1, 2], 'col2': [4, 5],
                     'col3': [5, 6]})
data_set = CSVLocalDataSet(filepath="test.csv",
                           load_args=None,
                           save_args={"index": False})
data_set.save(data)
reloaded = data_set.load()

assert data.equals(reloaded)

DEFAULT_LOAD_ARGS: Dict[str, Any] = {}
DEFAULT_SAVE_ARGS: Dict[str, Any] = {'index': False}

__init__(filepath, load_args=None, save_args=None, version=None)[source]

Creates a new instance of CSVLocalDataSet pointing to a concrete filepath.

Parameters:
Raises:

ValueError – If ‘filepath’ looks like a remote path.
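
The ValueError above guards against paths that carry a URL-style protocol prefix. The following is a minimal sketch of such a check, assuming that a "remote path" means anything starting with `scheme://`. The helper names `looks_remote` and `check_local` and the regular expression are illustrative only and are not taken from pipelinex.

```python
import re

# Hypothetical pattern: a protocol prefix such as "s3://" or "https://".
_REMOTE_PREFIX = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")


def looks_remote(filepath: str) -> bool:
    """Return True if filepath starts with a protocol prefix (assumed rule)."""
    return bool(_REMOTE_PREFIX.match(filepath))


def check_local(filepath: str) -> str:
    """Raise ValueError for remote-looking paths, otherwise return the path."""
    if looks_remote(filepath):
        raise ValueError(f"'{filepath}' looks like a remote path.")
    return filepath
```

Under this assumed rule, `check_local("test.csv")` returns the path unchanged, while `check_local("s3://bucket/test.csv")` raises ValueError.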

pipelinex.extras.datasets.pandas.efficient_csv_local module

class pipelinex.extras.datasets.pandas.efficient_csv_local.EfficientCSVLocalDataSet(*args, preview_args=None, margin=100.0, verbose=True, **kwargs)[source]

Bases: pipelinex.extras.datasets.pandas.csv_local.CSVLocalDataSet

DEFAULT_LOAD_ARGS: Dict[str, Any] = {'engine': 'c', 'keep_default_na': False, 'na_values': [''], 'skiprows': 0}
DEFAULT_PREVIEW_ARGS: Dict[str, Any] = {'low_memory': False, 'nrows': None}
__init__(*args, preview_args=None, margin=100.0, verbose=True, **kwargs)[source]

Creates a new instance of EfficientCSVLocalDataSet pointing to a concrete filepath.

Parameters:
pipelinex.extras.datasets.pandas.efficient_csv_local.dict_string_val_prefix(d, prefix)[source]
pipelinex.extras.datasets.pandas.efficient_csv_local.dict_val_replace_except(d, to_except, new_value)[source]
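
The NA-related entries in DEFAULT_LOAD_ARGS above change how pandas interprets missing values: with keep_default_na=False and na_values=[''], only empty fields are parsed as missing, so literal strings such as "NA" are kept as text. The sketch below shows this with plain pandas.read_csv on an in-memory CSV. The sample data is made up, and the snippet does not go through EfficientCSVLocalDataSet itself.

```python
import io

import pandas as pd

# Same values as the DEFAULT_LOAD_ARGS listed above.
load_args = {"engine": "c", "keep_default_na": False, "na_values": [""], "skiprows": 0}

# Made-up sample: one literal "NA" string and one empty field.
csv_text = "name,score\nNA,1\n,2\n"
df = pd.read_csv(io.StringIO(csv_text), **load_args)

# "NA" survives as text; only the empty field is treated as missing.
print(df["name"].tolist())
print(df["name"].isna().tolist())
```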

pipelinex.extras.datasets.pandas.fixed_width_csv_dataset module

class pipelinex.extras.datasets.pandas.fixed_width_csv_dataset.FixedWidthCSVDataSet(*args, enable_fixed_width=True, num_decimal_places=9, **kwargs)[source]

Bases: kedro.io.core.AbstractVersionedDataset[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]

FixedWidthCSVDataSet loads/saves data from/to a CSV file using an underlying filesystem (e.g. local, S3, GCS), optionally saving with fixed column widths. It uses pandas to handle the CSV file.

__init__(*args, enable_fixed_width=True, num_decimal_places=9, **kwargs)[source]

Creates a FixedWidthCSVDataSet pointing to a concrete CSV file on a specific filesystem.

Parameters:
  • filepath – Filepath in POSIX format to a CSV file prefixed with a protocol like s3://. If prefix is not provided, file protocol (local filesystem) will be used. The prefix should be any protocol supported by fsspec. Note: http(s) doesn’t support versioning.

  • load_args – Pandas options for loading CSV files. Here you can find all available arguments: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html All defaults are preserved.

  • save_args – Pandas options for saving CSV files. Here you can find all available arguments: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html All defaults are preserved except “index”, which is set to False.

  • version – If specified, should be an instance of kedro.io.core.Version. If its load attribute is None, the latest version will be loaded. If its save attribute is None, save version will be autogenerated.

  • credentials – Credentials required to get access to the underlying filesystem. E.g. for GCSFileSystem it should look like {“token”: None}.

  • fs_args – Extra arguments to pass into underlying filesystem class constructor (e.g. {“project”: “my-project”} for GCSFileSystem), as well as to pass to the filesystem’s open method through nested keys open_args_load and open_args_save. Here you can find all available arguments for open: https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.open All defaults are preserved, except mode, which is set to r when loading and to w when saving.

  • enable_fixed_width (bool) – Save to a CSV file with each column width fixed across all rows and the header, to improve readability for humans.

  • num_decimal_places (int) – Number of decimal places for float values to save.
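
The shapes of the version, credentials and fs_args values described above can be sketched as plain Python values. This is an illustration only: the `Version` namedtuple is a local stand-in for kedro.io.core.Version (assumed to expose the load and save attributes named above), and the encoding entries are placeholder keyword arguments for the filesystem's open method.

```python
from collections import namedtuple

# Local stand-in for kedro.io.core.Version (assumed to carry `load` and `save`).
Version = namedtuple("Version", ["load", "save"])

# load=None: load the latest version; save=None: autogenerate the save version.
version = Version(load=None, save=None)

# Shape quoted in the parameter list for GCSFileSystem.
credentials = {"token": None}

fs_args = {
    # Passed to the filesystem class constructor.
    "project": "my-project",
    # Placeholder kwargs forwarded to open() when loading.
    "open_args_load": {"encoding": "utf-8"},
    # Placeholder kwargs forwarded to open() when saving.
    "open_args_save": {"encoding": "utf-8"},
}
```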

pipelinex.extras.datasets.pandas.fixed_width_csv_dataset.fix_width(df, num_decimal_places=9)[source]
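
To make the idea of fixed-width saving concrete, here is a rough standard-library sketch of padding every cell so that each column keeps one width across the header and all rows, with floats rendered to a fixed number of decimal places. It is not the fix_width implementation from pipelinex, which takes a DataFrame. The helper names `format_cell` and `to_fixed_width_rows` and the choice of right-justified padding are assumptions made for illustration.

```python
from typing import Any, List, Sequence


def format_cell(value: Any, num_decimal_places: int = 9) -> str:
    """Render floats with a fixed number of decimals; everything else via str()."""
    if isinstance(value, float):
        return f"{value:.{num_decimal_places}f}"
    return str(value)


def to_fixed_width_rows(
    header: Sequence[str],
    rows: Sequence[Sequence[Any]],
    num_decimal_places: int = 9,
) -> List[str]:
    """Pad every cell so each column has one width across the header and all rows."""
    table = [list(header)] + [
        [format_cell(v, num_decimal_places) for v in row] for row in rows
    ]
    widths = [max(len(line[i]) for line in table) for i in range(len(header))]
    # Right-justify cells (an assumed choice) and join them with commas.
    return [",".join(cell.rjust(w) for cell, w in zip(line, widths)) for line in table]


lines = to_fixed_width_rows(["id", "value"], [[1, 0.5], [20, 12.25]], num_decimal_places=2)
print("\n".join(lines))
```

Every output line has the same length, so the columns line up when the file is viewed in a plain text editor.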

pipelinex.extras.datasets.pandas.histgram module

class pipelinex.extras.datasets.pandas.histgram.HistgramDataSet(filepath, save_args=None, hist_args=None)[source]

Bases: pipelinex.extras.datasets.core.AbstractDataSet

__init__(filepath, save_args=None, hist_args=None)[source]

Initialize self. See help(type(self)) for accurate signature.

pipelinex.extras.datasets.pandas.pandas_cat_matrix module

class pipelinex.extras.datasets.pandas.pandas_cat_matrix.PandasCatMatrixDataSet(*args, describe_args={}, **kwargs)[source]

Bases: pipelinex.extras.datasets.pandas.csv_local.CSVLocalDataSet

PandasDescribeDataSet saves output of df.describe.

__init__(*args, describe_args={}, **kwargs)[source]

Creates a new instance of PandasCatMatrixDataSet pointing to a concrete filepath.

Parameters:

pipelinex.extras.datasets.pandas.pandas_describe module

class pipelinex.extras.datasets.pandas.pandas_describe.PandasDescribeDataSet(*args, describe_args={}, **kwargs)[source]

Bases: pipelinex.extras.datasets.pandas.csv_local.CSVLocalDataSet

PandasDescribeDataSet saves output of df.describe.
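
As a point of reference, the snippet below shows what df.describe produces and how that summary looks when written as CSV, using plain pandas only. The sample data and the describe_args value are made up, and how PandasDescribeDataSet forwards describe_args or configures the save step internally is not shown here.

```python
import io

import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4.0, 5.0, 6.0]})

# Hypothetical describe_args; any keyword accepted by DataFrame.describe works.
describe_args = {"percentiles": [0.5]}
summary = df.describe(**describe_args)

# Write the summary as CSV text, keeping the statistic names in the index column.
buffer = io.StringIO()
summary.to_csv(buffer)
print(buffer.getvalue())
```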

__init__(*args, describe_args={}, **kwargs)[source]

Creates a new instance of PandasDescribeDataSet pointing to a concrete filepath.

Parameters: