pipelinex.extras.datasets.pandas package
Submodules
pipelinex.extras.datasets.pandas.csv_local module
CSVLocalDataSet
loads and saves data to a local csv file. The
underlying functionality is supported by pandas, so it supports all
allowed pandas options for loading and saving csv files.
-
class pipelinex.extras.datasets.pandas.csv_local.CSVLocalDataSet(filepath, load_args=None, save_args=None, version=None)
Bases: pipelinex.extras.datasets.core.AbstractVersionedDataSet
CSVLocalDataSet loads and saves data to a local CSV file. The underlying functionality is provided by pandas, so it supports all pandas options for loading and saving CSV files.
Example:
    from pipelinex.extras.datasets.pandas.csv_local import CSVLocalDataSet
    import pandas as pd

    data = pd.DataFrame({'col1': [1, 2], 'col2': [4, 5], 'col3': [5, 6]})
    data_set = CSVLocalDataSet(filepath="test.csv", load_args=None, save_args={"index": False})
    data_set.save(data)
    reloaded = data_set.load()
    assert data.equals(reloaded)
-
DEFAULT_LOAD_ARGS: Dict[str, Any] = {}
-
DEFAULT_SAVE_ARGS: Dict[str, Any] = {'index': False}
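The class-level defaults above are combined with the `load_args` and `save_args` given to the constructor. The following is a minimal sketch of one way such a merge can work, assuming the defaults are copied first and the user-supplied options are then applied on top. The helper name `merge_args` is hypothetical and is not part of pipelinex.

```python
import copy
from typing import Any, Dict, Optional

# Values copied from the class attributes documented above.
DEFAULT_LOAD_ARGS: Dict[str, Any] = {}
DEFAULT_SAVE_ARGS: Dict[str, Any] = {"index": False}


def merge_args(
    defaults: Dict[str, Any], overrides: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Hypothetical helper: copy the defaults, then apply user overrides."""
    merged = copy.deepcopy(defaults)  # never mutate the shared class defaults
    if overrides is not None:
        merged.update(overrides)  # user-supplied keys win over defaults
    return merged


# Adding an option keeps the default "index": False.
effective_save_args = merge_args(DEFAULT_SAVE_ARGS, {"sep": ";"})

# Supplying "index" explicitly replaces the default.
overridden_save_args = merge_args(DEFAULT_SAVE_ARGS, {"index": True})
```

Copying before updating keeps the class-level dictionaries unchanged, so one instance's options cannot leak into another.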
-
__init__(filepath, load_args=None, save_args=None, version=None)
Creates a new instance of CSVLocalDataSet pointing to a concrete filepath.
- Parameters:
  - filepath (str) – Path to a CSV file.
  - load_args (Optional[Dict[str, Any]]) – Pandas options for loading CSV files. See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html for all available arguments. All defaults are preserved.
  - save_args (Optional[Dict[str, Any]]) – Pandas options for saving CSV files. See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html for all available arguments. All defaults are preserved except “index”, which is set to False.
  - version (Optional[Version]) – If specified, should be an instance of kedro.io.core.Version. If its load attribute is None, the latest version will be loaded. If its save attribute is None, the save version will be autogenerated.
- Raises:
ValueError – If ‘filepath’ looks like a remote path.
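The `version` rules described above (load the latest version when `load` is None, autogenerate a save version when `save` is None) can be sketched as follows. This is an illustration only: `Version` here is a stand-in tuple rather than an import from kedro, and `resolve_versions`, `existing_versions`, and the timestamp layout are assumptions made for the example.

```python
from collections import namedtuple
from datetime import datetime, timezone

# Stand-in for kedro.io.core.Version, assumed here to be a (load, save) pair.
Version = namedtuple("Version", ["load", "save"])


def resolve_versions(version, existing_versions, now=None):
    """Hypothetical helper returning the (load, save) versions to use.

    existing_versions: assumed list of version strings already saved.
    now: optional datetime, injectable so the result is reproducible.
    """
    if version.load is not None:
        load_version = version.load
    else:
        # Timestamp-like strings sort lexicographically, so max() is the latest.
        load_version = max(existing_versions) if existing_versions else None

    if version.save is not None:
        save_version = version.save
    else:
        # Assumed timestamp layout, for illustration only.
        moment = now or datetime.now(timezone.utc)
        save_version = moment.strftime("%Y-%m-%dT%H.%M.%SZ")
    return load_version, save_version
```

Passing `Version(None, None)` therefore means "read the newest data and write under a fresh version".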
-
pipelinex.extras.datasets.pandas.efficient_csv_local module
-
class pipelinex.extras.datasets.pandas.efficient_csv_local.EfficientCSVLocalDataSet(*args, preview_args=None, margin=100.0, verbose=True, **kwargs)
Bases: pipelinex.extras.datasets.pandas.csv_local.CSVLocalDataSet
-
DEFAULT_LOAD_ARGS: Dict[str, Any] = {'engine': 'c', 'keep_default_na': False, 'na_values': [''], 'skiprows': 0}
-
DEFAULT_PREVIEW_ARGS: Dict[str, Any] = {'low_memory': False, 'nrows': None}
-
__init__(*args, preview_args=None, margin=100.0, verbose=True, **kwargs)
Creates a new instance of EfficientCSVLocalDataSet pointing to a concrete filepath.
- Parameters:
  - args – Positional arguments for CSVLocalDataSet.
  - preview_args (Optional[Dict[str, Any]]) – Arguments passed on to df.describe. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html for details.
  - kwargs – Keyword arguments for CSVLocalDataSet.
-
pipelinex.extras.datasets.pandas.fixed_width_csv_dataset module
-
class pipelinex.extras.datasets.pandas.fixed_width_csv_dataset.FixedWidthCSVDataSet(*args, enable_fixed_width=True, num_decimal_places=9, **kwargs)
Bases: kedro.io.core.AbstractVersionedDataset[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]
FixedWidthCSVDataSet loads/saves data from/to a CSV file using an underlying filesystem (e.g. local, S3, GCS). It uses pandas to handle the CSV file.
-
__init__(*args, enable_fixed_width=True, num_decimal_places=9, **kwargs)
Creates a FixedWidthCSVDataSet pointing to a concrete CSV file on a specific filesystem.
- Parameters:
  - filepath – Filepath in POSIX format to a CSV file, prefixed with a protocol like s3://. If no prefix is provided, the file protocol (local filesystem) is used. The prefix should be any protocol supported by fsspec. Note: http(s) doesn’t support versioning.
  - load_args – Pandas options for loading CSV files. See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html for all available arguments. All defaults are preserved.
  - save_args – Pandas options for saving CSV files. See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html for all available arguments. All defaults are preserved except “index”, which is set to False.
  - version – If specified, should be an instance of kedro.io.core.Version. If its load attribute is None, the latest version will be loaded. If its save attribute is None, the save version will be autogenerated.
  - credentials – Credentials required to get access to the underlying filesystem. E.g. for GCSFileSystem it should look like {“token”: None}.
  - fs_args – Extra arguments to pass into the underlying filesystem class constructor (e.g. {“project”: “my-project”} for GCSFileSystem), as well as to pass to the filesystem’s open method through the nested keys open_args_load and open_args_save. See https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.spec.AbstractFileSystem.open for all available arguments for open. All defaults are preserved, except mode, which is set to r when loading and to w when saving.
  - enable_fixed_width (bool) – Save to a CSV file with each column's width fixed across all rows and the header, to improve readability for humans.
  - num_decimal_places (int) – Number of decimal places for float values to save.
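To make the effect of enable_fixed_width and num_decimal_places concrete, here is a small standard-library sketch of fixed-width CSV formatting. It is not the pipelinex implementation: the function name `to_fixed_width_lines` and the choice to right-align cells are assumptions made for illustration.

```python
from typing import Any, List, Sequence


def to_fixed_width_lines(
    header: Sequence[str],
    rows: Sequence[Sequence[Any]],
    num_decimal_places: int = 9,
) -> List[str]:
    """Hypothetical helper: render CSV lines whose columns share one width."""

    def fmt(value: Any) -> str:
        # Floats are rendered with a fixed number of decimal places.
        if isinstance(value, float):
            return f"{value:.{num_decimal_places}f}"
        return str(value)

    # Render the header and every cell as text first.
    table = [[str(h) for h in header]] + [[fmt(v) for v in row] for row in rows]

    # Each column is as wide as its longest cell, header included.
    widths = [max(len(line[i]) for line in table) for i in range(len(header))]

    # Right-align (an assumption) so numbers line up on the right edge.
    return [
        ",".join(cell.rjust(width) for cell, width in zip(line, widths))
        for line in table
    ]


lines = to_fixed_width_lines(["id", "value"], [[1, 0.5], [20, 12.25]], num_decimal_places=2)
```

Because every line has the same length, the columns line up when the file is opened in a plain text editor.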
-
pipelinex.extras.datasets.pandas.histgram module
pipelinex.extras.datasets.pandas.pandas_cat_matrix module
-
class pipelinex.extras.datasets.pandas.pandas_cat_matrix.PandasCatMatrixDataSet(*args, describe_args={}, **kwargs)
Bases: pipelinex.extras.datasets.pandas.csv_local.CSVLocalDataSet
PandasDescribeDataSet saves output of df.describe.
-
__init__(*args, describe_args={}, **kwargs)
Creates a new instance of PandasCatMatrixDataSet pointing to a concrete filepath.
- Parameters:
  - args – Positional arguments for CSVLocalDataSet.
  - describe_args (Dict[str, Any]) – Arguments passed on to df.describe. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html for details.
  - kwargs – Keyword arguments for CSVLocalDataSet.
-
pipelinex.extras.datasets.pandas.pandas_describe module
-
class pipelinex.extras.datasets.pandas.pandas_describe.PandasDescribeDataSet(*args, describe_args={}, **kwargs)
Bases: pipelinex.extras.datasets.pandas.csv_local.CSVLocalDataSet
PandasDescribeDataSet saves output of df.describe.
-
__init__(*args, describe_args={}, **kwargs)
Creates a new instance of PandasDescribeDataSet pointing to a concrete filepath.
- Parameters:
  - args – Positional arguments for CSVLocalDataSet.
  - describe_args (Dict[str, Any]) – Arguments passed on to df.describe. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html for details.
  - kwargs – Keyword arguments for CSVLocalDataSet.
-