ParquetInput
Class
application_sdk.inputs.parquet
Inheritance chain: Input

ParquetInput reads data from Parquet files, supporting both single files and directories containing multiple Parquet files. It automatically handles local and object store paths. You cannot specify both a single-file path (ending with .parquet) and the file_names parameter.
ParquetInput inherits from the base Input class and provides specialized functionality for reading columnar data formats.

Methods (5)
__init__
__init__(self, path: str, chunk_size: int = 100000, buffer_size: int = 5000, file_names: Optional[List[str]] = None)
Initialize ParquetInput with a path to a Parquet file or directory. Supports local paths and object store paths (s3://, gs://, etc.).
Parameters
- path (str): Path to a Parquet file or directory. Supports local paths and object store paths (s3://, gs://, etc.)
- chunk_size (int): Number of rows per batch for pandas operations (default: 100000)
- buffer_size (int): Number of rows per batch for daft operations (default: 5000)
- file_names (Optional[List[str]]): List of specific file names to read from the directory
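The path/file_names constraint in a minimal sketch (the bucket path and file names below are placeholders, not values from the SDK):

from application_sdk.inputs import ParquetInput

# Valid: a directory path narrowed down to specific files
parquet_input = ParquetInput(
    path="s3://bucket/data/",
    file_names=["file1.parquet", "file2.parquet"]
)

# Not allowed: a single-file path ending in .parquet combined with file_names
# parquet_input = ParquetInput(path="s3://bucket/data/file1.parquet", file_names=["file2.parquet"])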
get_dataframe
async get_dataframe(self) -> pd.DataFrame
Reads all files matching the path and file_names filter and returns a single combined pandas DataFrame, built with pd.concat() and ignore_index=True. Column schemas must be compatible across files.
Returns
pd.DataFrame - Combined DataFrame from all specified files
get_batched_dataframe
async get_batched_dataframe(self) -> AsyncIterator[pd.DataFrame]
Reads Parquet files and yields batches as pandas DataFrames. Processes each file individually for memory efficiency and splits each file into chunks based on chunk_size; each batch is a separate DataFrame.
Returns
AsyncIterator[pd.DataFrame] - Iterator yielding batches of rows
get_daft_dataframe
async get_daft_dataframe(self) -> daft.DataFrame
Reads all specified Parquet files and returns a single combined daft DataFrame. Uses lazy evaluation for better performance and combines all specified files into a single DataFrame. Column schemas must be compatible across files. See the daft example under Usage Examples below.
Returns
daft.DataFrame - Combined daft DataFrame with lazy evaluation
get_batched_daft_dataframe
async get_batched_daft_dataframe(self) -> AsyncIterator[daft.DataFrame]
Reads Parquet files and yields batches as daft DataFrames, processing files individually for memory efficiency. Creates lazy DataFrames without loading all data into memory and yields chunks based on the buffer_size parameter. See the batched daft example under Usage Examples below.
Returns
AsyncIterator[daft.DataFrame] - Iterator yielding batches as daft DataFrames

Usage Examples
Single file
Read a single Parquet file
from application_sdk.inputs import ParquetInput
parquet_input = ParquetInput(path="data/users.parquet")
df = await parquet_input.get_dataframe()
Directory with all files
Read all Parquet files from a directory
parquet_input = ParquetInput(
    path="s3://bucket/data/",
    chunk_size=100000,
    buffer_size=5000
)
async for batch_df in parquet_input.get_batched_dataframe():
    # Process each batch
    pass
Directory with specific files
Read specific files from a directory
parquet_input = ParquetInput(
    path="s3://bucket/data/",
    file_names=["file1.parquet", "file2.parquet"],
    chunk_size=100000
)
df = await parquet_input.get_dataframe()
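Daft DataFrame
A minimal sketch of reading all files into a single lazily evaluated daft DataFrame (the bucket path is a placeholder)
parquet_input = ParquetInput(path="s3://bucket/data/")
daft_df = await parquet_input.get_daft_dataframe()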
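Batched daft DataFrames
A minimal sketch of streaming a directory as daft DataFrame batches sized by buffer_size (the bucket path is a placeholder)
parquet_input = ParquetInput(
    path="s3://bucket/data/",
    buffer_size=5000
)
async for batch_daft_df in parquet_input.get_batched_daft_dataframe():
    # Process each daft batch
    pass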
See also
- Inputs: Overview of all input classes and common usage patterns
- SQLQueryInput: Read data from SQL databases by executing SQL queries
- JsonInput: Read data from JSON files in JSONL format
- Application SDK README: Overview of the Application SDK and its components