ParquetInput reads data from Parquet files and supports both single files and directories containing multiple Parquet files. It handles local and object store paths automatically. A single file path (ending in .parquet) cannot be combined with the file_names parameter.

ParquetInput inherits from the base Input class and provides specialized functionality for reading columnar data formats.

ParquetInput

Class in application_sdk.inputs.parquet
Inheritance chain: Input

Methods (5)

__init__

__init__(self, path: str, chunk_size: int = 100000, buffer_size: int = 5000, file_names: Optional[List[str]] = None)
Initialize ParquetInput with a path to a Parquet file or directory. Supports local paths and object store paths (s3://, gs://, etc.).
Parameters
path (str, required)
Path to a Parquet file or directory. Supports local paths and object store paths (s3://, gs://, etc.).

chunk_size (int, optional)
Number of rows per batch for pandas operations (default: 100000).

buffer_size (int, optional)
Number of rows per batch for daft operations (default: 5000).

file_names (Optional[List[str]], optional)
List of specific file names to read from the directory.

get_dataframe

async
async get_dataframe(self) -> pd.DataFrame
Reads all Parquet files matching the path and file_names filter and returns a single combined pandas DataFrame. Files are combined with pd.concat() using ignore_index=True, so column schemas must be compatible across files.
Returns
pd.DataFrame - Combined DataFrame from all specified files

get_batched_dataframe

async
async get_batched_dataframe(self) -> AsyncIterator[pd.DataFrame]
Reads Parquet files and yields batches as pandas DataFrames. Each file is processed individually for memory efficiency and split into chunks of chunk_size rows; each batch is yielded as a separate DataFrame.
Returns
AsyncIterator[pd.DataFrame] - Iterator yielding batches of rows

get_daft_dataframe

async
async get_daft_dataframe(self) -> daft.DataFrame
Reads all specified Parquet files and returns a single combined daft DataFrame. Uses lazy evaluation for better performance; column schemas must be compatible across files.
Returns
daft.DataFrame - Combined daft DataFrame with lazy evaluation
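
Example (a minimal sketch, assuming an async calling context and the illustrative s3://bucket/data/ path used in the examples below):

from application_sdk.inputs import ParquetInput

parquet_input = ParquetInput(path="s3://bucket/data/")
daft_df = await parquet_input.get_daft_dataframe()

# The combined DataFrame is lazy; collect() materializes the results
result = daft_df.collect()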

get_batched_daft_dataframe

async
async get_batched_daft_dataframe(self) -> AsyncIterator[daft.DataFrame]
Reads Parquet files and yields batches as daft DataFrames, processing files individually for memory efficiency. Creates lazy DataFrames without loading all data into memory and yields chunks of buffer_size rows.
Returns
AsyncIterator[daft.DataFrame] - Iterator yielding batches as daft DataFrames
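
Example (a minimal sketch, assuming an async calling context; to_pandas() is daft's standard conversion to a pandas DataFrame):

parquet_input = ParquetInput(
    path="s3://bucket/data/",
    buffer_size=5000
)

async for daft_batch in parquet_input.get_batched_daft_dataframe():
    # Each batch is a lazy daft DataFrame of up to buffer_size rows
    pandas_batch = daft_batch.to_pandas()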

Usage Examples

Single file

Read a single Parquet file

from application_sdk.inputs import ParquetInput

parquet_input = ParquetInput(path="data/users.parquet")
df = await parquet_input.get_dataframe()

Directory with all files

Read all Parquet files from a directory

parquet_input = ParquetInput(
    path="s3://bucket/data/",
    chunk_size=100000,
    buffer_size=5000
)

async for batch_df in parquet_input.get_batched_dataframe():
    # Process each batch
    pass

Directory with specific files

Read specific files from a directory

parquet_input = ParquetInput(
    path="s3://bucket/data/",
    file_names=["file1.parquet", "file2.parquet"],
    chunk_size=100000
)

See also

  • Inputs: Overview of all input classes and common usage patterns
  • SQLQueryInput: Read data from SQL databases by executing SQL queries
  • JsonInput: Read data from JSON files in JSONL format
  • Application SDK README: Overview of the Application SDK and its components