ParquetInput
Class
application_sdk.inputs.parquet
Inheritance chain: Input

ParquetInput reads data from Parquet files, supporting both single files and directories containing multiple Parquet files. It automatically handles local and object store paths. You cannot specify both a single-file path (ending with .parquet) and the file_names parameter.
ParquetInput inherits from the base Input class and provides specialized functionality for reading columnar data formats.

Methods (5)
__init__
__init__(self, path: str, chunk_size: int = 100000, buffer_size: int = 5000, file_names: Optional[List[str]] = None)
Initialize ParquetInput with a path to a Parquet file or directory. Supports local paths and object store paths (s3://, gs://, etc.).
Parameters
- path (str): Path to a Parquet file or directory. Supports local paths and object store paths (s3://, gs://, etc.)
- chunk_size (int): Number of rows per batch for pandas operations (default: 100000)
- buffer_size (int): Number of rows per batch for daft operations (default: 5000)
- file_names (Optional[List[str]]): List of specific file names to read from the directory
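The path/file_names constraint in a minimal sketch (the bucket path and file names below are placeholders, not values from the SDK):

from application_sdk.inputs import ParquetInput

# Valid: a directory path narrowed down to specific files
parquet_input = ParquetInput(
    path="s3://bucket/data/",
    file_names=["file1.parquet", "file2.parquet"]
)

# Not allowed: a single-file path ending in .parquet combined with file_names
# parquet_input = ParquetInput(path="s3://bucket/data/file1.parquet", file_names=["file2.parquet"])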
get_dataframe
async get_dataframe(self) -> pd.DataFrame
Reads all files matching the path and file_names filter and returns a single combined pandas DataFrame, built with pd.concat() and ignore_index=True. Column schemas must be compatible across files.
Returns
pd.DataFrame - Combined DataFrame from all specified files
get_batched_dataframe
async get_batched_dataframe(self) -> AsyncIterator[pd.DataFrame]
Reads Parquet files and yields batches as pandas DataFrames. Processes each file individually for memory efficiency and splits each file into chunks based on chunk_size; each batch is a separate DataFrame.
Returns
AsyncIterator[pd.DataFrame] - Iterator yielding batches of rows
get_daft_dataframe
async get_daft_dataframe(self) -> daft.DataFrame
Reads all specified Parquet files and returns a single combined daft DataFrame. Uses lazy evaluation for better performance and combines all specified files into a single DataFrame. Column schemas must be compatible across files. See the daft example under Usage Examples below.
Returns
daft.DataFrame - Combined daft DataFrame with lazy evaluation
get_batched_daft_dataframe
async get_batched_daft_dataframe(self) -> AsyncIterator[daft.DataFrame]
Reads Parquet files and yields batches as daft DataFrames, processing files individually for memory efficiency. Creates lazy DataFrames without loading all data into memory and yields chunks based on the buffer_size parameter. See the batched daft example under Usage Examples below.
Returns
AsyncIterator[daft.DataFrame] - Iterator yielding batches as daft DataFrames

Usage Examples
Single file
Read a single Parquet file
from application_sdk.inputs import ParquetInput
parquet_input = ParquetInput(path="data/users.parquet")
df = await parquet_input.get_dataframe()
Directory with all files
Read all Parquet files from a directory
parquet_input = ParquetInput(
    path="s3://bucket/data/",
    chunk_size=100000,
    buffer_size=5000
)
async for batch_df in parquet_input.get_batched_dataframe():
    # Process each batch
    pass
Directory with specific files
Read specific files from a directory
parquet_input = ParquetInput(
    path="s3://bucket/data/",
    file_names=["file1.parquet", "file2.parquet"],
    chunk_size=100000
)
df = await parquet_input.get_dataframe()
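Daft DataFrame
A minimal sketch of reading all files into a single lazily evaluated daft DataFrame (the bucket path is a placeholder)
parquet_input = ParquetInput(path="s3://bucket/data/")
daft_df = await parquet_input.get_daft_dataframe()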
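Batched daft DataFrames
A minimal sketch of streaming a directory as daft DataFrame batches sized by buffer_size (the bucket path is a placeholder)
parquet_input = ParquetInput(
    path="s3://bucket/data/",
    buffer_size=5000
)
async for batch_daft_df in parquet_input.get_batched_daft_dataframe():
    # Process each daft batch
    pass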
See also
- Inputs: Overview of all input classes and common usage patterns
- SQLQueryInput: Read data from SQL databases by executing SQL queries
- JsonInput: Read data from JSON files in JSONL format
- Application SDK README: Overview of the Application SDK and its components