IcebergInput reads data from Apache Iceberg tables using daft and provides support for reading Iceberg table data as DataFrames. Note: get_batched_dataframe() and get_batched_daft_dataframe() are not implemented due to memory concerns with partitioning large Iceberg tables.
IcebergInput inherits from the base Input class and provides specialized functionality for reading from Apache Iceberg tables.
IcebergInput
Class
application_sdk.inputs.iceberg
Inheritance chain: Input
Methods (5)
__init__
__init__(self, table: Table, chunk_size: Optional[int] = 100000)
Initialize IcebergInput with a PyIceberg Table object. chunk_size is currently unused but reserved for future use.
Parameters
table (Table): PyIceberg Table object representing the Iceberg table
chunk_size (Optional[int]): Number of rows per batch (default: 100000). Currently unused but reserved for future use
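For example, the Table can be obtained from a PyIceberg catalog (a minimal sketch; the catalog name "default" and the table identifier "db.events" are placeholders, not part of the SDK):

from pyiceberg.catalog import load_catalog
from application_sdk.inputs import IcebergInput

# Load an Iceberg table from a configured PyIceberg catalog.
# "default" and "db.events" are placeholder names for illustration.
catalog = load_catalog("default")
iceberg_table = catalog.load_table("db.events")

# Wrap the table; chunk_size is currently unused but reserved for future use.
iceberg_input = IcebergInput(table=iceberg_table, chunk_size=100000)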
get_dataframe
async
async get_dataframe(self) -> pd.DataFrame
Reads data from the Iceberg table and returns it as a single pandas DataFrame. Internally calls get_daft_dataframe() and converts the result to pandas using to_pandas().
Returns
pd.DataFrame - Complete table data as a pandas DataFrame
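Because get_dataframe() is a coroutine, it must be awaited from an async context; a minimal sketch assuming iceberg_input is an already-constructed IcebergInput:

import asyncio

async def read_table(iceberg_input):
    # Read the full Iceberg table into memory as a pandas DataFrame.
    df = await iceberg_input.get_dataframe()
    print(len(df), "rows")
    return df

# In a standalone script, run the coroutine with:
# asyncio.run(read_table(iceberg_input))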
get_daft_dataframe
async
async get_daft_dataframe(self) -> daft.DataFrame
Reads data from the Iceberg table and returns it as a single daft DataFrame with lazy evaluation. Uses daft.read_iceberg() to read the table.
Returns
daft.DataFrame - Table data as a daft DataFrame with lazy evaluation
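A minimal sketch of deferring work with the daft DataFrame; the "event_type" column name is a hypothetical example and iceberg_input is an existing IcebergInput:

import daft

# Lazily read the table; no data is materialized at this point.
daft_df = await iceberg_input.get_daft_dataframe()

# Build a lazy filter; "event_type" is a placeholder column name.
filtered = daft_df.where(daft.col("event_type") == "login")

# Materialize only the filtered rows as pandas.
pdf = filtered.to_pandas()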
get_batched_dataframe
async
async get_batched_dataframe(self) -> AsyncIterator[pd.DataFrame]
Not implemented - Raises NotImplementedError. This method is intentionally not implemented because partitioning daft DataFrames using dataframe.into_partitions() loads all partitions into memory, which can cause out-of-memory issues for large tables.
Returns
NotImplementedError - Raises NotImplementedError
get_batched_daft_dataframe
async
async get_batched_daft_dataframe(self) -> AsyncIterator[daft.DataFrame]
Not implemented - Raises NotImplementedError. This method is intentionally not implemented for the same reason as get_batched_dataframe() - memory concerns with partitioning large Iceberg tables.
Returns
NotImplementedError - Raises NotImplementedError
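Because the batched readers are not implemented, large tables are better handled through the lazy daft path; a minimal sketch that avoids collecting the table as pandas (write_parquet is a daft method used for illustration, and "output/events" is a placeholder path):

# The batched readers raise NotImplementedError, so for large tables
# prefer the lazy daft DataFrame and let daft manage execution.
daft_df = await iceberg_input.get_daft_dataframe()

# Write results out with daft instead of collecting everything into pandas.
daft_df.write_parquet("output/events")  # placeholder output location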
Usage Examples
Read Iceberg table
Initialize with an Iceberg table object and read data
from application_sdk.inputs import IcebergInput
from pyiceberg.table import Table
iceberg_input = IcebergInput(
    table=iceberg_table,  # pyiceberg.table.Table instance
    chunk_size=100000
)
df = await iceberg_input.get_dataframe()
See also
- Inputs: Overview of all input classes and common usage patterns
- SQLQueryInput: Read data from SQL databases by executing SQL queries
- ParquetInput: Read data from Parquet files, supporting single files and directories
- Application SDK README: Overview of the Application SDK and its components