
IcebergInput reads data from Apache Iceberg tables using daft and exposes the table contents as DataFrames. Note: get_batched_dataframe() and get_batched_daft_dataframe() are intentionally not implemented, because partitioning large Iceberg tables can cause out-of-memory issues.

IcebergInput inherits from the base Input class and provides specialized functionality for reading from Apache Iceberg tables.

IcebergInput

Class
📁application_sdk.inputs.iceberg
Inheritance chain:
Input

Reads data from an Apache Iceberg table using daft and exposes the contents as pandas or daft DataFrames. The batched readers, get_batched_dataframe() and get_batched_daft_dataframe(), are intentionally not implemented due to memory concerns when partitioning large tables.

Methods (5)

__init__

__init__(self, table: Table, chunk_size: Optional[int] = 100000)
Initialize IcebergInput with a PyIceberg Table object. chunk_size is accepted but currently unused; it is reserved for future use.
Parameters
table: Table
Required
PyIceberg Table object representing the Iceberg table
chunk_size: Optional[int]
Optional
Number of rows per batch (default: 100000). Currently unused but reserved for future use

get_dataframe

async
async get_dataframe(self) -> pd.DataFrame
Reads the entire Iceberg table and returns it as a single pandas DataFrame. Internally calls get_daft_dataframe() and materializes the result with to_pandas().
Returns
pd.DataFrame - Complete table data as pandas DataFrame
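The delegation described above (read lazily, then materialize) can be sketched with a simplified stand-in. This is not the SDK's actual code; FakeLazyFrame and SketchInput are hypothetical names, and a plain list stands in for the pandas DataFrame:

```python
import asyncio

class FakeLazyFrame:
    """Stand-in for a daft DataFrame: holds rows, materializes on to_pandas()."""
    def __init__(self, rows):
        self.rows = rows

    def to_pandas(self):
        return list(self.rows)  # pretend this is a pandas DataFrame

class SketchInput:
    """Simplified stand-in mirroring IcebergInput's get_dataframe() shape."""
    def __init__(self, rows):
        self._rows = rows

    async def get_daft_dataframe(self):
        return FakeLazyFrame(self._rows)

    async def get_dataframe(self):
        # Same pattern as IcebergInput: delegate to the lazy reader,
        # then materialize the result.
        lazy = await self.get_daft_dataframe()
        return lazy.to_pandas()

print(asyncio.run(SketchInput([1, 2, 3]).get_dataframe()))
```

The real method returns a pd.DataFrame; the pattern of awaiting the lazy reader and then converting is the same.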

get_daft_dataframe

async
async get_daft_dataframe(self) -> daft.DataFrame
Reads data from the Iceberg table and returns it as a lazily evaluated daft DataFrame. Uses daft.read_iceberg() to read the table.
Returns
daft.DataFrame - Table data as daft DataFrame with lazy evaluation
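Because the returned daft DataFrame is lazy, transformations can be chained before any data is read. A hedged sketch, assuming an initialized iceberg_input, an async context, and a hypothetical "year" column on the table:

```python
# Nothing executes until materialization (e.g. to_pandas() or collect()).
df = await iceberg_input.get_daft_dataframe()  # lazy: no rows read yet
df = df.where(df["year"] == 2024)              # still lazy
result = df.to_pandas()                        # reads and materializes
```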

get_batched_dataframe

async
async get_batched_dataframe(self) -> AsyncIterator[pd.DataFrame]
Not implemented - Raises NotImplementedError. This method is intentionally not implemented because partitioning daft DataFrames using dataframe.into_partitions() loads all partitions into memory, which can cause out-of-memory issues for large tables.
Raises
NotImplementedError - Always raised; batched pandas reads are not supported for Iceberg tables

get_batched_daft_dataframe

async
async get_batched_daft_dataframe(self) -> AsyncIterator[daft.DataFrame]
Not implemented - Raises NotImplementedError. This method is intentionally not implemented for the same reason as get_batched_dataframe() - memory concerns with partitioning large Iceberg tables.
Raises
NotImplementedError - Always raised, for the same memory reasons as get_batched_dataframe()

Usage Examples

Read Iceberg table

Initialize with Iceberg table object and read data

from application_sdk.inputs import IcebergInput
from pyiceberg.table import Table

iceberg_input = IcebergInput(
    table=iceberg_table,  # pyiceberg.table.Table instance
    chunk_size=100000,
)

df = await iceberg_input.get_dataframe()
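The example above assumes an existing iceberg_table. One common way to obtain one is via pyiceberg's catalog API; "default" and "db.orders" below are placeholder names for your environment:

```python
from pyiceberg.catalog import load_catalog

# Resolves catalog configuration (e.g. from ~/.pyiceberg.yaml or env vars)
catalog = load_catalog("default")
iceberg_table = catalog.load_table("db.orders")  # "namespace.table" identifier
```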

See also

  • Inputs: Overview of all input classes and common usage patterns
  • SQLQueryInput: Read data from SQL databases by executing SQL queries
  • ParquetInput: Read data from Parquet files, supporting single files and directories
  • Application SDK README: Overview of the Application SDK and its components