
IcebergInput reads data from Apache Iceberg tables using daft and exposes the table contents as DataFrames. Note: get_batched_dataframe() and get_batched_daft_dataframe() are intentionally not implemented, because partitioning large Iceberg tables can cause out-of-memory issues.

IcebergInput inherits from the base Input class and provides specialized functionality for reading from Apache Iceberg tables.

IcebergInput

Class
📁application_sdk.inputs.iceberg
Inheritance chain:
Input

Reads data from an Apache Iceberg table using daft and exposes the contents as pandas or daft DataFrames. The batched readers, get_batched_dataframe() and get_batched_daft_dataframe(), are intentionally not implemented due to memory concerns when partitioning large tables.

Methods (5)

__init__

__init__(self, table: Table, chunk_size: Optional[int] = 100000)
Initialize IcebergInput with a PyIceberg Table object. chunk_size is accepted but currently unused; it is reserved for future use.
Parameters
table: Table
Required
PyIceberg Table object representing the Iceberg table
chunk_size: Optional[int]
Optional
Number of rows per batch (default: 100000). Currently unused but reserved for future use

get_dataframe

async
async get_dataframe(self) -> pd.DataFrame
Reads the entire Iceberg table and returns it as a single pandas DataFrame. Internally calls get_daft_dataframe() and materializes the result with to_pandas().
Returns
pd.DataFrame - Complete table data as pandas DataFrame
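The delegation described above (read lazily, then materialize) can be sketched with a simplified stand-in. This is not the SDK's actual code; FakeLazyFrame and SketchInput are hypothetical names, and a plain list stands in for the pandas DataFrame:

```python
import asyncio

class FakeLazyFrame:
    """Stand-in for a daft DataFrame: holds rows, materializes on to_pandas()."""
    def __init__(self, rows):
        self.rows = rows

    def to_pandas(self):
        return list(self.rows)  # pretend this is a pandas DataFrame

class SketchInput:
    """Simplified stand-in mirroring IcebergInput's get_dataframe() shape."""
    def __init__(self, rows):
        self._rows = rows

    async def get_daft_dataframe(self):
        return FakeLazyFrame(self._rows)

    async def get_dataframe(self):
        # Same pattern as IcebergInput: delegate to the lazy reader,
        # then materialize the result.
        lazy = await self.get_daft_dataframe()
        return lazy.to_pandas()

print(asyncio.run(SketchInput([1, 2, 3]).get_dataframe()))
```

The real method returns a pd.DataFrame; the pattern of awaiting the lazy reader and then converting is the same.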

get_daft_dataframe

async
async get_daft_dataframe(self) -> daft.DataFrame
Reads data from the Iceberg table and returns it as a lazily evaluated daft DataFrame. Uses daft.read_iceberg() to read the table.
Returns
daft.DataFrame - Table data as daft DataFrame with lazy evaluation
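Because the returned daft DataFrame is lazy, transformations can be chained before any data is read. A hedged sketch, assuming an initialized iceberg_input, an async context, and a hypothetical "year" column on the table:

```python
# Nothing executes until materialization (e.g. to_pandas() or collect()).
df = await iceberg_input.get_daft_dataframe()  # lazy: no rows read yet
df = df.where(df["year"] == 2024)              # still lazy
result = df.to_pandas()                        # reads and materializes
```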

get_batched_dataframe

async
async get_batched_dataframe(self) -> AsyncIterator[pd.DataFrame]
Not implemented - Raises NotImplementedError. This method is intentionally not implemented because partitioning daft DataFrames using dataframe.into_partitions() loads all partitions into memory, which can cause out-of-memory issues for large tables.
Raises
NotImplementedError - Always raised; batched pandas reads are not supported for Iceberg tables

get_batched_daft_dataframe

async
async get_batched_daft_dataframe(self) -> AsyncIterator[daft.DataFrame]
Not implemented - Raises NotImplementedError. This method is intentionally not implemented for the same reason as get_batched_dataframe() - memory concerns with partitioning large Iceberg tables.
Raises
NotImplementedError - Always raised, for the same memory reasons as get_batched_dataframe()

Usage Examples

Read Iceberg table

Initialize with Iceberg table object and read data

from application_sdk.inputs import IcebergInput
from pyiceberg.table import Table

iceberg_input = IcebergInput(
    table=iceberg_table,  # pyiceberg.table.Table instance
    chunk_size=100000,
)

df = await iceberg_input.get_dataframe()
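The example above assumes an existing iceberg_table. One common way to obtain one is via pyiceberg's catalog API; "default" and "db.orders" below are placeholder names for your environment:

```python
from pyiceberg.catalog import load_catalog

# Resolves catalog configuration (e.g. from ~/.pyiceberg.yaml or env vars)
catalog = load_catalog("default")
iceberg_table = catalog.load_table("db.orders")  # "namespace.table" identifier
```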

See also

  • Inputs: Overview of all input classes and common usage patterns
  • SQLQueryInput: Read data from SQL databases by executing SQL queries
  • ParquetInput: Read data from Parquet files, supporting single files and directories
  • Application SDK README: Overview of the Application SDK and its components