JsonInput reads data from JSON files, supporting both single files and directories containing multiple JSON files. Files are parsed as JSONL (JSON Lines), where each line is a separate JSON object. A single file path (ending with .json) and the file_names parameter are mutually exclusive; specify one or the other.

JsonInput inherits from the base Input class and provides specialized functionality for reading JSON- and JSONL-formatted data.

JsonInput

Class
Module: application_sdk.inputs.json
Inheritance chain:
Input

Reads data from JSON files: a single file or a directory containing multiple JSON files. Files are parsed as JSONL (JSON Lines), one JSON object per line. A single file path (ending with .json) and the file_names parameter are mutually exclusive.

Methods (5)

__init__

__init__(self, path: str, file_names: Optional[List[str]] = None, chunk_size: int = 100000)
Initialize JsonInput with path to JSON file or directory. Supports local paths and object store paths.
Parameters
path: str (required)
Path to a JSON file or directory. Supports local paths and object store paths.
file_names: Optional[List[str]] (optional)
List of specific file names to read from the directory.
chunk_size: int (optional, default: 100000)
Number of rows per batch.
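
A minimal construction sketch; the directory path and file names below are hypothetical, chosen only to illustrate the parameters:

from application_sdk.inputs import JsonInput

# file_names restricts reading to the listed files inside the directory;
# it cannot be combined with a single-file path ending in .json.
json_input = JsonInput(
    path="data/events/",
    file_names=["2024-01.json", "2024-02.json"],
    chunk_size=50000,  # rows per batch for the batched readers
)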

get_dataframe

async
async get_dataframe(self) -> pd.DataFrame
Reads all specified JSON files and returns a single combined pandas DataFrame. Files are parsed as JSONL (lines=True), so each line in a file becomes a row in the DataFrame; the per-file frames are combined with pd.concat() using ignore_index=True.
Returns
pd.DataFrame - Combined DataFrame from all specified JSON files
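
Because files are parsed as JSONL, each line must be a standalone JSON object. A short sketch, assuming a hypothetical users.json written in that layout:

# users.json contains one JSON object per line:
#   {"id": 1, "name": "Ada"}
#   {"id": 2, "name": "Grace"}

from application_sdk.inputs import JsonInput

json_input = JsonInput(path="users.json")
df = await json_input.get_dataframe()
# df has one row per line: here, two rows with columns "id" and "name".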

get_batched_dataframe

async
async get_batched_dataframe(self) -> AsyncIterator[pd.DataFrame]
Reads JSON files and yields batches as pandas DataFrames. Each file is read with pd.read_json() using the chunksize parameter; files are processed sequentially, and each chunk is yielded as a separate DataFrame.
Returns
AsyncIterator[pd.DataFrame] - Iterator yielding batches of rows
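
A sketch of consuming the batches, assuming a hypothetical directory of JSONL files; each yielded chunk holds at most chunk_size rows:

from application_sdk.inputs import JsonInput

json_input = JsonInput(path="data/events/", chunk_size=10000)

total_rows = 0
async for batch_df in json_input.get_batched_dataframe():
    # Each batch is an independent pandas DataFrame of up to 10000 rows.
    total_rows += len(batch_df)
print(total_rows)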

get_daft_dataframe

async
async get_daft_dataframe(self) -> daft.DataFrame
Reads all specified JSON files and returns a single combined daft DataFrame.
Returns
daft.DataFrame - Combined daft DataFrame from all specified files
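
A minimal sketch, assuming the same hypothetical directory; the result is one daft DataFrame covering every discovered file:

from application_sdk.inputs import JsonInput

json_input = JsonInput(path="data/events/")
daft_df = await json_input.get_daft_dataframe()

# Standard daft operations apply to the combined frame.
daft_df.show()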

get_batched_daft_dataframe

async
async get_batched_daft_dataframe(self) -> AsyncIterator[daft.DataFrame]
Reads JSON files and yields each discovered file as a separate daft DataFrame batch. Files are processed individually, using the _chunk_size parameter for internal chunking.
Returns
AsyncIterator[daft.DataFrame] - Iterator yielding batches as daft DataFrames
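
A sketch of per-file processing, assuming hypothetical files; each iteration yields one file's contents as its own daft DataFrame:

from application_sdk.inputs import JsonInput

json_input = JsonInput(path="data/events/")

async for file_df in json_input.get_batched_daft_dataframe():
    # One daft DataFrame per discovered file; count_rows() materializes the count.
    print(file_df.count_rows())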

Usage Examples

Single file

Read a single JSON file

from application_sdk.inputs import JsonInput

json_input = JsonInput(path="data/users.json")
df = await json_input.get_dataframe()

Directory with all files

Read all JSON files from a directory

json_input = JsonInput(
    path="s3://bucket/data/",
    chunk_size=100000
)

async for batch_df in json_input.get_batched_dataframe():
    # Process each batch
    pass

See also

  • Inputs: Overview of all input classes and common usage patterns
  • SQLQueryInput: Read data from SQL databases by executing SQL queries
  • ParquetInput: Read data from Parquet files, supporting single files and directories
  • Application SDK README: Overview of the Application SDK and its components