Enrich Glue table statistics from S3 Inventory Files App
The Enrich Glue table statistics from S3 Inventory Files app updates Glue table statistics in Atlan by reading S3 inventory data. It calculates table size, object count, and last updated timestamp from inventory files so that Glue table metadata in Atlan reflects what S3 contains.
The app discovers Glue tables that have an externalLocation pointing to S3, reads S3 inventory parquet files grouped by bucket and key prefix, filters out directory entries, and computes size, object count, and the most recent modified time for each table. It then compares these values to the current metadata in Atlan and updates only the tables where statistics have changed.
This reference explains how to configure the app in the product UI and how the enrichment behaves once it runs.
Access
The Enrich Glue table statistics from S3 Inventory Files app isn't enabled by default. To use this app, contact Atlan support and request it be added to your tenant.
Once enabled, only admins or users with workflow permissions can configure and run the app.
Existing Glue connection
Specifies the existing AWS Glue connection in Atlan that contains the tables you want to enrich.
The workflow reads Glue catalog tables from this connection, matches them to S3 inventory data, and writes updated statistics back to the same Glue assets in Atlan. Choose the connection that represents the Glue environment where your external tables are registered.
Inventory files URI
Defines the S3 location where the app looks for S3 inventory parquet files.
Provide the URI or prefix that points to the folder where S3 publishes inventory files for the buckets that store your Glue table data. The app scans this location, picks the latest inventory files, and uses them to calculate table statistics.
The inventory data must be in parquet format and must include, at minimum:
- The S3 bucket name for each object.
- The full S3 key (path) for each object.
- The object size in bytes.
- The last modified timestamp for each object.
The app automatically selects the most recent inventory file per directory when multiple inventory files are available under the configured prefix.
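The selection rule above can be sketched as follows. This is a minimal illustration, not the app's implementation: it assumes "most recent" means the newest last-modified time among the files in each directory, and the function name `latest_file_per_directory` and the `(key, last_modified)` input shape are hypothetical.

```python
from datetime import datetime
from posixpath import dirname


def latest_file_per_directory(files):
    """Return {directory: key} with the newest file for each directory.

    `files` is an iterable of (key, last_modified) pairs. Treating
    "most recent" as the newest last_modified value is an assumption.
    """
    latest = {}
    for key, modified in files:
        directory = dirname(key)
        # Keep the file with the greatest timestamp seen so far.
        if directory not in latest or modified > latest[directory][1]:
            latest[directory] = (key, modified)
    return {directory: key for directory, (key, _) in latest.items()}
```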
When inventory files are processed, Glue tables are matched to inventory rows using the table's externalLocation:
- The S3 bucket is extracted from the externalLocation and compared to the bucket value in inventory.
- The S3 prefix is extracted from the externalLocation and compared to the key path in inventory.
Only inventory objects that match both the bucket and prefix for a table are included when calculating size, object count, and last updated timestamp.
Example:
s3://my-inventory-bucket/inventory/analytics/
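The matching and aggregation described in this section can be sketched in Python as follows. This is an illustration under stated assumptions, not the app's code: inventory rows are assumed to be already loaded from parquet into dictionaries with the columns bucket, key, size, and last_modified_date. The helper names `split_location` and `table_stats` are hypothetical, and adding a trailing slash to the prefix (so that `sales` does not also match `sales_archive/`) is an assumption about how prefixes are compared.

```python
from datetime import datetime
from urllib.parse import urlparse


def split_location(external_location):
    """Split an s3://bucket/prefix URI into (bucket, prefix)."""
    parsed = urlparse(external_location)
    prefix = parsed.path.lstrip("/")
    # Assumption: compare on a slash-terminated prefix so sibling
    # folders that share a name stem are not matched.
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    return parsed.netloc, prefix


def table_stats(external_location, inventory_rows):
    """Compute size, object count, and last update for one table."""
    bucket, prefix = split_location(external_location)
    matched = [
        row for row in inventory_rows
        if row["bucket"] == bucket
        and row["key"].startswith(prefix)
        and not row["key"].endswith("/")  # skip directory entries
    ]
    if not matched:
        return None  # no matching inventory data: table is skipped
    return {
        "sizeBytes": sum(row["size"] for row in matched),
        "tableObjectCount": len(matched),
        "sourceUpdatedAt": max(row["last_modified_date"] for row in matched),
    }
```

Only rows that match both the bucket and the prefix contribute to the result, and a table with no matching rows yields no statistics.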
Glue target filters
Controls which Glue tables are eligible for enrichment.
Use this configuration to narrow the scope of the app so that it only processes a subset of tables from the selected Glue connection, instead of all tables. This helps reduce runtime and lets you focus enrichment on specific databases or table groups that rely on S3 inventory.
You can use filters to:
- Limit enrichment to specific Glue databases.
- Target tables whose names follow a prefix or pattern (such as sales_ or _ext).
- Exclude temporary or staging tables that don't need inventory-based statistics.
Example: filter by database
Include only tables from database: analytics
This configuration enriches statistics only for tables in the analytics Glue database, even if the connection contains other databases such as raw or sandbox.
Example: filter by table name pattern
Include tables with name prefix: sales_
This configuration restricts enrichment to tables whose names start with sales_, such as sales_orders or sales_invoices, and skips unrelated tables in the same database.
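The effect of the two example filters can be illustrated with a short Python sketch. The filters themselves are configured in the product UI. The function `select_tables` and the dictionary shape used for tables here are hypothetical and only show which tables remain eligible.

```python
def select_tables(tables, database=None, name_prefix=None):
    """Return the tables that pass the optional database and prefix filters."""
    selected = []
    for table in tables:
        if database is not None and table["database"] != database:
            continue  # outside the chosen Glue database
        if name_prefix is not None and not table["name"].startswith(name_prefix):
            continue  # name does not start with the chosen prefix
        selected.append(table)
    return selected
```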
Crawl last updated timestamp only
Determines whether the app updates only the last updated timestamp or all available statistics.
- When selected, the app updates only the sourceUpdatedAt value from S3 inventory and leaves existing size and object count statistics unchanged.
- When not selected, the app updates size, object count, and last updated timestamp together.
Use this option when another process (such as the AWS Glue Crawler) already maintains size and object count, and you only want to refresh the last modified timestamp from S3 inventory.
Enriched statistics
The enrichment app updates the following table statistics from S3 inventory data:
| Atlan property | Description | Source |
|---|---|---|
| sizeBytes | Total size of all objects in the table's S3 location, in bytes | Sum of size column from S3 inventory |
| tableObjectCount | Number of objects (files) in the table's S3 location | Count of objects from S3 inventory |
| sourceUpdatedAt | Most recent modification timestamp across all objects | Maximum last_modified_date from S3 inventory |
When Crawl last updated timestamp only is enabled, only sourceUpdatedAt is updated. When it's disabled, all three properties can be updated together.
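A minimal sketch of how the option and the change comparison might combine is shown below. It assumes the current Atlan values and the computed values are available as plain dictionaries keyed by the property names in the table. The function name `build_update` and the `timestamp_only` parameter are hypothetical.

```python
ALL_FIELDS = ("sizeBytes", "tableObjectCount", "sourceUpdatedAt")


def build_update(current, computed, timestamp_only=False):
    """Return only the properties that are in scope and have changed."""
    fields = ("sourceUpdatedAt",) if timestamp_only else ALL_FIELDS
    return {
        field: computed[field]
        for field in fields
        if current.get(field) != computed[field]
    }
```

An empty result means the table's statistics are unchanged and no update is needed.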
Limitations
- Tables without an externalLocation property are skipped.
- Tables with S3 locations that have no matching inventory data are skipped.
- Directory entries (keys ending with /) are excluded from statistics calculations.
- The app processes inventory files in batches and updates tables in batches of 20.
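As an illustration of the batch size mentioned above, the following Python sketch groups pending table updates into chunks of 20. The helper `batched` is hypothetical and only shows the grouping, not how the app submits each batch.

```python
from itertools import islice


def batched(items, size=20):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

With 45 tables to update, this produces batches of 20, 20, and 5.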
See also
- Cataloging and analyzing your data with S3 Inventory: Official S3 Inventory user guide.
- What does Atlan crawl from AWS Glue?: Complete reference for all metadata crawled from AWS Glue.