Lineage freshness strategies
Atlan reads lineage data from the same Databricks system tables that Databricks uses, but filters it to show only lineage relationships that are still valid. This filtering removes outdated relationships that are no longer relevant, which can make the lineage in Atlan easier to interpret than native Databricks lineage.
Why Atlan filters lineage
Databricks system tables contain all lineage relationships that have ever occurred, including historical relationships from tables that may have been recreated or restructured. While this historical view can be useful for auditing, it can make it harder to understand how data actually flows through your current data transformations.
Atlan filters this lineage data to focus on relationships that are valid for your current data structures. This filtering provides a cleaner, more useful view of your data flow by:
- Removing outdated relationships: When tables are recreated, Atlan removes lineage from previous versions that no longer exist.
- Focusing on valid transformations: Atlan shows only relationships that reflect your current data transformations, not historical operations on restructured tables.
- Maintaining relevance automatically: Atlan continuously filters lineage to keep it relevant as your data structures change.
Data filtering approaches
Atlan analyzes metadata from Databricks system tables to construct lineage:
- system.query.history: Contains statement type and statement text
- system.access.table_lineage: Contains table-level lineage with event timestamps
- system.access.column_lineage: Contains column-level lineage with event timestamps
How Atlan filters this data depends on what information is available in these tables. Atlan uses different filtering approaches based on whether statement type information is available in system.query.history.
When statement types are available
When Databricks includes statement type information in system.query.history, Atlan uses event-driven filtering to process lineage data:
- Table recreation resets lineage: When Atlan detects a CREATE OR REPLACE command, it removes pre-existing lineage for that table and starts fresh, since the table structure has changed.
- Pre-DDL queries are filtered out: Atlan excludes queries that executed before the last Data Definition Language (DDL) operation, as these relationships may no longer be valid.
- Post-DDL DML is included: Atlan includes Data Manipulation Language (DML) operations (INSERT, UPDATE, DELETE, MERGE) that occur after the last DDL, as these represent valid data flow.
This filtering ensures lineage reflects valid relationships for your data transformations, not historical operations on tables that may have been completely restructured. Atlan can precisely identify when tables are recreated and filter accordingly, providing accurate lineage that reflects your current data structures.
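The three rules above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not Atlan's implementation: the `LineageEvent` record, the `RECREATE_TYPES` labels, and the choice to keep the recreating statement's own lineage are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical, simplified record: one lineage event already joined with the
# statement type of the query that produced it.
@dataclass(frozen=True)
class LineageEvent:
    source: str
    target: str
    event_time: datetime
    statement_type: str  # illustrative labels, not actual Databricks values

# Assumption: statement types that mean "the target table was recreated".
RECREATE_TYPES = {"CREATE_OR_REPLACE"}

def filter_event_driven(events):
    """Keep only events at or after the last recreation of their target table."""
    # Pass 1: find the most recent recreation time for each target table.
    last_recreated = {}
    for event in events:
        if event.statement_type in RECREATE_TYPES:
            previous = last_recreated.get(event.target)
            if previous is None or event.event_time > previous:
                last_recreated[event.target] = event.event_time

    # Pass 2: drop events that predate the recreation. Tables that were never
    # recreated keep all of their events.
    kept = []
    for event in events:
        cutoff = last_recreated.get(event.target)
        if cutoff is None or event.event_time >= cutoff:
            kept.append(event)
    return kept
```

In this sketch, an INSERT into a table on January 1 followed by a CREATE OR REPLACE of the same table on January 10 leaves only the January 10 event and anything after it.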
For the most accurate lineage, use Databricks serverless compute for SQL transformations whenever possible. This provides Atlan with the statement type metadata needed for intelligent, event-driven lineage management.
When statement types aren't available
When statement type information isn't available in system.query.history (which can occur with certain compute types), Atlan uses time-based filtering. For more details about when statement types are available, see the Databricks documentation.
Without statement types, Atlan can't determine when a table was recreated, whether a query was DDL or DML, or the precise moment at which lineage should reset. Instead, it filters the lineage data from Databricks based on timestamps:
- New lineage triggers evaluation: When Atlan discovers new lineage relationships for an asset, it examines the age of all input relationships from the Databricks lineage data.
- Time window filtering: Atlan removes lineage from inputs older than a configurable threshold. Only lineage within this time window is retained.
- Activity-driven updates: This cleanup only happens when new lineage is detected. If an asset receives no updates, its existing lineage persists unchanged.
- Natural self-maintenance: The result is that lineage naturally stays relevant:
  - Frequently updated lineage relationships remain active
  - Irrelevant or stale lineage gradually disappears
  - You see valid relationships for your data
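As a rough illustration of the time-based steps above, the following Python sketch evaluates the inputs of a single asset. It is not Atlan's implementation: the `InputEdge` record, the `refresh_inputs` function, and measuring age from a caller-supplied `now` are assumptions made for the example, and the window length is a placeholder rather than a documented default.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical record: one input relationship of an asset, with the time the
# relationship was most recently observed in the lineage data.
@dataclass(frozen=True)
class InputEdge:
    source: str
    target: str
    last_seen: datetime

def refresh_inputs(existing, newly_seen, window, now):
    """Return the input edges retained for one asset."""
    # Activity-driven: with no new lineage, nothing is evaluated or removed.
    if not newly_seen:
        return list(existing)

    # Merge old and new edges, keeping the latest timestamp per relationship.
    merged = {}
    for edge in [*existing, *newly_seen]:
        key = (edge.source, edge.target)
        current = merged.get(key)
        if current is None or edge.last_seen > current.last_seen:
            merged[key] = edge

    # Time window filtering: drop inputs older than the threshold.
    # Assumption: age is measured relative to `now`.
    cutoff = now - window
    return [edge for edge in merged.values() if edge.last_seen >= cutoff]
```

Called with an empty `newly_seen` list, the function returns the existing edges untouched, which mirrors how stale lineage can persist on assets that receive no updates.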
While this approach maintains relevance and provides filtered lineage, it has some constraints:
- Can't detect table recreation events precisely
- May retain some irrelevant lineage for up to the configured time window
- Relies on activity patterns rather than explicit DDL events
Even with these constraints, you still get filtered lineage that focuses on relevant relationships rather than all historical relationships.
The time window for lineage retention can be adjusted to match your data refresh patterns and business needs, which may improve accuracy. Contact your Atlan administrator to tune this setting for your workspace.
See also
- Atlan vs Databricks lineage: Frequently asked questions about how Atlan lineage differs from Databricks native lineage.
- Extract lineage and usage from Databricks: Configure lineage and usage extraction for your Databricks connection.
- What does Atlan crawl from Databricks: Learn about the Databricks assets and metadata that Atlan discovers.