Lineage freshness strategies

Atlan reads lineage data from the same Databricks system tables that Databricks uses, but filters it to show only lineage relationships that are still valid. Removing relationships that are no longer relevant keeps the lineage focused on how data flows today, which can make it easier to read than the unfiltered Databricks native lineage.

Why Atlan filters lineage

Databricks system tables contain all lineage relationships that have ever occurred, including historical relationships from tables that may have been recreated or restructured. While this historical view can be useful for auditing, it can make it harder to understand how data actually flows through your current data transformations.

Atlan filters this lineage data to focus on relationships that are valid for your current data structures. This filtering provides a cleaner, more useful view of your data flow by:

  • Removing outdated relationships: When tables are recreated, Atlan removes lineage from previous versions that no longer exist.
  • Focusing on valid transformations: Atlan shows only relationships that reflect your current data transformations, not historical operations on restructured tables.
  • Maintaining relevance automatically: Atlan continuously filters lineage to keep it relevant as your data structures change.

Data filtering approaches

Atlan analyzes metadata from Databricks system tables to construct lineage:

  • system.query.history: Contains statement type and statement text
  • system.access.table_lineage: Contains table-level lineage with event timestamps
  • system.access.column_lineage: Contains column-level lineage with event timestamps

How Atlan filters this data depends on whether statement type information is available in system.query.history.
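As a rough illustration of that dependency, the sketch below attaches a statement type to each lineage row and then separates rows that have one from rows that don't. It is not Atlan's implementation: `LineageRow`, `attach_statement_types`, `split_by_availability`, and the use of a `statement_id` as the join key are all assumptions made for this example.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class LineageRow:
    """Hypothetical, simplified view of one table-level lineage row."""
    source_table: str
    target_table: str
    event_time: datetime
    statement_id: Optional[str] = None    # assumed join key; may be absent
    statement_type: Optional[str] = None  # filled in from query history when known


def attach_statement_types(
    rows: List[LineageRow], statement_types: Dict[str, str]
) -> List[LineageRow]:
    """Copy the statement type onto each row whose statement_id is known."""
    return [
        replace(r, statement_type=statement_types.get(r.statement_id))
        if r.statement_id is not None
        else r
        for r in rows
    ]


def split_by_availability(
    rows: List[LineageRow],
) -> Tuple[List[LineageRow], List[LineageRow]]:
    """Separate rows that carry a statement type from rows that do not."""
    with_type = [r for r in rows if r.statement_type is not None]
    without_type = [r for r in rows if r.statement_type is None]
    return with_type, without_type
```

Only rows in the first group carry enough information for filtering that relies on statement types.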

When Databricks includes statement type information in system.query.history, Atlan uses event-driven filtering to process lineage data:

  • Table recreation resets lineage: When Atlan detects a CREATE OR REPLACE command, it removes pre-existing lineage for that table and starts fresh, since the table structure has changed.

  • Pre-DDL queries are filtered out: Atlan excludes queries that executed before the last Data Definition Language (DDL) operation, as these relationships may no longer be valid.

  • Post-DDL DML is included: Atlan includes Data Manipulation Language (DML) operations (INSERT, UPDATE, DELETE, MERGE) that occur after the last DDL, as these represent valid data flow.

Because statement types let Atlan identify exactly when a table was recreated, this filtering keeps lineage aligned with your current data structures rather than with historical operations on tables that may have been completely restructured.
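The three rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Atlan's actual logic: the `LineageEvent` type, the `filter_valid_lineage` function, and the statement type labels are hypothetical, and the sketch assumes that the recreating statement's own lineage is kept (events at or after the last DDL timestamp survive).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, List

# Hypothetical labels; the actual statement type values may differ.
DDL_TYPES = frozenset({"CREATE_OR_REPLACE"})


@dataclass(frozen=True)
class LineageEvent:
    """Hypothetical lineage event already joined with its statement type."""
    source_table: str
    target_table: str
    event_time: datetime
    statement_type: str


def filter_valid_lineage(events: Iterable[LineageEvent]) -> List[LineageEvent]:
    """Keep only events at or after the last DDL on each target table."""
    events = list(events)

    # Find the most recent DDL timestamp per target table.
    last_ddl: Dict[str, datetime] = {}
    for e in events:
        if e.statement_type in DDL_TYPES:
            prev = last_ddl.get(e.target_table)
            if prev is None or e.event_time > prev:
                last_ddl[e.target_table] = e.event_time

    # Drop pre-DDL events; tables with no recorded DDL keep all events.
    kept = []
    for e in sorted(events, key=lambda ev: ev.event_time):
        cutoff = last_ddl.get(e.target_table)
        if cutoff is None or e.event_time >= cutoff:
            kept.append(e)
    return kept
```

In this sketch, an INSERT into a table that ran before the table's latest CREATE OR REPLACE is dropped, while a MERGE that ran afterwards is kept. Tables with no recorded DDL pass through unchanged, which is a choice made for the example rather than documented behavior.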

Did you know?

For the most accurate lineage, use Databricks serverless compute for SQL transformations whenever possible. This provides Atlan with the statement type metadata needed for intelligent, event-driven lineage management.
