Skip to main content

Lineage analysis

Connect docs via MCP

This page provides SQL queries for lineage analysis scenarios using Atlan Lakehouse.

These queries use the ASSETS table from the GOLD namespace in the Lakehouse catalog and the LINEAGE table, which you create separately in your warehouse (see Set up lineage tables).

Before you begin

Before you run these queries, make sure:

  • The LINEAGE table is set up in your warehouse. This table isn't part of the native GOLD namespace and must be created separately. See Set up lineage tables.

You can use these SQL examples with Snowflake or another Iceberg REST–compatible engine connected to your Lakehouse.

Assets without lineage

Identify assets that are disconnected from data pipelines - potentially orphaned tables, test assets, or deprecated resources consuming storage and compute resources without providing value. These are prime candidates for cleanup, archival, or investigation.

This query identifies active data assets (tables, views, materialized views) that have no lineage connections and haven't been updated recently. It calculates how long assets have been dormant and provides cleanup recommendations based on inactivity period, certification status, and ownership. The query helps you prioritize which assets to investigate, archive, or delete to reduce storage costs and platform clutter. Time-based filters focus on truly orphaned assets versus recently created assets that simply haven't been connected yet.

-- ============================================
-- ASSETS WITHOUT LINEAGE: Find Orphaned/Unused Assets for Cleanup
-- Assets not having any lineage and have not been updated or used in the last 90 days
-- gold.assets: Lakehouse catalog (GOLD namespace)
-- ============================================

SELECT
a.guid,
a.asset_name, -- Asset name
a.asset_type, -- Asset type (Table, View, etc.)
a.asset_qualified_name, -- Full path
a.connector_name, -- Data platform
a.description,
a.certificate_status,
a.status,
a.owner_users,
a.tags, -- Tags attached to asset

-- Confirm no lineage
a.has_lineage,

-- Metadata timestamps
TO_TIMESTAMP_LTZ(a.created_at / 1000) AS CREATED_DATE,
TO_TIMESTAMP_LTZ(a.updated_at / 1000) AS LAST_UPDATED,

-- Calculate age
DATEDIFF(
day,
TO_TIMESTAMP_LTZ(a.created_at / 1000),
CURRENT_TIMESTAMP()
) AS DAYS_SINCE_CREATION,

DATEDIFF(
day,
TO_TIMESTAMP_LTZ(a.updated_at / 1000),
CURRENT_TIMESTAMP()
) AS DAYS_SINCE_UPDATE,

-- Cleanup recommendation based on inactivity and metadata
CASE
WHEN a.certificate_status = 'DEPRECATED'
THEN '🔴 SAFE TO DELETE - Already deprecated'

WHEN DATEDIFF(
day,
TO_TIMESTAMP_LTZ(a.updated_at / 1000),
CURRENT_TIMESTAMP()
) > 180
AND a.status = 'ACTIVE'
THEN '🟡 REVIEW FOR DELETION - No activity in 6+ months'

WHEN DATEDIFF(
day,
TO_TIMESTAMP_LTZ(a.created_at / 1000),
CURRENT_TIMESTAMP()
) <= 7
THEN '🟢 KEEP - Recently created, may not be connected yet'

WHEN (a.owner_users IS NULL)
THEN '🟡 INVESTIGATE - No owner, likely test/temp asset'

ELSE '🟢 REVIEW - May be intentionally standalone'
END AS CLEANUP_RECOMMENDATION

FROM gold.assets a
WHERE
-- No lineage connections
(
a.has_lineage = FALSE

)
-- Only data assets
AND a.asset_type IN ('Table', 'View', 'MaterializedView')
AND a.status = 'ACTIVE'

-- Filter by connector (can specify multiple or single connector)
AND a.connector_name IN ('snowflake', 'redshift', 'bigquery', 'databricks')

-- Only assets not updated in last 90 days (adjust as needed)
AND DATEDIFF(
day,
TO_TIMESTAMP_LTZ(a.updated_at / 1000),
CURRENT_TIMESTAMP()
) > 90

ORDER BY
DAYS_SINCE_UPDATE DESC, -- Oldest first
a.asset_name;
Sample output
ASSET_NAMEASSET_TYPECONNECTOR_NAMEDAYS_SINCE_UPDATECLEANUP_RECOMMENDATION
ACCESSCONTROL_ENTITYTablesnowflake71REVIEW - May be intentionally standalone
ADFACTIVITY_ENTITYTablesnowflake71REVIEW - May be intentionally standalone
ADFDATAFLOW_ENTITYTablesnowflake71REVIEW - May be intentionally standalone
ADFDATASET_ENTITYTablesnowflake71REVIEW - May be intentionally standalone

Circular dependencies

Detect assets with circular lineage where data flows back to itself through intermediate steps. This can indicate recursive queries (intentional), design issues (unintentional), or performance problems that need investigation.

This query identifies assets that reference themselves in their lineage graph, creating circular dependencies. This can occur with recursive CTEs, self-referencing views, or unintentional design issues. The query shows the direction of the circular reference (upstream or downstream) and helps you understand whether it's an intentional recursive pattern or a data modeling issue that needs correction.

-- ============================================
-- CIRCULAR DEPENDENCIES: Find Assets with Lineage to Themselves
-- Detects self-referencing patterns that may indicate recursive queries or design issues
-- gold.assets: Lakehouse catalog (GOLD namespace)
-- LINEAGE: Customer-managed table (see "Set up lineage tables")
-- ============================================

SELECT
a.guid AS ASSET_GUID,
a.asset_name, -- Asset with circular reference
a.asset_type, -- Type (Table, View, etc.)
a.asset_qualified_name AS ASSET_PATH, -- Full path
a.connector_name,
a.owner_users,
a.certificate_status,
TO_TIMESTAMP_LTZ(a.updated_at / 1000) AS LAST_UPDATED,

-- Lineage relationship back to itself
l.direction AS LINEAGE_DIRECTION, -- UPSTREAM or DOWNSTREAM
l.level AS CIRCULAR_PATH_LENGTH, -- How many hops in the circle
l.related_name AS RELATED_ASSET_NAME,
l.related_type AS RELATED_ASSET_TYPE,

-- Flag potential issues
CASE
WHEN l.level = 1 THEN '⚠️ DIRECT SELF-REFERENCE - Review immediately'
WHEN l.level <= 3 THEN '⚠️ SHORT CIRCULAR PATH - May cause performance issues'
ELSE '⚠️ LONG CIRCULAR PATH - Complex dependency chain'
END AS CIRCULAR_DEPENDENCY_SEVERITY

FROM gold.assets a
INNER JOIN LINEAGE l
ON a.guid = l.start_guid -- Start from this asset

WHERE
-- The related asset in lineage is the same asset (circular)
l.related_guid = a.guid

-- Focus on specific asset types
AND a.asset_type IN ('Table', 'View', 'MaterializedView', 'DbtModel')

-- Focus on active assets
AND a.status = 'ACTIVE'

ORDER BY
l.level ASC, -- Direct circles first
a.asset_name;
Sample output
ASSET_NAMEASSET_TYPECONNECTOR_NAMELINEAGE_DIRECTIONCIRCULAR_PATH_LENGTHCIRCULAR_DEPENDENCY_SEVERITY
INSTACART_AISLESTableteradataDOWNSTREAM1DIRECT SELF-REFERENCE - Review immediately
INSTACART_AISLESTableteradataUPSTREAM1DIRECT SELF-REFERENCE - Review immediately
INSTACART_BEVERAGES_ORDER_CUSTOMERTablesnowflakeDOWNSTREAM1DIRECT SELF-REFERENCE - Review immediately
INSTACART_BEVERAGES_ORDER_CUSTOMERTablesnowflakeUPSTREAM1DIRECT SELF-REFERENCE - Review immediately
INSTACART_CUSTOMER_DETAILSTableteradataUPSTREAM1DIRECT SELF-REFERENCE - Review immediately
INSTACART_CUSTOMER_DETAILSTablesnowflakeDOWNSTREAM1DIRECT SELF-REFERENCE - Review immediately
INSTACART_DEPARTMENTSTableteradataUPSTREAM1DIRECT SELF-REFERENCE - Review immediately

Unexpected asset archival

Investigate sudden lineage drops or assets disappearing from Atlan. When assets are archived (soft-deleted), any lineage edges connected to them disappear, which can cause unexpected gaps in your lineage graph. These queries help you identify which assets were archived, when, by whom, and how the archival is distributed across connectors and asset types—so you can determine whether the archival was expected or warrants further investigation.

Use these queries when:

  • Lineage edges or downstream impact analysis suddenly shows fewer connections than expected.
  • A user reports that previously visible assets are no longer searchable in Atlan.
  • A workflow run summary reports an unexpectedly high archived count.
  • You suspect a recent permission update, configuration change, or include/exclude filter change has caused over-aggressive archival.

List archived assets in last 24 hours

Lists every asset archived in the last 24 hours along with its identity, connector, owner, certification status, and the user that archived it. Use this query to confirm what was archived, identify the actor responsible (a user, a workflow service account, or a connector), and decide whether the archival was expected. The 24-hour window can be adjusted by changing the DATEADD interval—widen it when investigating older drops, narrow it when correlating with a specific workflow run.

-- ============================================
-- ARCHIVED ASSETS: List Assets Archived in the Last 24 Hours
-- Helps identify unexpected archival events that cause lineage drops
-- gold.assets: Lakehouse catalog (GOLD namespace)
-- ============================================

SELECT
a.guid,
a.asset_type,
a.asset_name,
a.asset_qualified_name,
a.connector_name,
a.connector_qualified_name,
a.updated_by AS archived_by, -- User or workflow that archived the asset
TO_TIMESTAMP_LTZ(a.updated_at / 1000) AS archived_at,
a.owner_users,
a.certificate_status

FROM <YOUR_LAKEHOUSE_DB>.GOLD.ASSETS a

WHERE
a.status = 'DELETED' -- Archived (soft-deleted) assets only
AND TO_TIMESTAMP_LTZ(a.updated_at / 1000)
>= DATEADD('hour', -24, CURRENT_TIMESTAMP()) -- Adjust window as needed

ORDER BY
archived_at DESC;

Count archived assets by connection and asset type

Aggregates archival events by connector, asset type, and the user that archived them. Use this query to spot patterns—for example, a single connector accounting for most of the archival, a single user or workflow archiving thousands of assets, or a specific asset type being disproportionately affected. The first and last archival timestamps for each group help you correlate the event with a specific workflow run or admin action.

-- ============================================
-- ARCHIVED ASSETS BY CONNECTION: Aggregate Archival by Connector and Asset Type
-- Spot which connectors, asset types, or users are driving lineage drops
-- gold.assets: Lakehouse catalog (GOLD namespace)
-- ============================================

SELECT
a.connector_name,
a.connector_qualified_name,
a.asset_type,
a.updated_by AS archived_by,
COUNT(*) AS archived_count,
MIN(TO_TIMESTAMP_LTZ(a.updated_at / 1000)) AS first_archived_at,
MAX(TO_TIMESTAMP_LTZ(a.updated_at / 1000)) AS last_archived_at

FROM <YOUR_LAKEHOUSE_DB>.GOLD.ASSETS a

WHERE
a.status = 'DELETED'
AND TO_TIMESTAMP_LTZ(a.updated_at / 1000)
>= DATEADD('hour', -24, CURRENT_TIMESTAMP()) -- Adjust window as needed

GROUP BY
a.connector_name,
a.connector_qualified_name,
a.asset_type,
a.updated_by

ORDER BY
archived_count DESC;

Diagnostic workflow

Once you have the results from the queries in the previous section, follow these steps to determine whether the archival was expected and what caused it:

  1. Review the archived count from the workflow run summary. Open the most recent crawl or miner workflow run for the affected connector and check the archived asset count. Compare it to historical runs to confirm the spike.
  2. Filter assets in Atlan using the is archived property. Use the asset listing or search to confirm which specific assets are archived and cross-check them with the SQL results.
  3. Validate whether the archival was expected. Confirm with the connection owner or data platform team whether the archived assets were intentionally dropped at the source (table deletions, schema migrations, decommissioned objects).
  4. Check for permission updates, configuration changes, or include/exclude filter changes. If the archival is unexpected, the most common causes are:
    • Permission changes at the source that hide tables or schemas from the connector's service account, causing the connector to treat them as deleted.
    • Include/exclude filter changes in the connection configuration that narrow the crawled scope, causing previously crawled assets to be archived on the next run.
    • Connection or credential changes that point the connector at a different database or schema.
    • Source system changes (renamed databases, dropped schemas) that legitimately make the assets unreachable.

If none of these causes apply, raise a support ticket with the workflow run ID, the affected connector, and the query results so engineering can investigate further.


Lineage coverage & analytics

These queries provide high-level metrics and analytics about lineage coverage across your data platform. These metrics are essential for measuring data governance maturity and identifying gaps in lineage capture. They help answer questions like:

  • What percentage of assets have lineage?
  • Which connectors have the best lineage coverage?
  • How many source vs sink vs process assets exist?

Overall lineage coverage summary

Provides a high-level dashboard view of lineage coverage across the entire data platform. Calculates the overall percentage of assets that have lineage information and breaks down assets by their role in data pipelines (sources that produce data, transformers that process data, and sinks that consume data).

-- ============================================
-- LINEAGE COVERAGE SUMMARY: Overall Platform Metrics
-- Shows percentage of assets with lineage and breakdown by status
-- gold.assets: Lakehouse catalog (GOLD namespace)
-- ============================================
SELECT
COUNT(*) AS TOTAL_ASSETS,

-- Assets with lineage
SUM(CASE WHEN has_lineage = TRUE THEN 1 ELSE 0 END) AS ASSETS_WITH_LINEAGE,
SUM(CASE WHEN has_lineage = FALSE THEN 1 ELSE 0 END) AS ASSETS_WITHOUT_LINEAGE,

-- Calculate percentages
ROUND(
(SUM(CASE WHEN has_lineage = TRUE THEN 1 ELSE 0 END) * 100.0) / COUNT(*),
2
) AS LINEAGE_COVERAGE_PCT

FROM gold.assets
WHERE
status = 'ACTIVE'; -- Only active assets
Sample output
TOTAL_ASSETSASSETS_WITH_LINEAGEASSETS_WITHOUT_LINEAGELINEAGE_COVERAGE_PCT
91234836944953516040.49

Lineage coverage by connector

Compares lineage coverage across different data platforms (Snowflake, BigQuery, Tableau, etc.) to identify which systems have strong lineage capture and which need improvement. Shows both the percentage of assets with lineage and the average number of connections per asset, helping you understand data flow density. Use this to prioritize which connectors need better lineage instrumentation or to validate that lineage is working correctly after new connector deployments.

-- ============================================
-- LINEAGE COVERAGE BY CONNECTOR: Platform-Specific Metrics
-- Compare lineage coverage across different data platforms
-- gold.assets: Lakehouse catalog (GOLD namespace)
-- ============================================
SELECT
connector_name,
asset_type,

COUNT(*) AS TOTAL_ASSETS,
SUM(CASE WHEN has_lineage = TRUE THEN 1 ELSE 0 END) AS WITH_LINEAGE,
SUM(CASE WHEN has_lineage = FALSE THEN 1 ELSE 0 END) AS WITHOUT_LINEAGE,

-- Lineage coverage percentage
ROUND(
(SUM(CASE WHEN has_lineage = TRUE THEN 1 ELSE 0 END) * 100.0) / COUNT(*),
2
) AS LINEAGE_COVERAGE_PCT

FROM gold.assets
WHERE
status = 'ACTIVE'
AND NULLIF(connector_name,'') IS NOT NULL
GROUP BY
connector_name,
asset_type
ORDER BY
connector_name,
LINEAGE_COVERAGE_PCT DESC,
TOTAL_ASSETS DESC;
Sample output
CONNECTOR_NAMEASSET_TYPETOTAL_ASSETSWITH_LINEAGEWITHOUT_LINEAGELINEAGE_COVERAGE_PCT
SqoopConnection1010.0
adlsADLSObject83537.5
adlsADLSContainer2020.0
adlsADLSAccount2020.0
adlsConnection2020.0
aiAIModel110100.0
airflowAirflowTask31310100.0
airflowColumnProcess5045590.0
airflowProcess4236685.71
airflowAirflowDag6060.0

Identify lineage hubs

Identifies the "hub" tables and views in your data ecosystem, those with the highest number of upstream and downstream connections. These are your most critical assets because they have the widest blast radius if something goes wrong. The query categorizes each asset by criticality based on downstream dependency count (20+ is CRITICAL, 10-19 is HIGH, etc.). Use this to prioritize which assets need the most rigorous data quality monitoring, change management processes, and documentation. These are the tables you must protect from breaking changes.

-- ============================================
-- MOST CONNECTED ASSETS: Identify Critical Lineage Hubs
-- Find tables/views with highest number of dependencies by counting connections from LINEAGE table
-- gold.assets: Lakehouse catalog (GOLD namespace)
-- LINEAGE: Customer-managed table (see "Set up lineage tables")
-- ============================================

SELECT
a.asset_name,
a.asset_type,
a.asset_qualified_name,
a.connector_name,
a.certificate_status,
a.owner_users,
TO_TIMESTAMP_LTZ(a.updated_at / 1000) AS LAST_UPDATED,
DATEDIFF(day, TO_TIMESTAMP_LTZ(a.updated_at / 1000), CURRENT_TIMESTAMP()) AS DAYS_SINCE_UPDATE,

-- Count upstream connections (direct sources feeding this asset)
COUNT(CASE WHEN l.direction = 'UPSTREAM' AND l.level = 1 THEN 1 END) AS UPSTREAM_COUNT,

-- Count downstream connections (direct consumers using this asset)
COUNT(CASE WHEN l.direction = 'DOWNSTREAM' AND l.level = 1 THEN 1 END) AS DOWNSTREAM_COUNT,

-- Total direct connections (both upstream and downstream)
COUNT(CASE WHEN l.level = 1 THEN 1 END) AS TOTAL_CONNECTIONS,

-- Classify criticality based on downstream dependencies (blast radius)
CASE
WHEN COUNT(CASE WHEN l.direction = 'DOWNSTREAM' AND l.level = 1 THEN 1 END) >= 20
THEN '🔴 CRITICAL - 20+ downstream dependencies'
WHEN COUNT(CASE WHEN l.direction = 'DOWNSTREAM' AND l.level = 1 THEN 1 END) >= 10
THEN '🟠 HIGH IMPACT - 10-19 downstream dependencies'
WHEN COUNT(CASE WHEN l.direction = 'DOWNSTREAM' AND l.level = 1 THEN 1 END) >= 5
THEN '🟡 MODERATE IMPACT - 5-9 downstream dependencies'
ELSE '🟢 LOW IMPACT - <5 downstream dependencies'
END AS CRITICALITY_LEVEL

FROM gold.assets a
LEFT JOIN LINEAGE l
ON a.guid = l.start_guid -- Join to get all lineage connections for this asset

WHERE
a.has_lineage = TRUE
AND a.asset_type IN ('Table', 'View', 'MaterializedView')
AND a.status = 'ACTIVE'

GROUP BY
a.asset_name,
a.asset_type,
a.asset_qualified_name,
a.connector_name,
a.certificate_status,
a.owner_users,
a.updated_at -- Need to group by updated_at to use in DATEDIFF

HAVING
COUNT(CASE WHEN l.level = 1 THEN 1 END) > 0 -- Only assets with at least one direct connection

ORDER BY
TOTAL_CONNECTIONS DESC;
Sample output
ASSET_NAMEASSET_TYPECONNECTOR_NAMEUPSTREAM_COUNTDOWNSTREAM_COUNTTOTAL_CONNECTIONSCRITICALITY_LEVEL
ORDERS_DENORMALIZEDTablesnowflake31619HIGH IMPACT - 10-19 downstream dependencies
FACT_ORDERSTablesnowflake21416HIGH IMPACT - 10-19 downstream dependencies
fct_salesTablebigquery8816MODERATE IMPACT - 5-9 downstream dependencies
FCT_SALESTablesnowflake8715MODERATE IMPACT - 5-9 downstream dependencies
DIM_CUSTOMERTablesnowflake31215HIGH IMPACT - 10-19 downstream dependencies
dim_employeeTabledatabricks11213HIGH IMPACT - 10-19 downstream dependencies
peopleTablepostgres01313HIGH IMPACT - 10-19 downstream dependencies
dim_customerTableredshift31013HIGH IMPACT - 10-19 downstream dependencies
dim_customerTabledatabricks3912MODERATE IMPACT - 5-9 downstream dependencies
DIM_STOCKITEMTablesnowflake3912MODERATE IMPACT - 5-9 downstream dependencies