Skip to main content

Lineage analysis

This document provides SQL queries for lineage analysis scenarios in Atlan using Lakehouse (MDLH).

This page provides SQL templates and guidance for lineage analysis using the GOLD.ASSET_LOOKUP_TABLE and GOLD.FULL_LINEAGE tables in your Lakehouse.

Before you begin

Before you run these queries, make sure:

You can use these SQL examples with Snowflake or another Iceberg REST–compatible engine connected to your Lakehouse.

Assets without lineage

Identify assets that are disconnected from data pipelines - potentially orphaned tables, test assets, or deprecated resources consuming storage and compute resources without providing value. These are prime candidates for cleanup, archival, or investigation.

This query identifies active data assets (tables, views, materialized views) that have no lineage connections and haven't been updated recently. It calculates how long assets have been dormant and provides cleanup recommendations based on inactivity period, certification status, and ownership. The query helps you prioritize which assets to investigate, archive, or delete to reduce storage costs and platform clutter. Time-based filters focus on truly orphaned assets versus recently created assets that simply haven't been connected yet.

-- ============================================
-- ASSETS WITHOUT LINEAGE: Find Orphaned/Unused Assets for Cleanup
-- Assets not having any lineage and have not been updated or used in the last 90 days
-- ============================================

SELECT
A.GUID,
A.ASSET_NAME, -- Asset name
A.ASSET_TYPE, -- Asset type (Table, View, etc.)
A.ASSET_QUALIFIED_NAME, -- Full path
A.CONNECTOR_NAME, -- Data platform
A.DESCRIPTION,
A.CERTIFICATE_STATUS,
A.STATUS,
A.OWNER_USERS,
A.TAGS, -- Tags attached to asset

-- Confirm no lineage
A.HAS_LINEAGE,

-- Metadata timestamps
TO_TIMESTAMP_LTZ(A.CREATED_AT / 1000) AS CREATED_DATE,
TO_TIMESTAMP_LTZ(A.UPDATED_AT / 1000) AS LAST_UPDATED,

-- Calculate age
DATEDIFF(
day,
TO_TIMESTAMP_LTZ(A.CREATED_AT / 1000),
CURRENT_TIMESTAMP()
) AS DAYS_SINCE_CREATION,

DATEDIFF(
day,
TO_TIMESTAMP_LTZ(A.UPDATED_AT / 1000),
CURRENT_TIMESTAMP()
) AS DAYS_SINCE_UPDATE,

-- Cleanup recommendation based on inactivity and metadata
CASE
WHEN A.CERTIFICATE_STATUS = 'DEPRECATED'
THEN '🔴 SAFE TO DELETE - Already deprecated'

WHEN DATEDIFF(
day,
TO_TIMESTAMP_LTZ(A.UPDATED_AT / 1000),
CURRENT_TIMESTAMP()
) > 180
AND A.STATUS = 'ACTIVE'
THEN '🟡 REVIEW FOR DELETION - No activity in 6+ months'

WHEN DATEDIFF(
day,
TO_TIMESTAMP_LTZ(A.CREATED_AT / 1000),
CURRENT_TIMESTAMP()
) <= 7
THEN '🟢 KEEP - Recently created, may not be connected yet'

WHEN (A.OWNER_USERS IS NULL)
THEN '🟡 INVESTIGATE - No owner, likely test/temp asset'

ELSE '🟢 REVIEW - May be intentionally standalone'
END AS CLEANUP_RECOMMENDATION

FROM ASSETS A
WHERE
-- No lineage connections
(
A.HAS_LINEAGE = FALSE

)
-- Only data assets
AND A.ASSET_TYPE IN ('Table', 'View', 'MaterializedView')
AND A.STATUS = 'ACTIVE'

-- Filter by connector (can specify multiple or single connector)
AND A.CONNECTOR_NAME IN ('snowflake', 'redshift', 'bigquery', 'databricks')

-- Only assets not updated in last 90 days (adjust as needed)
AND DATEDIFF(
day,
TO_TIMESTAMP_LTZ(A.UPDATED_AT / 1000),
CURRENT_TIMESTAMP()
) > 90

ORDER BY
DAYS_SINCE_UPDATE DESC, -- Oldest first
A.ASSET_NAME;
Sample output
ASSET_NAMEASSET_TYPECONNECTOR_NAMEDAYS_SINCE_UPDATECLEANUP_RECOMMENDATION
ACCESSCONTROL_ENTITYTablesnowflake71REVIEW - May be intentionally standalone
ADFACTIVITY_ENTITYTablesnowflake71REVIEW - May be intentionally standalone
ADFDATAFLOW_ENTITYTablesnowflake71REVIEW - May be intentionally standalone
ADFDATASET_ENTITYTablesnowflake71REVIEW - May be intentionally standalone

Circular dependencies

Detect assets with circular lineage where data flows back to itself through intermediate steps. This can indicate recursive queries (intentional), design issues (unintentional), or performance problems that need investigation.

This query identifies assets that reference themselves in their lineage graph, creating circular dependencies. This can occur with recursive CTEs, self-referencing views, or unintentional design issues. The query shows the direction of the circular reference (upstream or downstream) and helps you understand whether it's an intentional recursive pattern or a data modeling issue that needs correction.

USE ATLAN_GOLD;

-- ============================================
-- CIRCULAR DEPENDENCIES: Find Assets with Lineage to Themselves
-- Detects self-referencing patterns that may indicate recursive queries or design issues
-- ============================================

SELECT
A.GUID AS ASSET_GUID,
A.ASSET_NAME, -- Asset with circular reference
A.ASSET_TYPE, -- Type (Table, View, etc.)
A.ASSET_QUALIFIED_NAME AS ASSET_PATH, -- Full path
A.CONNECTOR_NAME,
A.OWNER_USERS,
A.CERTIFICATE_STATUS,
TO_TIMESTAMP_LTZ(A.UPDATED_AT / 1000) AS LAST_UPDATED,

-- Lineage relationship back to itself
L.DIRECTION AS LINEAGE_DIRECTION, -- UPSTREAM or DOWNSTREAM
L.LEVEL AS CIRCULAR_PATH_LENGTH, -- How many hops in the circle
L.RELATED_NAME AS RELATED_ASSET_NAME,
L.RELATED_TYPE AS RELATED_ASSET_TYPE,

-- Flag potential issues
CASE
WHEN L.LEVEL = 1 THEN '⚠️ DIRECT SELF-REFERENCE - Review immediately'
WHEN L.LEVEL <= 3 THEN '⚠️ SHORT CIRCULAR PATH - May cause performance issues'
ELSE '⚠️ LONG CIRCULAR PATH - Complex dependency chain'
END AS CIRCULAR_DEPENDENCY_SEVERITY

FROM ASSETS A
INNER JOIN LINEAGE L
ON A.GUID = L.START_GUID -- Start from this asset

WHERE
-- The related asset in lineage is the same asset (circular)
L.RELATED_GUID = A.GUID

-- Focus on specific asset types
AND A.ASSET_TYPE IN ('Table', 'View', 'MaterializedView', 'DbtModel')

-- Focus on active assets
AND A.STATUS = 'ACTIVE'

ORDER BY
L.LEVEL ASC, -- Direct circles first
A.ASSET_NAME;
Sample output
ASSET_NAMEASSET_TYPECONNECTOR_NAMELINEAGE_DIRECTIONCIRCULAR_PATH_LENGTHCIRCULAR_DEPENDENCY_SEVERITY
INSTACART_AISLESTableteradataDOWNSTREAM1DIRECT SELF-REFERENCE - Review immediately
INSTACART_AISLESTableteradataUPSTREAM1DIRECT SELF-REFERENCE - Review immediately
INSTACART_BEVERAGES_ORDER_CUSTOMERTablesnowflakeDOWNSTREAM1DIRECT SELF-REFERENCE - Review immediately
INSTACART_BEVERAGES_ORDER_CUSTOMERTablesnowflakeUPSTREAM1DIRECT SELF-REFERENCE - Review immediately
INSTACART_CUSTOMER_DETAILSTableteradataUPSTREAM1DIRECT SELF-REFERENCE - Review immediately
INSTACART_CUSTOMER_DETAILSTablesnowflakeDOWNSTREAM1DIRECT SELF-REFERENCE - Review immediately
INSTACART_DEPARTMENTSTableteradataUPSTREAM1DIRECT SELF-REFERENCE - Review immediately

Lineage coverage & analytics

These queries provide high-level metrics and analytics about lineage coverage across your data platform. These metrics are essential for measuring data governance maturity and identifying gaps in lineage capture. They help answer questions like:

  • What percentage of assets have lineage?
  • Which connectors have the best lineage coverage?
  • How many source vs sink vs process assets exist?

Overall lineage coverage summary

Provides a high-level dashboard view of lineage coverage across the entire data platform. Calculates the overall percentage of assets that have lineage information and breaks down assets by their role in data pipelines (sources that produce data, transformers that process data, and sinks that consume data).

-- ============================================
-- LINEAGE COVERAGE SUMMARY: Overall Platform Metrics
-- Shows percentage of assets with lineage and breakdown by status
-- ============================================
SELECT
COUNT(*) AS TOTAL_ASSETS,

-- Assets with lineage
SUM(CASE WHEN HAS_LINEAGE = TRUE THEN 1 ELSE 0 END) AS ASSETS_WITH_LINEAGE,
SUM(CASE WHEN HAS_LINEAGE = FALSE THEN 1 ELSE 0 END) AS ASSETS_WITHOUT_LINEAGE,

-- Calculate percentages
ROUND(
(SUM(CASE WHEN HAS_LINEAGE = TRUE THEN 1 ELSE 0 END) * 100.0) / COUNT(*),
2
) AS LINEAGE_COVERAGE_PCT

FROM ASSETS
WHERE
STATUS = 'ACTIVE'; -- Only active assets
Sample output
TOTAL_ASSETSASSETS_WITH_LINEAGEASSETS_WITHOUT_LINEAGELINEAGE_COVERAGE_PCT
91234836944953516040.49

Lineage coverage by connector

Compares lineage coverage across different data platforms (Snowflake, BigQuery, Tableau, etc.) to identify which systems have strong lineage capture and which need improvement. Shows both the percentage of assets with lineage and the average number of connections per asset, helping you understand data flow density. Use this to prioritize which connectors need better lineage instrumentation or to validate that lineage is working correctly after new connector deployments.

-- ============================================
-- LINEAGE COVERAGE BY CONNECTOR: Platform-Specific Metrics
-- Compare lineage coverage across different data platforms
-- ============================================
SELECT
CONNECTOR_NAME,
ASSET_TYPE,

COUNT(*) AS TOTAL_ASSETS,
SUM(CASE WHEN HAS_LINEAGE = TRUE THEN 1 ELSE 0 END) AS WITH_LINEAGE,
SUM(CASE WHEN HAS_LINEAGE = FALSE THEN 1 ELSE 0 END) AS WITHOUT_LINEAGE,

-- Lineage coverage percentage
ROUND(
(SUM(CASE WHEN HAS_LINEAGE = TRUE THEN 1 ELSE 0 END) * 100.0) / COUNT(*),
2
) AS LINEAGE_COVERAGE_PCT

FROM ASSETS
WHERE
STATUS = 'ACTIVE'
AND NULLIF(CONNECTOR_NAME,'') IS NOT NULL
GROUP BY
CONNECTOR_NAME,
ASSET_TYPE
ORDER BY
CONNECTOR_NAME,
LINEAGE_COVERAGE_PCT DESC,
TOTAL_ASSETS DESC;
Sample output
CONNECTOR_NAMEASSET_TYPETOTAL_ASSETSWITH_LINEAGEWITHOUT_LINEAGELINEAGE_COVERAGE_PCT
SqoopConnection1010.0
adlsADLSObject83537.5
adlsADLSContainer2020.0
adlsADLSAccount2020.0
adlsConnection2020.0
aiAIModel110100.0
airflowAirflowTask31310100.0
airflowColumnProcess5045590.0
airflowProcess4236685.71
airflowAirflowDag6060.0

Identify lineage hubs

Identifies the "hub" tables and views in your data ecosystem, those with the highest number of upstream and downstream connections. These are your most critical assets because they have the widest blast radius if something goes wrong. The query categorizes each asset by criticality based on downstream dependency count (20+ is CRITICAL, 10-19 is HIGH, etc.). Use this to prioritize which assets need the most rigorous data quality monitoring, change management processes, and documentation. These are the tables you must protect from breaking changes.

-- ============================================
-- MOST CONNECTED ASSETS: Identify Critical Lineage Hubs
-- Find tables/views with highest number of dependencies by counting connections from LINEAGE table
-- ============================================

SELECT
A.ASSET_NAME,
A.ASSET_TYPE,
A.ASSET_QUALIFIED_NAME,
A.CONNECTOR_NAME,
A.CERTIFICATE_STATUS,
A.OWNER_USERS,
TO_TIMESTAMP_LTZ(A.UPDATED_AT / 1000) AS LAST_UPDATED,
DATEDIFF(day, TO_TIMESTAMP_LTZ(A.UPDATED_AT / 1000), CURRENT_TIMESTAMP()) AS DAYS_SINCE_UPDATE,

-- Count upstream connections (direct sources feeding this asset)
COUNT(CASE WHEN L.DIRECTION = 'UPSTREAM' AND L.LEVEL = 1 THEN 1 END) AS UPSTREAM_COUNT,

-- Count downstream connections (direct consumers using this asset)
COUNT(CASE WHEN L.DIRECTION = 'DOWNSTREAM' AND L.LEVEL = 1 THEN 1 END) AS DOWNSTREAM_COUNT,

-- Total direct connections (both upstream and downstream)
COUNT(CASE WHEN L.LEVEL = 1 THEN 1 END) AS TOTAL_CONNECTIONS,

-- Classify criticality based on downstream dependencies (blast radius)
CASE
WHEN COUNT(CASE WHEN L.DIRECTION = 'DOWNSTREAM' AND L.LEVEL = 1 THEN 1 END) >= 20
THEN '🔴 CRITICAL - 20+ downstream dependencies'
WHEN COUNT(CASE WHEN L.DIRECTION = 'DOWNSTREAM' AND L.LEVEL = 1 THEN 1 END) >= 10
THEN '🟠 HIGH IMPACT - 10-19 downstream dependencies'
WHEN COUNT(CASE WHEN L.DIRECTION = 'DOWNSTREAM' AND L.LEVEL = 1 THEN 1 END) >= 5
THEN '🟡 MODERATE IMPACT - 5-9 downstream dependencies'
ELSE '🟢 LOW IMPACT - <5 downstream dependencies'
END AS CRITICALITY_LEVEL

FROM ASSETS A
LEFT JOIN LINEAGE L
ON A.GUID = L.START_GUID -- Join to get all lineage connections for this asset

WHERE
A.HAS_LINEAGE = TRUE
AND A.ASSET_TYPE IN ('Table', 'View', 'MaterializedView')
AND A.STATUS = 'ACTIVE'

GROUP BY
A.ASSET_NAME,
A.ASSET_TYPE,
A.ASSET_QUALIFIED_NAME,
A.CONNECTOR_NAME,
A.CERTIFICATE_STATUS,
A.OWNER_USERS,
A.UPDATED_AT -- Need to group by UPDATED_AT to use in DATEDIFF

HAVING
COUNT(CASE WHEN L.LEVEL = 1 THEN 1 END) > 0 -- Only assets with at least one direct connection

ORDER BY
TOTAL_CONNECTIONS DESC;
Sample output
ASSET_NAMEASSET_TYPECONNECTOR_NAMEUPSTREAM_COUNTDOWNSTREAM_COUNTTOTAL_CONNECTIONSCRITICALITY_LEVEL
ORDERS_DENORMALIZEDTablesnowflake31619HIGH IMPACT - 10-19 downstream dependencies
FACT_ORDERSTablesnowflake21416HIGH IMPACT - 10-19 downstream dependencies
fct_salesTablebigquery8816MODERATE IMPACT - 5-9 downstream dependencies
FCT_SALESTablesnowflake8715MODERATE IMPACT - 5-9 downstream dependencies
DIM_CUSTOMERTablesnowflake31215HIGH IMPACT - 10-19 downstream dependencies
dim_employeeTabledatabricks11213HIGH IMPACT - 10-19 downstream dependencies
peopleTablepostgres01313HIGH IMPACT - 10-19 downstream dependencies
dim_customerTableredshift31013HIGH IMPACT - 10-19 downstream dependencies
dim_customerTabledatabricks3912MODERATE IMPACT - 5-9 downstream dependencies
DIM_STOCKITEMTablesnowflake3912MODERATE IMPACT - 5-9 downstream dependencies