Extract lineage and usage from Databricks
Retrieve lineage from Unity Catalog, and usage and popularity metrics from query history or system tables. Both Atlan and Databricks strongly recommend the system tables method.
Prerequisites
Before you begin, make sure you have:
- Set up authentication using a personal access token, an AWS service principal, or an Azure service principal
- Crawled assets from Databricks
- Reviewed the order of operations to understand workflow sequencing
- Unity Catalog enabled on your Databricks workspace for lineage and usage metrics extraction. You may also need to upgrade existing tables and views to Unity Catalog and reach out to your Databricks account executive to enable lineage in Unity Catalog.
Create extraction workflow
To extract lineage and usage from Databricks:
- In the top right of any screen, navigate to New and then click New Workflow.
- From the filters along the top, click Miner.
- From the list of packages, select Databricks Miner and click Setup Workflow.
Configure lineage extraction
Choose one of the following extraction methods: REST API, Offline, or System Table.

REST API

Atlan connects to your database and extracts lineage directly.
- For Connection, select the connection to extract. The crawler must have already run for you to select a connection.
- Click Next.
Offline

Extract lineage using the offline extraction method and make the results available in S3.
- For Connection, select the connection to extract. The crawler must have already run for you to select a connection.
- For Bucket name, enter the name of your S3 bucket.
- For Bucket prefix, enter the S3 prefix under which all the metadata files exist. These include `extracted-lineage/result-0.json` and `extracted-query-history/result-0.json`.
- For Bucket region, enter the name of the S3 region.
- Click Next.
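For the offline method, the miner reads its inputs from fixed paths beneath the configured bucket prefix. The following sketch just illustrates that layout; the bucket prefix value is a placeholder:

```python
# Build the S3 object keys the miner reads for a given bucket prefix.
# The two file names follow the offline extractor's output convention;
# the prefix passed in below is a placeholder.

def expected_miner_keys(prefix: str) -> list[str]:
    """Return the S3 keys expected under the configured bucket prefix."""
    prefix = prefix.strip("/")
    files = ["extracted-lineage/result-0.json", "extracted-query-history/result-0.json"]
    return [f"{prefix}/{f}" if prefix else f for f in files]

print(expected_miner_keys("databricks/prod"))
# → ['databricks/prod/extracted-lineage/result-0.json',
#    'databricks/prod/extracted-query-history/result-0.json']
```

If either object is missing under the prefix, the corresponding lineage or query history data can't be mined, so verify both keys exist before running the workflow.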
System Table

Atlan connects to your database and queries system tables to extract lineage directly.
- For Connection, select the connection to extract. The crawler must have already run for you to select a connection.
- For Extraction Catalog Type, choose one of the following:
  - Default: Select to fetch lineage from the `system` catalog and `access` schema.
  - Cloned_catalog: Select to fetch lineage from a cloned catalog and schema. Before proceeding, make sure the following prerequisites are met:
    - You have already created cloned views named `column_lineage` and `table_lineage` in your schema. If not, follow the steps in Create cloned views of system tables.
    - The `atlan-user` must have `SELECT` permissions on both views to access lineage data.
    Then, provide values for the following fields:
    - Cloned Catalog Name: Catalog containing the cloned views.
    - Cloned Schema Name: Schema containing the cloned views.
- For SQL Warehouse ID, enter the ID you copied from your SQL warehouse.
- To enable lineage tracking at the file path level for volumes or external locations, enable File Path Lineage.
- Click Next.
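If you still need to create the cloned lineage views, the statements follow a simple pattern. This is a hedged sketch only: the catalog and schema names are placeholders, and it assumes the `system.access.table_lineage` and `system.access.column_lineage` system tables as the sources; follow Create cloned views of system tables for the authoritative steps.

```python
# Generate DDL for cloned lineage views plus the SELECT grant for atlan-user.
# Catalog/schema names are placeholders; run the statements in a SQL warehouse.

def cloned_lineage_ddl(catalog: str, schema: str, principal: str = "atlan-user") -> list[str]:
    """Return CREATE VIEW and GRANT statements for both lineage views."""
    statements = []
    for view in ("table_lineage", "column_lineage"):
        statements.append(
            f"CREATE VIEW IF NOT EXISTS {catalog}.{schema}.{view} "
            f"AS SELECT * FROM system.access.{view};"
        )
        statements.append(
            f"GRANT SELECT ON VIEW {catalog}.{schema}.{view} TO `{principal}`;"
        )
    return statements

for stmt in cloned_lineage_ddl("lineage_clone", "atlan"):
    print(stmt)
```

The generated catalog and schema names here are whatever you then enter as Cloned Catalog Name and Cloned Schema Name in the workflow.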
Configure usage extraction
Query history is currently in public preview for Databricks.
Atlan extracts usage and popularity metrics from query history or system tables. This feature is currently limited to queries on SQL warehouses; queries on interactive clusters aren't supported. Additionally, expensive queries and compute costs for Databricks assets are currently unavailable due to limitations of the Databricks APIs.
- For Fetch Query History and Calculate Popularity, click Yes to retrieve usage and popularity metrics for your Databricks assets. To skip this step, click No and proceed to Run extractor.
- For Popularity Extraction Method, choose one of the following:
  - REST API: Extract usage and popularity metrics from query history.
  - System table: Extract metrics directly from system tables. Then configure:
    - Extraction catalog type for popularity: Choose where to fetch popularity data from:
      - Default: Uses the `system` catalog and `query` schema to fetch popularity metrics.
      - Cloned_catalog: Select to fetch popularity from cloned views in a separate catalog and schema. Before proceeding, make sure the following prerequisites are met:
        - The `query_history` view must exist in the provided schema.
        - The `atlan-user` must have `SELECT` permission on the view.
        Then, provide values for the following fields:
        - Cloned Catalog Name: The catalog that contains the `query_history` view.
        - Cloned Schema Name: The schema that contains the `query_history` view.
        For more information, see Create cloned views of system tables.
    - SQL Warehouse ID: Enter the ID you copied from your SQL warehouse.
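The popularity clone follows the same pattern as the lineage views. A sketch, with placeholder catalog and schema names, assuming the `system.query.history` system table as the source (see Create cloned views of system tables for the authoritative steps):

```python
# DDL sketch for the query_history cloned view plus its SELECT grant;
# the catalog and schema names below are placeholders.

catalog, schema = "popularity_clone", "atlan"
ddl = (
    f"CREATE VIEW IF NOT EXISTS {catalog}.{schema}.query_history "
    f"AS SELECT * FROM system.query.history;"
)
grant = f"GRANT SELECT ON VIEW {catalog}.{schema}.query_history TO `atlan-user`;"
print(ddl)
print(grant)
```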
Configure usage settings
- For Popularity Window (days), enter the number of days to include. The maximum limit is 30 days.
- For Start time, choose the earliest date from which to mine query history. If you're using the offline extraction method to extract query history from Databricks, skip this field. If running the miner for the first time, Atlan recommends setting a start date around three days prior to the current date and then scheduling it daily to build up to two weeks of query history. Mining two weeks of query history on the first miner run may cause delays. For all subsequent runs, Atlan requires a minimum lag of 24 to 48 hours to capture all the relevant transformations that were part of a session. Learn more about the miner logic at Troubleshooting usage and popularity metrics.
- For Excluded Users, type the names of users to exclude when calculating usage metrics for Databricks assets. Press `enter` after each name to add more names.
Run extractor
- To check for any permissions or other configuration issues before running the extractor, click Preflight checks. Preflight checks are currently supported only for the REST API and offline extraction methods. If you're using system tables, skip to step 2.
- Choose how to run the extractor:
- To run the extractor once immediately, at the bottom of the screen, click Run.
- To schedule the extractor to run hourly, daily, weekly, or monthly, at the bottom of the screen, click Schedule Run.
Once the extractor completes, you can view lineage for Databricks assets.
See also
- Extract on-premises Databricks lineage: Extract lineage from on-premises Databricks instances using the databricks-extractor tool and upload results to S3.
- What does Atlan crawl from Databricks: Reference documentation for all metadata assets, attributes, and relationships extracted by the Databricks connector.