Extract lineage and usage from Databricks
Retrieve lineage from Unity Catalog, and usage and popularity metrics from query history or system tables. Both Atlan and Databricks strongly recommend the system tables method.
Prerequisites
Before you begin, make sure you have:
- Set up authentication using a personal access token, an AWS service principal, or an Azure service principal
- Crawled assets from Databricks
- Reviewed the order of operations to understand workflow sequencing
- Unity Catalog enabled on your Databricks workspace for lineage and usage metrics extraction. You may also need to upgrade existing tables and views to Unity Catalog and reach out to your Databricks account executive to enable lineage in Unity Catalog.
Create extraction workflow
To extract lineage and usage from Databricks:
- In the top right of any screen, navigate to New and then click New Workflow.
- From the filters along the top, click Miner.
- From the list of packages, select Databricks Miner and click Setup Workflow.
Configure lineage extraction
Choose one of the following extraction methods: REST API, Offline, or System Table.

REST API

Atlan connects to your database and extracts lineage directly.
- For Connection, select the connection to extract. The crawler must have already run for you to select a connection.
- Click Next.
Offline

Extract lineage using the offline extraction method and make the results available in S3.
- For Connection, select the connection to extract. The crawler must have already run for you to select a connection.
- For Bucket name, enter the name of your S3 bucket.
- For Bucket prefix, enter the S3 prefix under which all the metadata files exist. These include `extracted-lineage/result-0.json` and `extracted-query-history/result-0.json`.
- For Bucket region, enter the name of the S3 region.
- Click Next.
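For the offline method, the miner reads its inputs from fixed paths beneath the configured bucket prefix. The following sketch just illustrates that layout; the bucket prefix value is a placeholder:

```python
# Build the S3 object keys the miner reads for a given bucket prefix.
# The two file names follow the offline extractor's output convention;
# the prefix passed in below is a placeholder.

def expected_miner_keys(prefix: str) -> list[str]:
    """Return the S3 keys expected under the configured bucket prefix."""
    prefix = prefix.strip("/")
    files = ["extracted-lineage/result-0.json", "extracted-query-history/result-0.json"]
    return [f"{prefix}/{f}" if prefix else f for f in files]

print(expected_miner_keys("databricks/prod"))
# → ['databricks/prod/extracted-lineage/result-0.json',
#    'databricks/prod/extracted-query-history/result-0.json']
```

If either object is missing under the prefix, the corresponding lineage or query history data can't be mined, so verify both keys exist before running the workflow.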
System Table

Atlan connects to your database and queries system tables to extract lineage directly.
- For Connection, select the connection to extract. The crawler must have already run for you to select a connection.
- For Extraction Catalog Type, choose one of the following:
  - Default: Select to fetch lineage from the `system` catalog and `access` schema.
  - Cloned_catalog: Select to fetch lineage from a cloned catalog and schema. Before proceeding, make sure the following prerequisites are met:
    - You have already created cloned views named `column_lineage` and `table_lineage` in your schema. If not, follow the steps in Create cloned views of system tables.
    - The `atlan-user` must have `SELECT` permissions on both views to access lineage data.
    Then, provide values for the following fields:
    - Cloned Catalog Name: Catalog containing the cloned views.
    - Cloned Schema Name: Schema containing the cloned views.
- For SQL Warehouse ID, enter the ID you copied from your SQL warehouse.
- To enable lineage tracking at the file path level for volumes or external locations, enable File Path Lineage.
- Click Next.
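If you still need to create the cloned lineage views, the statements follow a simple pattern. This is a hedged sketch only: the catalog and schema names are placeholders, and it assumes the `system.access.table_lineage` and `system.access.column_lineage` system tables as the sources; follow Create cloned views of system tables for the authoritative steps.

```python
# Generate DDL for cloned lineage views plus the SELECT grant for atlan-user.
# Catalog/schema names are placeholders; run the statements in a SQL warehouse.

def cloned_lineage_ddl(catalog: str, schema: str, principal: str = "atlan-user") -> list[str]:
    """Return CREATE VIEW and GRANT statements for both lineage views."""
    statements = []
    for view in ("table_lineage", "column_lineage"):
        statements.append(
            f"CREATE VIEW IF NOT EXISTS {catalog}.{schema}.{view} "
            f"AS SELECT * FROM system.access.{view};"
        )
        statements.append(
            f"GRANT SELECT ON VIEW {catalog}.{schema}.{view} TO `{principal}`;"
        )
    return statements

for stmt in cloned_lineage_ddl("lineage_clone", "atlan"):
    print(stmt)
```

The generated catalog and schema names here are whatever you then enter as Cloned Catalog Name and Cloned Schema Name in the workflow.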
Configure usage extraction
Query history is currently in public preview for Databricks.
Atlan extracts usage and popularity metrics from query history or system tables. This feature is currently limited to queries on SQL warehouses; queries on interactive clusters aren't supported. Additionally, expensive queries and compute costs for Databricks assets are currently unavailable due to limitations of the Databricks APIs.
- For Fetch Query History and Calculate Popularity, click Yes to retrieve usage and popularity metrics for your Databricks assets. To skip this step, click No and proceed to Run extractor.
- For Popularity Extraction Method, choose one of the following:
  - REST API: Extract usage and popularity metrics from query history.
  - System table: Extract metrics directly from system tables. Then configure:
    - Extraction catalog type for popularity: Choose where to fetch popularity data from:
      - Default: Uses the `system` catalog and `query` schema to fetch popularity metrics.
      - Cloned_catalog: Select to fetch popularity from cloned views in a separate catalog and schema. Before proceeding, make sure the following prerequisites are met:
        - The `query_history` view must exist in the provided schema.
        - The `atlan-user` must have `SELECT` permission on the view.
        Then, provide values for the following fields:
        - Cloned Catalog Name: The catalog that contains the `query_history` view.
        - Cloned Schema Name: The schema that contains the `query_history` view.
        For more information, see Create cloned views of system tables.
    - SQL Warehouse ID: Enter the ID you copied from your SQL warehouse.
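The popularity clone follows the same pattern as the lineage views. A sketch, with placeholder catalog and schema names, assuming the `system.query.history` system table as the source (see Create cloned views of system tables for the authoritative steps):

```python
# DDL sketch for the query_history cloned view plus its SELECT grant;
# the catalog and schema names below are placeholders.

catalog, schema = "popularity_clone", "atlan"
ddl = (
    f"CREATE VIEW IF NOT EXISTS {catalog}.{schema}.query_history "
    f"AS SELECT * FROM system.query.history;"
)
grant = f"GRANT SELECT ON VIEW {catalog}.{schema}.query_history TO `atlan-user`;"
print(ddl)
print(grant)
```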
Configure usage settings
- For Popularity Window (days), enter the number of days to include. The maximum limit is 30 days.
- For Start time, choose the earliest date from which to mine query history. If you're using the offline extraction method to extract query history from Databricks, skip this field. If running the miner for the first time, Atlan recommends setting a start date around three days prior to the current date and then scheduling it daily to build up to two weeks of query history. Mining two weeks of query history on the first miner run may cause delays. For all subsequent runs, Atlan requires a minimum lag of 24 to 48 hours to capture all the relevant transformations that were part of a session. Learn more about the miner logic at Troubleshooting usage and popularity metrics.
- For Excluded Users, type the names of users to exclude when calculating usage metrics for Databricks assets. Press `enter` after each name to add more names.
Run extractor
- To check for any permissions or other configuration issues before running the extractor, click Preflight checks. Preflight checks are currently supported only for the REST API and offline extraction methods. If you're using system tables, skip to step 2.
- Choose how to run the extractor:
- To run the extractor once immediately, at the bottom of the screen, click Run.
- To schedule the extractor to run hourly, daily, weekly, or monthly, at the bottom of the screen, click Schedule Run.
Once the extractor completes, you can view lineage for Databricks assets.
See also
- Extract on-premises Databricks lineage: Extract lineage from on-premises Databricks instances using the databricks-extractor tool and upload results to S3.
- What does Atlan crawl from Databricks: Reference documentation for all metadata assets, attributes, and relationships extracted by the Databricks connector.