How to extract lineage and usage from Databricks
Once you have crawled assets from Databricks, you can retrieve lineage from Unity Catalog and usage and popularity metrics from query history or system tables. This is supported for all three authentication methods: personal access token, AWS service principal, and Azure service principal.
Both Atlan and Databricks strongly recommend using the system tables method to extract lineage and usage and popularity metrics from Databricks.
Usage and popularity metrics can be retrieved for all Databricks users. However, your Databricks workspace must be Unity Catalog-enabled for the retrieval of lineage and usage and popularity metrics to succeed. You may also need to upgrade existing tables and views to Unity Catalog, as well as reach out to your Databricks account executive to enable lineage in Unity Catalog. (As of publishing, the feature is still in preview from Databricks on AWS and Azure.)
To retrieve lineage and usage from Databricks, review the order of operations and then complete the following steps.
Select the extractor
To select the Databricks lineage and usage extractor:
- In the top right of any screen, navigate to New and then click New Workflow.
- From the filters along the top, click Miner.
- From the list of packages, select Databricks Miner and click Setup Workflow.
Configure the lineage extractor
Choose your lineage extraction method:
- REST API: Atlan connects to your Databricks instance and extracts lineage directly.
- Offline: you first extract lineage yourself and make it available in S3.
- System Table: Atlan connects to your Databricks instance and queries system tables to extract lineage directly.
REST API
To configure the Databricks lineage extractor:
- For Connection, select the connection to extract. (To select a connection, the crawler must have already run.)
- Click Next to proceed.
Offline extraction method
Atlan supports the offline extraction method for extracting lineage from Databricks. This method uses Atlan's databricks-extractor tool to extract lineage. You will need to first extract lineage yourself and make it available in S3.
To enter your S3 details:
- For Connection, select the connection to extract. (To select a connection, the crawler must have already run.)
- For Bucket name, enter the name of your S3 bucket.
- For Bucket prefix, enter the S3 prefix under which all the metadata files exist. These include `extracted-lineage/result-0.json`, `extracted-query-history/result-0.json`, and so on.
- For Bucket region, enter the name of the S3 region.
- When complete, at the bottom of the screen, click Next.
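The layout the offline method expects can be sketched as a small helper that joins your bucket prefix with the metadata file names listed above. This is an illustrative sketch only; the prefix-joining behavior and file names beyond those documented are assumptions, and the bucket prefix used here is a placeholder.

```python
# Sketch: object keys the offline miner looks for under the bucket prefix.
# File names follow the examples in the docs (extracted-lineage/result-0.json,
# extracted-query-history/result-0.json); the prefix value is a placeholder.

def expected_keys(prefix: str) -> list[str]:
    """Build the S3 object keys for the offline lineage and query-history files."""
    prefix = prefix.strip("/")
    files = [
        "extracted-lineage/result-0.json",
        "extracted-query-history/result-0.json",
    ]
    return [f"{prefix}/{name}" for name in files]

for key in expected_keys("databricks/offline"):
    print(key)
```

Listing the objects under your prefix before running the workflow is a quick way to catch a misconfigured Bucket prefix early.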
System table
To configure the Databricks lineage extractor:
- For Connection, select the connection to extract. (To select a connection, the crawler must have already run.)
- For Extraction Catalog Type, choose one of the following:
  - Default: select this option to fetch lineage from the `system` catalog and `access` schema.
  - Cloned_catalog: select this option to fetch lineage from a cloned catalog and schema. Before proceeding, make sure the following prerequisites are met:
    - You have already created cloned views named `column_lineage` and `table_lineage` in your schema. If not, follow the steps in Create cloned views of system tables.
    - The `atlan-user` must have `SELECT` permissions on both views to access lineage data.
    Then, provide values for the following fields:
    - Cloned Catalog Name: the catalog containing the cloned views.
    - Cloned Schema Name: the schema containing the cloned views.
- For SQL Warehouse ID, enter the ID you copied from your SQL warehouse.
- Click Next to proceed.
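The choice between the two extraction catalog types can be sketched as a small resolver that returns the fully qualified view names the miner would read. The `system.access` default and the `table_lineage`/`column_lineage` view names come from the steps above; the cloned catalog and schema values are placeholders you supply in the form fields.

```python
# Sketch: resolve the lineage views for each extraction catalog type.
# "system.access" is the documented default; cloned names are placeholders.

def lineage_views(catalog_type: str, cloned_catalog: str = "", cloned_schema: str = "") -> list[str]:
    if catalog_type == "Default":
        catalog, schema = "system", "access"
    elif catalog_type == "Cloned_catalog":
        catalog, schema = cloned_catalog, cloned_schema
    else:
        raise ValueError(f"unknown extraction catalog type: {catalog_type}")
    return [f"{catalog}.{schema}.{view}" for view in ("table_lineage", "column_lineage")]

print(lineage_views("Default"))
# ['system.access.table_lineage', 'system.access.column_lineage']
```

Whichever option you pick, the `atlan-user` needs `SELECT` on the resulting views, so running a quick test query against each name is a useful preflight check.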
(Optional) Configure the usage extractor
Atlan extracts usage and popularity metrics from query history or system tables.
This feature is currently limited to queries on SQL warehouses; queries on interactive clusters are not supported. Additionally, expensive queries and compute costs for Databricks assets are currently unavailable due to limitations of the Databricks APIs.
To configure the Databricks usage and popularity extractor:
- For Fetch Query History and Calculate Popularity, click Yes to retrieve usage and popularity metrics for your Databricks assets.
- For Popularity Extraction Method, choose one of the following methods to extract usage and popularity metrics:
- Click REST API to extract usage and popularity metrics from query history.
- Click System table to extract metrics directly from system tables:
  - For Extraction catalog type for popularity, choose where to fetch popularity data from:
    - Default: uses the `system` catalog and `query` schema to fetch popularity metrics.
    - Cloned_catalog: select this option to fetch popularity metrics from cloned views in a separate catalog and schema. Before proceeding, make sure the following prerequisites are met:
      - The `query_history` view must exist in the provided schema.
      - The `atlan-user` must have `SELECT` permission on the view.
      Then, provide values for the following fields:
      - Cloned Catalog Name: the catalog that contains the `query_history` view.
      - Cloned Schema Name: the schema that contains the `query_history` view.
      For more information, see Create cloned views of system tables.
  - For SQL Warehouse ID, enter the ID you copied from your SQL warehouse.
- Configure the usage extractor:
- For Popularity Window (days), set the number of days over which to calculate popularity metrics, up to a maximum of 30 days.
- For Start time, choose the earliest date from which to mine query history. If you're using the offline extraction method to extract query history from Databricks, skip to the next step.
- For Excluded Users, type the names of users to be excluded while calculating usage metrics for Databricks assets. Press `enter` after each name to add more names.
If running the miner for the first time, Atlan recommends setting a start date around three days prior to the current date and then scheduling it daily to build up to two weeks of query history. Mining two weeks of query history on the first miner run may cause delays. For all subsequent runs, Atlan requires a minimum lag of 24 to 48 hours to capture all the relevant transformations that were part of a session. Learn more about the miner logic here.
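The recommendation above can be sketched as a small helper that computes a first-run start time roughly three days back. The epoch-seconds output format is an assumption for illustration; check what format the Start time field actually expects in your workflow setup.

```python
from datetime import datetime, timedelta, timezone

# Sketch: pick a first-run start time about three days before "now",
# per the recommendation above. Epoch-seconds output is an assumption.

def recommended_start(days_back: int = 3, now=None) -> int:
    now = now or datetime.now(timezone.utc)
    return int((now - timedelta(days=days_back)).timestamp())

# A fixed timestamp makes the example deterministic.
fixed_now = datetime(2024, 5, 10, tzinfo=timezone.utc)
print(recommended_start(now=fixed_now))  # 1715040000, i.e. 2024-05-07 00:00 UTC
```

Keeping the first window short and letting the daily schedule accumulate history follows the guidance above and avoids the delays that mining two weeks of query history in one run can cause.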
Run the extractor
To run the Databricks lineage and popularity extractor, after completing the steps above:
- To check for any permissions or other configuration issues before running the crawler, click Preflight checks. This is currently only supported when using REST API and offline extraction methods. If you're using system tables, skip to step 2.
- You can either:
- To run the crawler once immediately, at the bottom of the screen, click the Run button.
- To schedule the crawler to run hourly, daily, weekly, or monthly, at the bottom of the screen, click the Schedule Run button.
Once the extractor has completed running, you will see lineage for Databricks assets! 🎉