Extract lineage and usage from Databricks

Retrieve lineage from Unity Catalog and usage and popularity metrics from system tables.

Deprecated

The REST API extraction method for lineage and popularity is deprecated. Use the system tables method instead, which provides more reliable and comprehensive extraction. The REST API extraction method is scheduled for removal in a future release.

Prerequisites

Before you begin, make sure you have met the prerequisites for this workflow.

Create extraction workflow

To extract lineage and usage from Databricks:

  1. In the top right of any screen, navigate to New and then click New Workflow.
  2. From the filters along the top, click Miner.
  3. From the list of packages, select Databricks Miner and click Setup Workflow.

Configure lineage extraction

Atlan connects to your database and queries system tables to extract lineage directly.

  1. For Connection, select the connection to extract. The crawler must have already run for you to select a connection.

  2. For Extraction Catalog Type, choose one of the following:

    • Default: Select to fetch lineage from the system catalog and access schema.

    • Cloned_catalog: Select to fetch lineage from a cloned catalog and schema. Before proceeding, make sure the following prerequisites are met:

      • You have already created cloned views named column_lineage and table_lineage in your schema. If not, follow the steps in Create cloned views of system tables.
      • The atlan-user must have SELECT permissions on both views to access lineage data.

      Then, provide values for the following fields:

      • Cloned Catalog Name: Catalog containing the cloned views.
      • Cloned Schema Name: Schema containing the cloned views.
  3. For SQL Warehouse ID, enter the ID you copied from your SQL warehouse.

  4. If you want to enable lineage tracking at the file path level for volumes or external locations, enable File Path Lineage.

  5. Click Next.
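
If you plan to use the Cloned_catalog option in step 2, the required cloned views can be sketched roughly as follows. This is a minimal illustration, not the authoritative procedure (see Create cloned views of system tables for that); the `cloned_cat` and `cloned_sch` names are placeholders for your own catalog and schema.

```sql
-- Placeholder names: replace cloned_cat / cloned_sch with your own catalog and schema.
CREATE VIEW IF NOT EXISTS cloned_cat.cloned_sch.table_lineage AS
SELECT * FROM system.access.table_lineage;

CREATE VIEW IF NOT EXISTS cloned_cat.cloned_sch.column_lineage AS
SELECT * FROM system.access.column_lineage;

-- The atlan-user needs SELECT on both views to read lineage data.
GRANT SELECT ON VIEW cloned_cat.cloned_sch.table_lineage TO `atlan-user`;
GRANT SELECT ON VIEW cloned_cat.cloned_sch.column_lineage TO `atlan-user`;
```

In practice you may want to add filters (for example, on event time) when cloning, which is one reason to use cloned views instead of the system catalog directly.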

Configure usage extraction

Important!

Query history is currently in public preview for Databricks.

Atlan extracts usage and popularity metrics from query history or system tables. This feature is currently limited to queries on SQL warehouses; queries on interactive clusters aren't supported. Additionally, expensive queries and compute costs for Databricks assets are currently unavailable due to limitations of the Databricks APIs.

  1. For Fetch Query History and Calculate Popularity, click Yes to retrieve usage and popularity metrics for your Databricks assets. To skip this step, click No and proceed to Run extractor.
  2. For Popularity Extraction Method, choose one of the following:
    • REST API (Deprecated): Extract usage and popularity metrics from query history. This method is deprecated; use system tables instead.
    • System table (Recommended): Extract metrics directly from system tables. Then configure:
      • Extraction catalog type for popularity: Choose where to fetch popularity data from:
        • Default: Uses the system catalog and query schema to fetch popularity metrics.
        • Cloned_catalog: Select to fetch popularity from cloned views in a separate catalog and schema. Before proceeding, make sure the following prerequisites are met:
          • You have already created a cloned view named query_history in your schema. If not, follow the steps in Create cloned views of system tables.
          • The atlan-user must have SELECT permission on the view to access popularity data.

          Then, provide values for the following fields:

          • Cloned Catalog Name: Catalog containing the query_history view.
          • Cloned Schema Name: Schema containing the query_history view.
      • SQL Warehouse ID: Enter the ID you copied from your SQL warehouse.
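
For the Cloned_catalog popularity option, the `query_history` view can be sketched in the same way as the lineage views. Again, `cloned_cat` and `cloned_sch` are placeholder names, and the authoritative steps are in Create cloned views of system tables.

```sql
-- Placeholder names: replace cloned_cat / cloned_sch with your own catalog and schema.
CREATE VIEW IF NOT EXISTS cloned_cat.cloned_sch.query_history AS
SELECT * FROM system.query.history;

-- The atlan-user needs SELECT on the view to read popularity data.
GRANT SELECT ON VIEW cloned_cat.cloned_sch.query_history TO `atlan-user`;
```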

Configure usage settings

  1. For Popularity Window (days), enter the number of days to include. The maximum limit is 30 days.
  2. For Start time, choose the earliest date from which to mine query history. If you're using the offline extraction method to extract query history from Databricks, skip this field. If running the miner for the first time, Atlan recommends setting a start date around three days prior to the current date and then scheduling it daily to build up to two weeks of query history. Mining two weeks of query history on the first miner run may cause delays. For all subsequent runs, Atlan requires a minimum lag of 24 to 48 hours to capture all the relevant transformations that were part of a session. Learn more about the miner logic at Troubleshooting usage and popularity metrics.
  3. For Excluded Users, type the names of users to exclude when calculating usage metrics for Databricks assets. Press Enter after each name to add it to the list.
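
Conceptually, the popularity settings above act as a filter over query history. The following is a hedged illustration of that filter, not the exact query Atlan runs; the column names follow the `system.query.history` schema, and the excluded user is a hypothetical example.

```sql
-- Illustration only: Atlan's actual mining query differs.
SELECT executed_by, statement_text, start_time
FROM system.query.history
WHERE start_time >= date_sub(current_date(), 30)          -- Popularity Window (days), max 30
  AND executed_by NOT IN ('service-account@example.com'); -- Excluded Users
```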

Run extractor

  1. To check for any permissions or other configuration issues before running the extractor, click Preflight checks. Preflight checks are currently supported only for the REST API and offline extraction methods. If you're using system tables, skip to step 2.
  2. Choose how to run the extractor:
    • To run the extractor once immediately, at the bottom of the screen, click Run.
    • To schedule the extractor to run hourly, daily, weekly, or monthly, at the bottom of the screen, click Schedule Run.

Once the extractor completes, you can view lineage for Databricks assets.

See also