
Extract lineage and usage from Databricks

Retrieve lineage from Unity Catalog, and retrieve usage and popularity metrics from either query history or system tables. Both Atlan and Databricks strongly recommend the system tables method.

Prerequisites

Before you begin, make sure you have:

Create extraction workflow

To extract lineage and usage from Databricks:

  1. In the top right of any screen, navigate to New and then click New Workflow.
  2. From the filters along the top, click Miner.
  3. From the list of packages, select Databricks Miner and click Setup Workflow.

Configure lineage extraction

Atlan connects to your database and extracts lineage directly.

  1. For Connection, select the connection to extract. The crawler must have already run for you to select a connection.
  2. Click Next.

Configure usage extraction

Important!

Query history is currently in public preview for Databricks.

Atlan extracts usage and popularity metrics from query history or system tables. This feature is currently limited to queries on SQL warehouses; queries on interactive clusters aren't supported. Additionally, expensive queries and compute costs for Databricks assets are currently unavailable due to limitations of the Databricks APIs.

  1. For Fetch Query History and Calculate Popularity, click Yes to retrieve usage and popularity metrics for your Databricks assets. To skip this step, click No and proceed to Run extractor.
  2. For Popularity Extraction Method, choose one of the following:
    • REST API: Extract usage and popularity metrics from query history.
    • System table: Extract metrics directly from system tables. Then configure:
      • Extraction catalog type for popularity: Choose where to fetch popularity data from:
        • Default: Uses the system catalog and query schema to fetch popularity metrics.
        • Cloned_catalog: Fetches popularity metrics from cloned views in a separate catalog and schema. Before selecting this option, make sure that:
          • The query_history view exists in the schema you provide.
          • The atlan-user has SELECT permission on the view.
          Then provide:
          • Cloned Catalog Name: The catalog that contains the query_history view.
          • Cloned Schema Name: The schema that contains the query_history view.
          For more information, see Create cloned views of system tables.
      • SQL Warehouse ID: Enter the ID you copied from your SQL warehouse.
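
The two prerequisites for the Cloned_catalog option can be sketched as SQL statements. The helper below only builds the statements as strings and doesn't run anything against Databricks. Treat it as a rough illustration under several assumptions: the source table is taken to be system.query.history (this page says only "the system catalog and query schema"), the catalog and schema names atlan_clone and usage are hypothetical, and the GRANT form follows Unity Catalog conventions. The atlan-user may also need privileges on the enclosing catalog and schema. Follow Create cloned views of system tables for the authoritative steps.

```python
def cloned_view_statements(
    catalog: str, schema: str, principal: str = "atlan-user"
) -> list[str]:
    """Build (but do not execute) SQL for the cloned query_history view."""

    def quote(identifier: str) -> str:
        # Backtick-quote an identifier, doubling any embedded backticks.
        return "`" + identifier.replace("`", "``") + "`"

    view = f"{quote(catalog)}.{quote(schema)}.{quote('query_history')}"
    return [
        # Assumption: system.query.history is the system table being cloned.
        f"CREATE VIEW IF NOT EXISTS {view} AS SELECT * FROM system.query.history",
        # Assumption: Unity Catalog GRANT syntax for a view.
        f"GRANT SELECT ON VIEW {view} TO {quote(principal)}",
    ]


if __name__ == "__main__":
    # Hypothetical catalog and schema names, for illustration only.
    for statement in cloned_view_statements("atlan_clone", "usage"):
        print(statement + ";")
```

The catalog and schema passed to the helper are the same values you would then enter for Cloned Catalog Name and Cloned Schema Name.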

Configure usage settings

  1. For Popularity Window (days), enter the number of days to include. The maximum is 30 days.
  2. For Start time, choose the earliest date from which to mine query history. If you're using the offline extraction method to extract query history from Databricks, skip this field.
    • First run: Atlan recommends setting a start date around three days before the current date, then scheduling the miner daily to build up to two weeks of query history. Mining two weeks of query history on the first run may cause delays.
    • Subsequent runs: Atlan requires a minimum lag of 24 to 48 hours to capture all the relevant transformations that were part of a session.
    Learn more about the miner logic at Troubleshooting usage and popularity metrics.
  3. For Excluded Users, type the names of users to exclude when calculating usage metrics for Databricks assets. Press Enter after each name to add another.
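
As a quick way to sanity-check the values before entering them, the following sketch validates the popularity window against the 30-day maximum and computes a first-run start date roughly three days back. The function names are hypothetical, and the lower bound of one day is an assumption; only the 30-day maximum and the three-day recommendation come from the steps above.

```python
from datetime import date, timedelta

MAX_POPULARITY_WINDOW_DAYS = 30  # maximum stated in step 1
FIRST_RUN_LOOKBACK_DAYS = 3      # "around three days" recommendation in step 2


def validate_popularity_window(days: int) -> int:
    """Return days unchanged if it is within the allowed range."""
    # Assumption: at least one day is required; only the maximum is documented.
    if not 1 <= days <= MAX_POPULARITY_WINDOW_DAYS:
        raise ValueError(
            f"Popularity window must be between 1 and "
            f"{MAX_POPULARITY_WINDOW_DAYS} days, got {days}"
        )
    return days


def first_run_start_date(
    today: date, lookback_days: int = FIRST_RUN_LOOKBACK_DAYS
) -> date:
    """Suggest a start date for the first miner run."""
    return today - timedelta(days=lookback_days)


if __name__ == "__main__":
    print(validate_popularity_window(30))
    print(first_run_start_date(date.today()).isoformat())
```

Starting a few days back and scheduling daily runs lets the history accumulate gradually, which avoids the delays that mining two weeks of history in a single first run may cause.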

Run extractor

  1. To check for permissions or other configuration issues before running the extractor, click Preflight checks. Preflight checks are currently supported only for the REST API and offline extraction methods. If you're using system tables, skip to step 2.
  2. Choose how to run the extractor:
    • To run the extractor once immediately, at the bottom of the screen, click Run.
    • To schedule the extractor to run hourly, daily, weekly, or monthly, at the bottom of the screen, click Schedule Run.

Once the extractor completes, you can view lineage for Databricks assets.
