Extract on-premises Databricks lineage
Extract lineage from your on-premises Databricks instances using the databricks-extractor tool, upload the results to S3, and import them into Atlan.
Prerequisites
Before you begin, make sure you have:
- Set up the databricks-extractor tool
- Access to the server with Docker Compose installed
- Access to an S3 bucket for uploading the extracted metadata:
  - Use the same S3 bucket that Atlan uses to avoid access issues. Reach out to your Data Success Manager to get the details of your Atlan bucket.
  - Alternatively, refer to Create your own S3 bucket in the dbt documentation to create your own.
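Before running the extractor, you can optionally confirm that Docker Compose is available and that your credentials can reach the bucket. A minimal check, assuming the AWS CLI is configured and using my-bucket as a placeholder bucket name:

# confirm Docker Compose is installed
docker-compose --version
# confirm you can list the S3 bucket (replace my-bucket with your bucket name)
aws s3 ls s3://my-bucket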
Run Databricks extractor tool
To extract lineage for a specific Databricks connection:
- Log into the server with Docker Compose installed.
- Change to the directory containing the compose file.
- Run Docker Compose with the connection name:

  sudo docker-compose up <connection-name>

  Replace `<connection-name>` with the name of the connection from the `services` section of the compose file.
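If you're not sure which connection names are defined, you can list the services in the compose file (an optional check using the standard Docker Compose CLI):

sudo docker-compose config --services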
The tool generates folders with JSON files for each service, including `extracted-lineage` and `extracted-query-history` (if `EXTRACT_QUERY_HISTORY` is set to `true`). You can optionally inspect the lineage and usage metadata to verify it before providing it to Atlan.
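To spot-check the output before uploading, you can list the generated files and preview one of them. A quick example, assuming the connection produced the output/databricks-lineage-example folder used elsewhere in this guide:

# list all generated files for the connection
ls -R output/databricks-lineage-example
# preview the first extracted lineage file
head output/databricks-lineage-example/extracted-lineage/result-0.json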
Upload generated files to S3
To provide Atlan access to the extracted lineage and usage metadata:
- Make sure that all files for a particular connection have the same prefix. This ensures Atlan can identify all related files. For example, `output/databricks-lineage-example/extracted-lineage/result-0.json` and `output/databricks-lineage-example/extracted-query-history/result-0.json`.
- Upload the files to the S3 bucket using your preferred method. For example, to upload all files using the AWS CLI:
aws s3 cp output/databricks-lineage-example s3://my-bucket/metadata/databricks-lineage-example --recursive
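To confirm the upload completed, you can list the objects under the same prefix (using the illustrative bucket and prefix from the command above):

aws s3 ls s3://my-bucket/metadata/databricks-lineage-example/ --recursive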
Next steps
- Extract lineage and usage from Databricks: Select Offline for the extraction method to import the lineage you extracted on-premises.