Extract on-premises Databricks lineage
Extract lineage from your on-premises Databricks instances using the databricks-extractor tool, upload the results to S3, and import them into Atlan.
Prerequisites
Before you begin, make sure you have:
- Set up the databricks-extractor tool
- Access to the server with Docker Compose installed
- Access to an S3 bucket for uploading the extracted metadata:
  - Use the same S3 bucket that Atlan uses to avoid access issues. Reach out to your Data Success Manager to get the details of your Atlan bucket.
  - Alternatively, refer to Create your own S3 bucket in the dbt documentation to create your own.
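Before running the extractor, you can optionally confirm that Docker Compose is available and that your credentials can reach the bucket. A minimal check, assuming the AWS CLI is configured and using my-bucket as a placeholder bucket name:

# confirm Docker Compose is installed
docker-compose --version
# confirm you can list the S3 bucket (replace my-bucket with your bucket name)
aws s3 ls s3://my-bucket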
Run Databricks extractor tool
To extract lineage for a specific Databricks connection:
- Log into the server with Docker Compose installed.
- Change to the directory containing the compose file.
- Run Docker Compose with the connection name:

  sudo docker-compose up <connection-name>

  Replace `<connection-name>` with the name of the connection from the `services` section of the compose file.
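If you're not sure which connection names are defined, you can list the services in the compose file (an optional check using the standard Docker Compose CLI):

sudo docker-compose config --services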
The tool generates folders with JSON files for each service, including `extracted-lineage` and `extracted-query-history` (if `EXTRACT_QUERY_HISTORY` is set to `true`). You can optionally inspect the lineage and usage metadata to verify it before providing it to Atlan.
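To spot-check the output before uploading, you can list the generated files and preview one of them. A quick example, assuming the connection produced the output/databricks-lineage-example folder used elsewhere in this guide:

# list all generated files for the connection
ls -R output/databricks-lineage-example
# preview the first extracted lineage file
head output/databricks-lineage-example/extracted-lineage/result-0.json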
Upload generated files to S3
To provide Atlan access to the extracted lineage and usage metadata:
- Make sure that all files for a particular connection have the same prefix. This ensures Atlan can identify all related files. For example, `output/databricks-lineage-example/extracted-lineage/result-0.json` and `output/databricks-lineage-example/extracted-query-history/result-0.json`.
- Upload the files to the S3 bucket using your preferred method. For example, to upload all files using the AWS CLI:
aws s3 cp output/databricks-lineage-example s3://my-bucket/metadata/databricks-lineage-example --recursive
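To confirm the upload completed, you can list the objects under the same prefix (using the illustrative bucket and prefix from the command above):

aws s3 ls s3://my-bucket/metadata/databricks-lineage-example/ --recursive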
Next steps
- Extract lineage and usage from Databricks: Select Offline for the extraction method to import the lineage you extracted on-premises.