
Troubleshooting Databricks connectivity

Did you know?

The documentation below refers to both SQL endpoints and interactive clusters as the compute engine.

Does Atlan consider expensive queries and compute costs?

No, Atlan doesn't factor in expensive queries or compute costs due to limitations in the Databricks APIs, which don't expose this information.

How does Atlan calculate popularity for Databricks assets?

Atlan calculates popularity for tables, views, and columns in Databricks by analyzing query execution data. It retrieves query history from the system.query.history table and specifically filters for execution_status = 'FINISHED' and statement_type = 'SELECT' to determine how frequently assets are accessed.
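The filtering step can be sketched as follows. This is a simplified in-memory illustration, not Atlan's implementation: the record dictionaries and the `tables` field are hypothetical stand-ins for columns in system.query.history.

```python
from collections import Counter

def popularity_counts(query_history):
    """Count queries per table, applying the same filters described above:
    execution_status = 'FINISHED' and statement_type = 'SELECT'."""
    counts = Counter()
    for record in query_history:
        if record["execution_status"] != "FINISHED":
            continue
        if record["statement_type"] != "SELECT":
            continue
        # `tables` is a hypothetical field listing the tables a query touched.
        for table in record["tables"]:
            counts[table] += 1
    return counts

# Example: two finished SELECTs count; a failed query and an INSERT don't.
history = [
    {"execution_status": "FINISHED", "statement_type": "SELECT", "tables": ["main.analytics.sales"]},
    {"execution_status": "FINISHED", "statement_type": "SELECT", "tables": ["main.analytics.sales"]},
    {"execution_status": "FAILED", "statement_type": "SELECT", "tables": ["main.analytics.sales"]},
    {"execution_status": "FINISHED", "statement_type": "INSERT", "tables": ["main.analytics.orders"]},
]
print(popularity_counts(history)["main.analytics.sales"])  # → 2
```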

How do I debug test authentication and preflight check errors?

Hostname resolution error

Provided Host name cannot be resolved via DNS, please check and try again.

  • The hostname you have provided can't be resolved through DNS. Check that the hostname is correct.
  • Verify that the DNS settings have been configured properly.
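As a quick local check, you can verify that the workspace hostname resolves before retrying in Atlan. The hostname formats in the comment are illustrative examples:

```python
import socket

def resolves(hostname):
    """Return True if the hostname resolves via DNS, False otherwise."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False

# Workspace hostnames typically look like:
#   <deployment>.cloud.databricks.com           (AWS)
#   adb-<workspace-id>.<n>.azuredatabricks.net  (Azure)
print(resolves("localhost"))
```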

Invalid client ID or secret

Provided Client ID is invalid, please check and try again.

  • The client ID or secret you have provided is either invalid or no longer working. Follow the steps for AWS or Azure setup to generate new credentials.

Invalid tenant ID

Provided tenant ID is invalid, please check and try again.

  • The tenant ID you have provided is incorrect.
  • Ensure that the tenant ID you have provided corresponds to the one in your Microsoft Entra ID application.

Unity Catalog not linked

Configured Databricks instance doesn't have Unity Catalog linked. Please choose JDBC extraction instead of REST API in Atlan.

Connection timeout

Failed to connect to Databricks (connection timed out). Please check your host and port and try again.

  • The connection to the Databricks instance has timed out.
  • Verify that the host and port are correct.
  • Check that no firewall rules or network issues are blocking the connection.
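A minimal TCP reachability check can help distinguish a firewall or network issue from a credentials problem. This sketch assumes the workspace is served over HTTPS on port 443; the hostname in the comment is hypothetical:

```python
import socket

def port_reachable(host, port, timeout=5.0):
    """Attempt a TCP connection to host:port within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections, and DNS failures
        return False

# Example (hypothetical hostname):
# print(port_reachable("my-workspace.cloud.databricks.com", 443))
```

If this returns False from the network where Atlan runs, the problem is connectivity (firewall rules, private networking) rather than authentication.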

Invalid HTTP path

Provided HTTP path is invalid, please check and try again.

  • The HTTP path you have provided is invalid.
  • Ensure that the endpoint is properly configured and accessible, and the warehouse ID in the HTTP path is correct.
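A rough format check can catch obvious mistakes before retrying. This sketch assumes SQL warehouse HTTP paths of the form /sql/1.0/warehouses/<16-hex-character warehouse ID>; adjust the pattern if your path uses a different form (for example, a cluster path):

```python
import re

# Assumed SQL warehouse HTTP path format: /sql/1.0/warehouses/<warehouse-id>
WAREHOUSE_PATH = re.compile(r"^/sql/1\.0/warehouses/[0-9a-f]{16}$")

def looks_like_warehouse_path(http_path):
    """Sanity-check the shape of a SQL warehouse HTTP path."""
    return bool(WAREHOUSE_PATH.match(http_path))

print(looks_like_warehouse_path("/sql/1.0/warehouses/1234567890abcdef"))  # well-formed
print(looks_like_warehouse_path("/sql/1.0/endpoints/1234567890abcdef"))   # wrong segment
```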

Invalid personal access token

PAT token is invalid, please check and try again.

  • The personal access token used for authentication is invalid.
  • Ensure that the token is valid and neither deleted nor expired.
  • You can also generate a new personal access token, if needed.

Insufficient permissions for crawling metadata

User doesn't have access to any schemas / dbs, please check the accesses provided to the atlan user and try again.

  • Check that the service principal, or the user whose PAT token is being used, has the necessary permissions. Refer to the setup documentation for the permissions required by each authentication type.

Insufficient permissions for some of the metadata included for crawling

Warning, user doesn't have access to the following objects anymore, or the objects no longer exist on the source!, check failed for ...

  • The user doesn't have access to one or more database objects (such as catalogs or schemas) in the include filter.
  • Remove these objects from the include filter if they no longer exist on the source.
  • Otherwise, check that the service principal, or the user whose PAT token is being used, has the necessary permissions. Refer to the setup documentation for the permissions required by each authentication type.

Insufficient permissions to crawl tags

User doesn't have access to the following system tables

  • Check that you have sufficient permissions provided for the tags extraction.

User doesn't have permission to access warehouses

please check your credentials and warehouse access

  • Check that the configured user / service principal has CAN_USE on the configured SQL warehouse.

Unable to access query history

System table extraction checks fail with: User doesn't have access to the following system tables

  • Check that the user has access to the relevant system tables (for example, system.query.history).

General connection failure

Unable to connect to the configured Databricks instance, please check your credentials and configs and then try again. If the problem persists, contact [email protected].

  • Check that you have entered the host and port correctly.
  • Verify that the credentials for the connection are correct.
  • Check that your Databricks instance is properly configured and available.
  • If the problem still persists after verifying all of the previous steps, contact Atlan support.

Why does the workflow take longer than usual in the extraction step?

  • Certain Databricks runtime versions don't have an easy way to extract some metadata (for example partitioning, table_type, and format). Extra operations must be performed to retrieve these, resulting in slower performance.
  • If you aren't already, you may want to try the Unity Catalog extraction method.

Why is some metadata missing?

  • When using incremental extraction, consider running a one-time full extraction to capture any newly introduced metadata.

  • Currently, some metadata can't be extracted from Databricks:

    Depending on the extraction method (JDBC, REST API, or system tables), the following can't be extracted:

      • ViewCount and TableCount (on schemas)
      • RowCount (on tables and views)
      • TABLE_KIND (on tables and views)
      • PARTITION_STRATEGY (on tables and views)
      • CONSTRAINT_TYPE (on columns)
      • Partition key (on columns)
      • Table partitioning information
      • BYTES, SIZEINBYTES (table size)
  • The team is exploring ways to bring this metadata into Atlan if Databricks supports extraction of the metadata.

Why doesn't my SQL work when querying Databricks?

Can I use Atlan when the Databricks compute engine isn't running?

  • Atlan needs the Databricks compute engine to be running for two activities:
    • Crawling assets (normal and scheduled run)
    • Querying assets (including data previews)
  • If you don't need to perform the activities listed, your experience shouldn't be affected.
  • In any other case, you'll get a downgraded experience on Atlan if the compute engine isn't running. Queries won't work as expected and a scheduled workflow might fail after a couple of retries.
  • The team recommends turning off the Terminate after x minutes of inactivity option in your cluster to avoid these problems. If you have this turned on, any of the listed activities triggers the cluster to come back online within about 30 seconds.

Why can't I see all the assets on Atlan that are available in Databricks?

Why is the test authentication taking so long?

  • Check the state of the compute engine. It must be in a running state for all operations, including authentication.

What limitations are there with the REST API (Unity Catalog) extraction method?

  • Currently, schema-level filtering and retrieving table partitioning information aren't supported.

Why has my workflow started to fail when it worked before?

How do I migrate to Unity Catalog?

  • Currently, Unity Catalog is in public preview.
  • The Databricks team is working on an automated migration to Unity Catalog.
  • Currently you must migrate individual tables manually.

Why are some notebooks missing from metadata extraction?

Notebooks stored inside hidden directories (names starting with "." such as .hidden_dir/) are generally not returned by the /api/2.0/workspace/list API endpoint. This may cause missing notebook details in Atlan.
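The API's behaviour can be mimicked with a small sketch. The workspace tree here is a hypothetical in-memory stand-in for what /api/2.0/workspace/list would traverse, not an actual API call:

```python
def visible_notebooks(workspace_tree, prefix=""):
    """Recursively collect notebook paths, skipping hidden directories
    (names starting with '.'), mirroring the list API's behaviour."""
    found = []
    for name, child in workspace_tree.items():
        if name.startswith("."):
            continue  # hidden directories are not returned by the list endpoint
        path = f"{prefix}/{name}"
        if isinstance(child, dict):  # a directory: recurse into it
            found.extend(visible_notebooks(child, path))
        else:  # a notebook leaf
            found.append(path)
    return found

# Hypothetical workspace layout: one visible and one hidden directory.
tree = {
    "etl": {"daily_load": "NOTEBOOK"},
    ".hidden_dir": {"scratch": "NOTEBOOK"},
}
print(visible_notebooks(tree))  # → ['/etl/daily_load']
```

The notebook under .hidden_dir/ never appears in the result, which is why such notebooks are missing from Atlan's extraction.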

Why is metadata missing for some Databricks entities?

The Databricks APIs used provide data only within a single configured workspace. If an entity used in lineage creation exists outside this workspace, its details won't be available via these APIs.