Set up on-premises Databricks access

Who can do this?

You will need access to a machine that can run Docker on-premises. You will also need your Databricks instance details, including credentials.

In some cases you will not be able to expose your Databricks instance for Atlan to crawl and ingest metadata. For example, this may happen when security requirements restrict access to sensitive, mission-critical data.

In such cases you may want to decouple the extraction of metadata from its ingestion in Atlan. This approach gives you full control over your resources and metadata transfer to Atlan.

Prerequisites​

To extract metadata from your on-premises Databricks instance, you will need to use Atlan's databricks-extractor tool.

Did you know?

Atlan uses exactly the same databricks-extractor behind the scenes when it connects to Databricks in the cloud.

Install Docker Compose​

Docker Compose is a tool for defining and running applications composed of many Docker containers. (Any guesses where the name came from? πŸ˜‰)

To install Docker Compose:

  1. Install Docker
  2. Install Docker Compose
Did you know?

The instructions in this documentation should be enough even if you are completely new to Docker and Docker Compose. However, you can also walk through the Get started with Docker Compose tutorial if you want to learn the basics first.
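To confirm the installation, you can check the versions of both tools (the exact output will vary by machine):

    sudo docker --version
    sudo docker compose version

If the second command fails, you may have the older standalone binary; try docker-compose --version instead.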

Get the databricks-extractor tool​

To get the databricks-extractor tool:

  1. Raise a support ticket to get the link to the latest version.

  2. Download the image using the link provided by support.

  3. Load the image to the server you'll use to crawl Databricks:

    sudo docker load -i /path/to/databricks-extractor-master.tar
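You can confirm the image loaded successfully by listing local images. The image name below is an assumption based on the archive name; check the output of the load command for the exact repository and tag:

    sudo docker image ls | grep databricks-extractor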

Get the compose file​

Atlan provides you with a Docker compose file for the databricks-extractor tool.

To get the compose file:

  1. Download the latest compose file.
  2. Save the file to an empty directory on the server you'll use to access your on-premises Databricks instance.
  3. Keep the file name as docker-compose.yaml, since Docker Compose looks for this file name by default.

Define Databricks connections​

The compose file has three main sections, sketched below:

  • x-templates contains configuration fragments. Ignore this section and do not make any changes to it.
  • services is where you will define your Databricks connections.
  • volumes contains mount information. Ignore this section as well and do not make any changes to it.
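As a rough sketch, the top level of the compose file looks like this; only the entries under services need your input:

x-templates:
  # configuration fragments provided by Atlan - leave unchanged
  # ...

services:
  # your Databricks connection entries go here (see the next section)
  # ...

volumes:
  # mount information - leave unchanged
  # ...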

Define services​

For each on-premises Databricks instance, define an entry under services in the compose file.

Each entry will have the following structure:

services:
  connection-name:
    <<: *extract
    environment:
      <<: *databricks-defaults
      INCLUDE_FILTER: '{"DB_1": [], "DB_2": ["SCHEMA_1", "SCHEMA_2"]}'
      EXCLUDE_FILTER: '{"DB_1": ["SCHEMA_1", "SCHEMA_2"]}'
      TEMP_TABLE_REGEX: '.*temp.*|.*tmp.*|.*TEMP.*|.*TMP.*'
      SYSTEM_SCHEMA_REGEX: '^information_schema$'
    volumes:
      - ./output/connection-name:/output
  • Replace connection-name with the name of your connection.
  • <<: *extract tells the databricks-extractor tool to run.
  • environment contains all parameters for the tool.
    • INCLUDE_FILTER - specify the databases and schemas from which you want to extract metadata. Remove this line if you want to extract metadata from all databases and schemas.
    • EXCLUDE_FILTER - specify the databases and schemas you want to exclude from metadata extraction. This will take precedence over INCLUDE_FILTER. Remove this line if you do not want to exclude any databases or schemas.
    • TEMP_TABLE_REGEX - specify a regular expression for excluding temporary tables. Remove this line if you do not want to exclude any temporary tables.
    • SYSTEM_SCHEMA_REGEX - specify a regular expression for excluding system schemas. If unspecified, INFORMATION_SCHEMA will be excluded from the extracted metadata by default.
  • volumes specifies where to store results. In this example, the extractor will store results in the ./output/connection-name folder on the local file system.

You can add as many Databricks connections as you want.
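For example, a compose file with two connections might contain the following entries under services (connection names, filters, and output paths here are purely illustrative):

services:
  databricks-finance:
    <<: *extract
    environment:
      <<: *databricks-defaults
      INCLUDE_FILTER: '{"FINANCE_DB": []}'
    volumes:
      - ./output/databricks-finance:/output
  databricks-analytics:
    <<: *extract
    environment:
      <<: *databricks-defaults
      EXCLUDE_FILTER: '{"ANALYTICS_DB": ["SCRATCH"]}'
    volumes:
      - ./output/databricks-analytics:/output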

Did you know?

Docker's documentation describes the services format in more detail.

Provide credentials​

To define the credentials for your Databricks connections, you will need to provide a Databricks configuration file.

The Databricks configuration is a .ini file with the following format:

[DatabricksConfig]
host = <host>
port = <port>
# seconds to wait for a response from the server
timeout = 300
# Databricks authentication type. Options: personal_access_token, aws_service_principal
auth_type = personal_access_token

# Required only if auth_type is personal_access_token.
[PersonalAccessTokenAuth]
personal_access_token = <personal_access_token>

# Required only if auth_type is aws_service_principal.
[AWSServicePrincipalAuth]
client_id = <client_id>
client_secret = <client_secret>
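For example, a configuration using a personal access token might look like the following; every value shown is a hypothetical placeholder:

[DatabricksConfig]
# hypothetical workspace host and port
host = mycompany.cloud.databricks.com
port = 443
# seconds to wait for a response from the server
timeout = 300
auth_type = personal_access_token

[PersonalAccessTokenAuth]
# placeholder token - replace with your own
personal_access_token = dapi0123456789abcdef0123456789abcdef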

Secure credentials​

Using local files​

danger

If you decide to keep Databricks credentials in plaintext files, we recommend you restrict access to the directory and the compose file. For extra security, we recommend you use Docker secrets to store these sensitive credentials.
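For example, assuming the compose file and databricks.ini sit in your working directory, you could tighten permissions along these lines (adjust owners and paths to your environment):

    chmod 700 .
    chmod 600 docker-compose.yaml databricks.ini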

To specify the local files in your compose file:

secrets:
  databricks_config:
    file: ./databricks.ini

danger

This secrets section is a top-level section, at the same level as the services section described earlier. It is not a sub-section of services.
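A file-based secret is referenced from a service in the same way as a Docker secret. A minimal sketch, reusing the connection-name entry from earlier:

secrets:
  databricks_config:
    file: ./databricks.ini

services:
  connection-name:
    <<: *extract
    environment:
      <<: *databricks-defaults
    volumes:
      - ./output/connection-name:/output
    secrets:
      - databricks_config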

Using Docker secrets​

To create and use Docker secrets:

  1. Store the Databricks configuration file:

    sudo docker secret create databricks_config path/to/databricks.ini
  2. At the top of your compose file, add a secrets element to access your secret:

    secrets:
      databricks_config:
        external: true
        name: databricks_config
    • The name should be the same one you used in the docker secret create command above.
    • Once stored as a Docker secret, you can remove the local Databricks configuration file.
  3. Within the service entry in the compose file, add a secrets element and list the name of the secret that service should use.
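After these steps you can confirm the secret exists. Note that docker secret requires the Docker Engine to be running in swarm mode; if it is not, you can enable it with sudo docker swarm init:

    sudo docker secret ls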

Example​

Let's walk through an example in detail:

secrets:
  databricks_config:
    external: true
    name: databricks_config

x-templates:
  # ...

services:
  databricks-example:
    <<: *extract
    environment:
      <<: *databricks-defaults
      INCLUDE_FILTER: '{"DB_1": [], "DB_2": ["SCHEMA_1", "SCHEMA_2"]}'
      EXCLUDE_FILTER: '{"DB_1": ["SCHEMA_1", "SCHEMA_2"]}'
      TEMP_TABLE_REGEX: '.*temp.*|.*tmp.*|.*TEMP.*|.*TMP.*'
      SYSTEM_SCHEMA_REGEX: '^information_schema$'
    volumes:
      - ./output/databricks-example:/output
    secrets:
      - databricks_config
  1. In this example, we've defined the secrets at the top of the file (you could also define them at the bottom). The databricks_config refers to an external Docker secret created using the docker secret create command.
  2. The name of this service is databricks-example. You can use any meaningful name you want.
  3. The <<: *databricks-defaults sets the connection type to Databricks.
  4. The ./output/databricks-example:/output line tells the extractor where to store results. In this example, the extractor will store results in the ./output/databricks-example directory on the local file system. We recommend you output the extracted metadata for different connections in separate directories.
  5. The secrets section within services tells the extractor which secrets to use for this service. Each of these refers to the name of a secret listed at the beginning of the compose file.
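With the connections and credentials defined, a typical way to start the extraction is through the standard Docker Compose CLI. This is a sketch using the example service name above; older installations may use docker-compose instead of docker compose:

    # run a single connection
    sudo docker compose up databricks-example

    # or run every connection defined under services
    sudo docker compose up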