Azure Data Lake Storage crawler

The Azure Data Lake Storage (ADLS) crawler fetches account, container, and object metadata from ADLS and publishes it to Atlan for discovery. It creates or reuses an ADLS connection, scans the configured storage accounts, and applies your filters so that only the relevant containers, folders, and objects appear in Atlan. This reference describes each workflow field that you configure in the crawler UI.

Access

The Azure Data Lake Storage crawler isn't enabled by default. To use it, contact Atlan support and request that it be enabled for your tenant. After it's enabled, only admins and users with workflow permissions can set up and run Azure Data Lake Storage crawlers.

Workflow name

Specifies a unique and descriptive name for this crawler configuration in Atlan. The name appears in the workflows list and run history so that you can distinguish it from other Azure or storage crawlers.

Example: azure-adls-production-catalog

Authentication

The crawler authenticates to Azure Data Lake Storage by using an Azure AD application and its service principal. The service principal must have sufficient permissions on all accounts and containers that you want to crawl. At a minimum, it needs permission to list storage accounts, containers, and blobs, and to read blob service properties and container metadata so that the crawler can discover and publish assets without errors.

The service principal must have the following permissions for the crawler to work correctly:

  • List storage accounts: Microsoft.Storage/storageAccounts/read
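  • Get blob service properties: Microsoft.Storage/storageAccounts/blobServices/read
  • List containers: Microsoft.Storage/storageAccounts/blobServices/containers/read
  • Get container properties: Microsoft.Storage/storageAccounts/blobServices/containers/read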
  • List blobs: Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read. Azure has no permission that lists blobs without also granting read access to blob content, so this permission is required.

You can create a custom role with the required permissions and assign it to the service principal:

{
  "properties": {
    "roleName": "<role-name>",
    "description": "<description>",
    "assignableScopes": [
      "/subscriptions/<id>"
    ],
    "permissions": [
      {
        "actions": [
          "Microsoft.Storage/storageAccounts/read",
          "Microsoft.Storage/storageAccounts/blobServices/read",
          "Microsoft.Storage/storageAccounts/blobServices/containers/read"
        ],
        "notActions": [],
        "dataActions": [
          "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read"
        ],
        "notDataActions": []
      }
    ]
  }
}
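
Before you assign a custom role, it can help to confirm that its definition covers every permission in the list above. The following Python sketch is illustrative only: `missing_permissions` and the `REQUIRED_*` constants are hypothetical names, the required sets are copied from the role definition shown above, and the check compares exact strings without expanding wildcards or accounting for `notActions`.

```python
import json

# Permissions copied from the role definition above.
REQUIRED_ACTIONS = {
    "Microsoft.Storage/storageAccounts/read",
    "Microsoft.Storage/storageAccounts/blobServices/read",
    "Microsoft.Storage/storageAccounts/blobServices/containers/read",
}
REQUIRED_DATA_ACTIONS = {
    "Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read",
}


def missing_permissions(role_json: str) -> dict:
    """Return the required actions and dataActions that the role lacks.

    Hypothetical helper: exact string comparison only, so a wildcard such as
    "Microsoft.Storage/*/read" is reported as missing coverage.
    """
    role = json.loads(role_json)
    granted_actions, granted_data = set(), set()
    for block in role["properties"]["permissions"]:
        granted_actions.update(block.get("actions", []))
        granted_data.update(block.get("dataActions", []))
    return {
        "actions": sorted(REQUIRED_ACTIONS - granted_actions),
        "dataActions": sorted(REQUIRED_DATA_ACTIONS - granted_data),
    }
```

An empty list under both keys suggests that the role definition grants everything the crawler needs.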

Azure client ID

The unique application (client) ID that Azure AD assigned to your app when you registered it.

  • Use the value from Azure portal → Microsoft Entra ID → App registrations → [your app] → Application (client) ID.
  • The value is a GUID: 32 hexadecimal digits in five hyphen-separated groups.

Azure client secret

The client secret that the crawler uses together with the client ID to obtain tokens from Azure AD.

  • Use a current, active client secret created for the app registration.
  • Store the secret securely and rotate it according to your security policies.

Network configuration

Use these properties to control whether the crawler connects to Azure Data Lake Storage over public endpoints or private network paths.

Use the public endpoint of each storage account to connect to ADLS over the internet.

  • Select this when storage accounts are available through their standard public endpoints.
  • Make sure outbound connectivity from Atlan to Azure storage endpoints is allowed.
  • Use this option when you don't use Private Link for the configured storage accounts.

Azure tenant ID

The unique identifier of the Azure Active Directory tenant where the app registration and storage accounts exist.

  • You can find the Azure tenant ID in Azure portal → Microsoft Entra ID → Overview → Tenant ID.
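
The client ID, client secret, and tenant ID together identify the service principal. As a rough illustration of how such credentials are typically exchanged for a token, the Python sketch below builds (but doesn't send) an OAuth 2.0 client-credentials request against the Microsoft identity platform token endpoint. It's an assumption that the crawler authenticates this way internally; `build_token_request` is a hypothetical helper, and the IDs and secret are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(tenant_id: str, client_id: str, client_secret: str) -> Request:
    """Build a client-credentials token request for Azure Storage.

    Hypothetical helper for illustration; the request is constructed only,
    never sent.
    """
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # Scope for Azure Storage data access.
        "scope": "https://storage.azure.com/.default",
    }).encode("ascii")
    return Request(url, data=body, method="POST")


# Placeholder values; substitute the ones from your app registration.
request = build_token_request(
    tenant_id="00000000-0000-0000-0000-000000000000",
    client_id="11111111-1111-1111-1111-111111111111",
    client_secret="<client-secret>",
)
```

A request like this fails if the secret has expired or the tenant ID doesn't match the app registration, which is a common reason for authentication errors.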

Storage account name list

Comma-separated list of Azure storage account names that the crawler must scan.

  • Use the account names exactly as they appear in Azure portal → Storage accounts.
  • When you use multiple private links, list the account names in the same order as the private links.

Example: adlsprod01, adlsprod02, adlsanalytics
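
Because this field is a comma-separated string, stray spaces or invalid names are easy to introduce. The following is a minimal Python sketch of how such a value could be split and sanity-checked before you paste it into the workflow. It assumes Azure's naming rule for storage accounts (3 to 24 characters, lowercase letters and digits only); `parse_account_names` is a hypothetical helper and not part of the crawler.

```python
import re

# Azure storage account names: 3-24 characters, lowercase letters and digits.
ACCOUNT_NAME = re.compile(r"[a-z0-9]{3,24}")


def parse_account_names(raw: str) -> list[str]:
    """Split a comma-separated list, trim whitespace, and validate each name.

    The input order is preserved, which matters when the list has to line up
    with a list of private links.
    """
    names = [part.strip() for part in raw.split(",") if part.strip()]
    invalid = [name for name in names if not ACCOUNT_NAME.fullmatch(name)]
    if invalid:
        raise ValueError(f"invalid storage account names: {invalid}")
    return names


accounts = parse_account_names("adlsprod01, adlsprod02, adlsanalytics")
# accounts == ["adlsprod01", "adlsprod02", "adlsanalytics"]
```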

Metadata filters

Use these fields to restrict which containers and objects the crawler publishes to Atlan. Except for Container prefix and Object prefix, filters are applied after extraction, so the service principal still needs access to all objects, even those that are later filtered out.

Container prefix

Limits ingestion to containers whose names start with the specified prefix.

  • Leave empty to include all containers in each storage account.
  • Use when containers are grouped by naming convention, such as raw-, staging-, or prod-.

Example: raw-

Object prefix

Limits ingestion to objects whose names start with the specified prefix.

  • Leave empty to include all objects.
  • Use when objects are organized under folder-style prefixes that you want to include as a group.

Example: landing/

Object regex

Regular expression used to match object names.

  • Use when you need more granular control than a simple prefix.
  • Leave empty to include all objects.

Example: .*\.parquet$
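
The prefix and regex filters can be pictured with a short Python sketch. It assumes that a prefix is a plain starts-with check and that the regex is searched against the full object name; the crawler's actual matching semantics aren't specified here, and `matches_name_filters` is a hypothetical helper.

```python
import re


def matches_name_filters(name: str, prefix: str = "", regex: str = "") -> bool:
    """Return True if an object name passes the prefix and regex filters.

    An empty filter matches everything, mirroring the "leave empty" behavior.
    """
    if prefix and not name.startswith(prefix):
        return False
    if regex and re.search(regex, name) is None:
        return False
    return True


objects = [
    "landing/2024/orders.parquet",
    "landing/2024/orders.csv",
    "archive/2023/orders.parquet",
]
selected = [
    name for name in objects
    if matches_name_filters(name, prefix="landing/", regex=r".*\.parquet$")
]
# selected == ["landing/2024/orders.parquet"]
```

Testing a pattern against a few real object names in this way, before running the workflow, can catch a regex that is broader or narrower than intended.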

Object create date (after)

Only publishes objects whose creation date is on or after the specified date.

  • Use this to start ingestion from a specific point in time.
  • Leave at the default value to avoid applying a lower bound filter.

Example: 1995-12-01

Object create date (before)

Only publishes objects whose creation date is on or before the specified date.

  • Use this to stop ingestion at a specific point in time.
  • Leave at the default value to avoid applying an upper bound filter.

Example: 2031-07-05

Object update date (after)

Only publishes objects whose last update date is on or after the specified date.

  • Use this when you want to ingest objects updated after a given date.
  • Leave at the default value to avoid applying a lower bound update filter.

Example: 1995-12-01

Object update date (before)

Only publishes objects whose last update date is on or before the specified date.

  • Use this when you want to ignore objects updated after a given date.
  • Leave at the default value to avoid applying an upper bound update filter.

Example: 2031-07-05
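
Taken together, the four date fields describe two inclusive ranges: one on the creation date and one on the last update date. The following Python sketch shows that logic under the assumption that dates are compared at day granularity; `within_bounds` is a hypothetical helper, and the sample dates are made up for illustration.

```python
from datetime import date
from typing import Optional


def within_bounds(value: date, after: Optional[date] = None,
                  before: Optional[date] = None) -> bool:
    """Inclusive range check: after <= value <= before. None means no bound."""
    if after is not None and value < after:
        return False
    if before is not None and value > before:
        return False
    return True


# Made-up object timestamps for illustration.
created = date(2024, 1, 15)
updated = date(2024, 3, 1)

keep = (
    within_bounds(created, after=date(2024, 1, 15), before=date(2031, 7, 5))
    and within_bounds(updated, after=date(1995, 12, 1), before=date(2024, 2, 28))
)
# keep is False: the creation date passes because the bounds are inclusive,
# but the object was updated after the upper update bound.
```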

Include folders

Specifies whether the crawler publishes folder-like paths as separate assets in Atlan.

  • Yes: Publishes folders as assets so that you can browse folder hierarchies in Atlan.
  • No: Publishes only objects.

Example: Yes

Filters applied after extraction

All filters except Container prefix and Object prefix are applied after the crawler extracts objects from storage. The service principal must still have access to all objects, even if some are later filtered out before publishing to Atlan.

Connection name

Defines the name of the connection that the crawler creates or reuses in Atlan for Azure Data Lake Storage assets.

  • The connection name must be unique across all Azure Data Lake Storage connections in your workspace.
  • If a connection with this name already exists, the crawler reuses it and skips creation. In the workflow UI, this field appears as a dropdown so that you can select an existing Azure Data Lake Storage connection. You can also provide a new connection name, which the crawler uses to create a new connection when the workflow runs.
