Crawl S3 assets

Catalog your Amazon S3 buckets and objects in Atlan using the S3 Assets workflow. This guide walks you through setting up authentication and running your first crawl.

Prerequisites

Before you begin, make sure you have:

Completed S3 setup with IAM credentials.
Your AWS credentials (Access Key ID and Secret Access Key, or Role ARN) ready.
Information about S3 buckets and prefixes you want to catalog.
Verified that the AWS account is allowlisted to assume the role, if you use IAM role-based authentication.
Set up the destination bucket structure, required only if you plan to use inventory-based ingestion.

Set up workflow

Create a new S3 Assets workflow:

In the top right, select New > New Workflow.
From the package list, select S3 Assets.
Select Setup Workflow.

Configure extraction method

Choose how to connect to your S3 environment:

Direct extraction
Agent extraction

Select Direct for the extraction method.
Choose your authentication type:
- IAM User: Enter your Access Key ID and Secret Access Key.
- IAM Role: Enter your Role ARN.
Select the AWS Region where your buckets are located.
Select Test Authentication to verify the connection.
Select Next.
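Before pasting a Role ARN into the form, a quick local format check can catch copy-paste mistakes such as supplying a user ARN or a bare role name. The following is a minimal sketch, not part of the Atlan workflow: the helper name `looks_like_role_arn` and the example role name are hypothetical, and the pattern only checks the general shape of an IAM role ARN. It does not confirm that the role exists or can be assumed; Test Authentication remains the real check.

```python
import re

# General shape of an IAM role ARN:
#   arn:<partition>:iam::<12-digit account ID>:role/<optional path/><role name>
# The character class is a permissive approximation (assumption), not a full validator.
ROLE_ARN_RE = re.compile(
    r"arn:(aws|aws-cn|aws-us-gov):iam::\d{12}:role/[\w+=,.@/-]+"
)


def looks_like_role_arn(value: str) -> bool:
    """Return True if the value has the general shape of an IAM role ARN."""
    return ROLE_ARN_RE.fullmatch(value.strip()) is not None


# Hypothetical values for illustration only.
print(looks_like_role_arn("arn:aws:iam::123456789012:role/atlan-s3-crawler"))
print(looks_like_role_arn("arn:aws:iam::123456789012:user/alice"))
```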

Choose ingestion method

Select your ingestion method:

Direct ingestion: Recommended for fewer than 1 million objects. This method crawls S3 buckets and objects directly.
Inventory ingestion: Recommended for large-scale use (more than 1 million objects). Uses inventory reports for efficiency.

For inventory ingestion, provide:

S3 Bucket Name: Bucket holding the inventory reports (without the s3:// prefix).
S3 Bucket Prefix: Prefix used in the report configuration. Include a trailing slash (/). Leave empty if no prefix was used.
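The two formatting rules above (no s3:// scheme on the bucket name, a trailing slash on a non-empty prefix) are easy to get wrong when copying values from the AWS console. As an illustration, here is a small sketch that normalizes both values before you enter them. The function name `normalize_inventory_inputs` and the sample values are hypothetical; Atlan's own handling of these fields may differ.

```python
from typing import Tuple


def normalize_inventory_inputs(bucket: str, prefix: str = "") -> Tuple[str, str]:
    """Return (bucket, prefix) in the form the inventory fields expect."""
    bucket = bucket.strip()
    # Drop the s3:// scheme if it was copied along with the bucket name.
    if bucket.startswith("s3://"):
        bucket = bucket[len("s3://"):]
    # A bucket name carries no trailing slash.
    bucket = bucket.rstrip("/")

    prefix = prefix.strip()
    # A non-empty prefix gets a trailing slash; an empty prefix stays empty.
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    return bucket, prefix


# Hypothetical values for illustration only.
print(normalize_inventory_inputs("s3://my-inventory-bucket", "reports"))
```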

Note: The region for the inventory report is taken from the credentials used in the extraction method.

Configure connection

Enter a Connection Name to identify your S3 environment.
Examples: production-s3, analytics-lake, raw-data-store
Assign Connection Admins to manage access. At least one admin is required.
Configure the scope of assets to ingest. You can ingest all three supported asset types (buckets, folders, and objects) or a subset of them. Bucket assets are always required, because they establish the relationships with the other two asset types.

Configure bucket filters

Choose which buckets and prefixes to include or exclude. Exclude filters override include filters if both match.

For a single bucket:

Include Bucket: Exact bucket name (for example, my-data-bucket)
Include Prefix: Specific prefix to crawl (for example, processed/2024/)
Leave all other filters empty.

For multiple buckets:

Include Bucket: Regex pattern (for example, prod-.* | analytics-.*)
Exclude Bucket: Regex pattern (for example, .*-temp | .*-backup)
Include Prefix: Prefixes to include (for example, data/ | reports/)
Exclude Prefix: Prefixes to exclude (for example, archive/ | tmp/)
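To preview how include and exclude patterns interact before running a crawl, the precedence rule (exclude wins when both match) can be sketched locally. This is an illustration under stated assumptions, not Atlan's implementation: the helper `is_selected` is hypothetical, patterns are assumed to match the whole bucket name, alternatives are written without spaces around `|`, and an empty include filter is assumed to mean "include everything".

```python
import re
from typing import Optional


def is_selected(name: str, include: Optional[str], exclude: Optional[str]) -> bool:
    """Apply include/exclude regex filters; exclude overrides include."""
    # Exclude is checked first, so it wins when both patterns match.
    if exclude and re.fullmatch(exclude, name):
        return False
    # With an include pattern set, only matching names are kept.
    if include:
        return re.fullmatch(include, name) is not None
    # Assumption: no include pattern means every name is kept.
    return True


# Hypothetical bucket names for illustration only.
buckets = ["prod-sales", "prod-sales-temp", "analytics-events", "dev-scratch"]
selected = [
    b for b in buckets
    if is_selected(b, r"prod-.*|analytics-.*", r".*-temp|.*-backup")
]
print(selected)
```

Here prod-sales-temp matches both the include and the exclude pattern and is dropped, while dev-scratch matches neither and is also dropped because an include pattern is set.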

Run crawler

You can now start cataloging your assets:

Run now: Select Run to start a one-time crawl.
Schedule runs: Select Schedule & Run to automate recurring crawls.

Monitor crawl progress in the activity log. Once complete, your S3 buckets and objects will appear in Atlan.

Troubleshooting

Permissions: Confirm all required IAM permissions are set. See the S3 setup guide for details.

Need help

Contact Atlan support for integration issues or assistance.
