Crawl S3 assets
Catalog your Amazon S3 buckets and objects in Atlan using the S3 Assets workflow. This guide walks you through setting up authentication and running your first crawl.
Prerequisites
Before you begin, make sure you have:
- Completed S3 setup with IAM credentials.
- Your AWS credentials (Access Key ID and Secret Access Key, or Role ARN) ready.
- Information about S3 buckets and prefixes you want to catalog.
- Verified that the AWS account is allowlisted to assume the role, if you use IAM role-based authentication.
- Set up the destination bucket structure, required only if you plan to use inventory-based ingestion.
Set up workflow
Create a new S3 Assets workflow:
- In the top right, select New > New Workflow.
- From the package list, select S3 Assets.
- Select Setup Workflow.
Configure extraction method
Choose how to connect to your S3 environment:
- Direct extraction
- Agent extraction
For direct extraction:
- Select Direct for the extraction method.
- Choose your authentication type:
  - IAM User: Enter your Access Key ID and Secret Access Key.
  - IAM Role: Enter your Role ARN.
- Select the AWS Region where your buckets are located.
- Select Test Authentication to verify the connection.
- Select Next.
For agent extraction:
- Select Agent for the extraction method.
- Add the secret keys for your secret store configuration.
- Follow the Secure Agent configuration guide.
- Select Next.
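Before selecting Test Authentication in the direct extraction flow, it can help to check locally that the values are complete and well formed. The following is a minimal sketch and not part of Atlan: the function names and the dictionary keys (access_key_id, secret_access_key, role_arn) are hypothetical, and the pattern only checks the general shape of an IAM role ARN (arn:<partition>:iam::<12-digit account ID>:role/<name>), not that the role exists or can be assumed.

```python
import re

# General shape of an IAM role ARN. This checks format only,
# not whether the role exists or is assumable.
ROLE_ARN_PATTERN = re.compile(
    r"arn:(aws|aws-cn|aws-us-gov):iam::\d{12}:role/[\w+=,.@/-]+"
)

# Hypothetical field names for the two authentication types described above.
REQUIRED_FIELDS = {
    "iam_user": ["access_key_id", "secret_access_key"],
    "iam_role": ["role_arn"],
}


def looks_like_role_arn(value):
    """Return True if the value has the general shape of an IAM role ARN."""
    return ROLE_ARN_PATTERN.fullmatch(value) is not None


def auth_problems(auth_type, values):
    """Return a list of problems; an empty list means the input looks usable."""
    if auth_type not in REQUIRED_FIELDS:
        return [f"unknown authentication type: {auth_type}"]
    problems = [
        f"missing {field}"
        for field in REQUIRED_FIELDS[auth_type]
        if not values.get(field, "").strip()
    ]
    if auth_type == "iam_role" and not problems:
        if not looks_like_role_arn(values["role_arn"].strip()):
            problems.append("role_arn does not look like an IAM role ARN")
    return problems
```

An empty result only means the values are plausible. Test Authentication remains the actual check that the credentials work.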
Choose ingestion method
Select your ingestion method:
- Direct ingestion: Recommended for fewer than 1 million objects. This method crawls S3 buckets and objects directly.
- Inventory ingestion: Recommended for large-scale use (more than 1 million objects). Uses inventory reports for efficiency.
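The recommendation above reduces to a simple threshold check. This is an illustrative sketch only: the function name is hypothetical, and treating exactly 1 million objects as large-scale is an assumption, since the guidance only covers "fewer than" and "more than" 1 million.

```python
# Threshold taken from the recommendation above.
INVENTORY_THRESHOLD = 1_000_000


def recommended_ingestion(object_count):
    """Return "direct" or "inventory" for an estimated object count."""
    if object_count < 0:
        raise ValueError("object_count must not be negative")
    # Assumption: exactly 1 million objects is treated as large-scale.
    return "direct" if object_count < INVENTORY_THRESHOLD else "inventory"
```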
For inventory ingestion, provide:
- S3 Bucket Name: Bucket holding the inventory reports (without the s3:// prefix).
- S3 Bucket Prefix: Prefix used in the report configuration. Include a trailing slash (/). Leave empty if no prefix was used.
Note: The region for the inventory report is taken from the credentials used in the extraction method.
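The formatting rules for S3 Bucket Name and S3 Bucket Prefix can be applied mechanically before you paste the values into the form. The sketch below is a hypothetical helper, not an Atlan function: it removes a leading s3:// from the bucket name and adds a trailing slash to a non-empty prefix. The bucket and prefix values in the usage example are made up.

```python
def normalize_inventory_location(bucket_name, prefix=""):
    """Return (bucket, prefix) formatted as the inventory fields expect."""
    bucket = bucket_name.strip()
    # The bucket name is entered without the s3:// scheme.
    if bucket.lower().startswith("s3://"):
        bucket = bucket[len("s3://"):]
    bucket = bucket.rstrip("/")
    if not bucket or "/" in bucket:
        raise ValueError("expected a bare bucket name; put any path in the prefix")

    prefix = prefix.strip()
    # A non-empty prefix needs a trailing slash; an empty prefix stays empty.
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    return bucket, prefix


print(normalize_inventory_location("s3://inventory-reports", "atlan/inventory"))
```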
Configure bucket filters
Choose which buckets and prefixes to include or exclude. Exclude filters override include filters if both match.
For a single bucket:
- Include Bucket: Exact bucket name (e.g., my-data-bucket)
- Include Prefix: Specific prefix to crawl (e.g., processed/2024/)
- Leave all other filters empty.
For multiple buckets:
- Include Bucket: Regex pattern (e.g., prod-.* | analytics-.*)
- Exclude Bucket: Regex pattern (e.g., .*-temp | .*-backup)
- Include Prefix: Prefixes to include (e.g., data/ | reports/)
- Exclude Prefix: Prefixes to exclude (e.g., archive/ | tmp/)
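To preview how a set of filters might behave, the rule "exclude overrides include" can be modeled locally. This is a rough sketch under stated assumptions, not Atlan's implementation: it assumes alternatives are separated by | with surrounding spaces ignored, that bucket patterns must match the whole bucket name, that prefixes are compared against the start of the object key, and that an empty include filter matches everything. All function names are hypothetical.

```python
import re


def split_patterns(raw):
    """Split a filter such as 'a | b' into trimmed alternatives (assumed syntax)."""
    return [part.strip() for part in raw.split("|") if part.strip()]


def bucket_matches(bucket, raw):
    """True if any alternative matches the whole bucket name."""
    return any(re.fullmatch(p, bucket) for p in split_patterns(raw))


def prefix_matches(key, raw):
    """True if the object key starts with any of the listed prefixes."""
    return any(key.startswith(p) for p in split_patterns(raw))


def is_selected(bucket, key, include_bucket="", exclude_bucket="",
                include_prefix="", exclude_prefix=""):
    """Apply exclude filters first so they override any include match."""
    if exclude_bucket and bucket_matches(bucket, exclude_bucket):
        return False
    if exclude_prefix and prefix_matches(key, exclude_prefix):
        return False
    # Assumption: an empty include filter places no restriction.
    if include_bucket and not bucket_matches(bucket, include_bucket):
        return False
    if include_prefix and not prefix_matches(key, include_prefix):
        return False
    return True


filters = dict(
    include_bucket="prod-.* | analytics-.*",
    exclude_bucket=".*-temp | .*-backup",
    include_prefix="data/ | reports/",
    exclude_prefix="archive/ | tmp/",
)
print(is_selected("prod-sales", "data/2024/orders.csv", **filters))
print(is_selected("prod-temp", "data/2024/orders.csv", **filters))
```

With these assumptions, prod-temp is rejected even though it matches the include pattern prod-.*, because the exclude pattern .*-temp also matches. Splitting on | is a simplification and would break patterns that use alternation inside a group.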
Configure connection details
- Enter a Connection Name to identify your S3 environment. Examples: production-s3, analytics-lake, raw-data-store
- Assign Connection Admins to manage access. At least one admin is required.
Run crawler
You can now start cataloging your assets:
- Run now: Select Run to start a one-time crawl.
- Schedule runs: Select Schedule & Run to automate recurring crawls.
Monitor crawl progress in the activity log. Once complete, your S3 buckets and objects will appear in Atlan.
Troubleshooting
- Permissions: Confirm all required IAM permissions are set. See the S3 setup guide for details.
Need help
- Contact Atlan support for integration issues or assistance.
See also
- What Atlan crawls from S3: Full list of assets and metadata included in the crawl.