Crawl Amazon DocumentDB

Create an Amazon DocumentDB crawler workflow to extract and catalog metadata from your DocumentDB databases, collections, and inferred field schemas in Atlan. Amazon DocumentDB can be crawled only through a Self-Deployed Runtime running in the same VPC as your cluster. This guide walks you through installing the runtime, configuring the connection, and running the crawler.

Prerequisites

Before you begin, make sure you have:

Reviewed the order of operations for connecting data sources to Atlan.
Set up Amazon DocumentDB and created a crawl user with appropriate permissions.
Obtained the Amazon DocumentDB CA certificate (the global CA bundle global-bundle.pem), if your cluster requires TLS. For details, see How do I configure TLS CA certificates for DocumentDB? in the FAQ.
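
Before handing the CA bundle to the crawler, it can help to confirm that the file is the complete bundle and not a single extracted certificate. The following is a minimal sketch, not part of Atlan's tooling: the function names `count_certificates` and `encode_bundle` are hypothetical, and the base64 step only matters if you plan to keep the bundle in a secret store as a base64-encoded value.

```python
import base64

PEM_BEGIN = "-----BEGIN CERTIFICATE-----"
PEM_END = "-----END CERTIFICATE-----"


def count_certificates(pem_text: str) -> int:
    """Count the PEM certificate blocks in a bundle (hypothetical helper)."""
    begins = pem_text.count(PEM_BEGIN)
    ends = pem_text.count(PEM_END)
    if begins != ends:
        # A mismatch usually means the file was truncated or hand-edited.
        raise ValueError("unbalanced PEM markers; the bundle may be truncated")
    return begins


def encode_bundle(pem_text: str) -> str:
    """Base64-encode the whole bundle as one line of text (hypothetical helper)."""
    return base64.b64encode(pem_text.encode("ascii")).decode("ascii")
```

A typical use would be to read global-bundle.pem into a string, check that `count_certificates` reports more than one certificate, and then store the output of `encode_bundle` if your secret store expects a base64 value.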

Create crawler workflow

To create an Amazon DocumentDB crawler workflow:

In the top navigation, click Marketplace.
Search for AWS DocumentDB Assets and select it.
Click Install.
Once installation completes, click Setup Workflow on the same tile.

If you navigated away before installation completed, go to New > New Workflow and select AWS DocumentDB Assets to proceed.

Choose extraction method

Amazon DocumentDB is crawled only through Self-Deployed Runtime. Choose the Self-deployed runtime tab below to configure the connection.

Direct
Self-deployed runtime

In Self-deployed runtime extraction, the runtime executes metadata extraction within your organization's environment, inside the same VPC as your DocumentDB cluster.

Install Self-Deployed Runtime in the same VPC as your DocumentDB cluster, if you haven't already:
- Install via Docker Compose
- Install on Kubernetes
On the extraction method step, select the Agent tab and choose your Self-Deployed Runtime.
Store sensitive information—such as the crawl user password or IAM credentials—in the secret store configured with your Self-Deployed Runtime, then reference those secrets in the corresponding fields. For more information, see Retrieve credentials.
Provide the connection details:
1. For Cluster endpoint, enter the hostname of your DocumentDB cluster endpoint. This is the network address where your DocumentDB cluster accepts connections.
2. For Port, enter the port number on which DocumentDB is listening. The default port is 27017.
3. For Authentication type, select the method the runtime uses to authenticate with your cluster:
  - Basic: Username and password authentication using SCRAM-SHA-1.
  - IAM: AWS Identity and Access Management authentication using the MONGODB-AWS mechanism.
4. Provide the credentials for the selected authentication type:
  - For Basic authentication, enter the Username and Password of the crawl user you created for Atlan.
  - For IAM authentication, provide the AWS credentials for the IAM user or role mapped to a DocumentDB user.
5. For Authentication database, enter the name of the database where the user credentials are stored. Typically, this is admin, but it can be any database where the user was created.
6. For TLS CA Certificate, provide the Amazon DocumentDB CA certificate (the global CA bundle global-bundle.pem, available from https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem) using any of the input forms supported in SDR mode: upload it through the Atlan UI, reference a base64-encoded key in your secret store, or reference an objectstore:// path. Provide the complete bundle, not a single extracted certificate. For details, see How do I configure TLS CA certificates for DocumentDB? in the FAQ.
7. For Replica set, enter the name of the cluster's replica set. The default is rs0.
8. Select Skip TLS hostname verification if the runtime reaches your cluster through an SSH tunnel or another intermediate hop where the hostname presented in the TLS certificate doesn't match the address the runtime connects to.
9. Select Direct node connection to connect to a single node rather than discovering the replica set topology. This is the MongoDB directConnection driver option—use it when you want the runtime to connect to a specific node instead of performing replica-set discovery. It's a per-connection setting within Self-Deployed Runtime and is unrelated to how Atlan reaches your cluster.
Navigate to the bottom of the screen and click Next.
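
To see how the connection fields above relate to one another, the sketch below assembles them into a MongoDB-style connection URI, which you could use to test connectivity from the runtime host with a MongoDB-compatible client. This is an illustration under stated assumptions, not a description of how the Self-Deployed Runtime builds its connection: `build_docdb_uri` and its parameter names are hypothetical, while the URI options (`authSource`, `authMechanism`, `replicaSet`, `tls`, `tlsCAFile`, `tlsAllowInvalidHostnames`, `directConnection`) are standard MongoDB connection string options. Omitting `replicaSet` when `directConnection` is set is a choice made for this sketch.

```python
from urllib.parse import quote_plus, urlencode


def build_docdb_uri(
    host,
    port=27017,
    username=None,
    password=None,
    auth_db="admin",
    replica_set="rs0",
    ca_file=None,
    use_iam=False,
    skip_hostname_check=False,
    direct_connection=False,
):
    """Assemble a MongoDB-style URI from the connection fields (illustrative only)."""
    options = {}
    if use_iam:
        # MONGODB-AWS authenticates against the $external database; the AWS
        # credentials are assumed to come from the environment, not the URI.
        options["authMechanism"] = "MONGODB-AWS"
        options["authSource"] = "$external"
        credentials = ""
    else:
        if username is None or password is None:
            raise ValueError("Basic authentication needs a username and password")
        options["authMechanism"] = "SCRAM-SHA-1"
        options["authSource"] = auth_db
        # Percent-encode credentials so characters such as @ or / stay valid.
        credentials = f"{quote_plus(username)}:{quote_plus(password)}@"
    if direct_connection:
        # Connect to the named node only, without replica set discovery.
        options["directConnection"] = "true"
    else:
        options["replicaSet"] = replica_set
    if ca_file:
        options["tls"] = "true"
        options["tlsCAFile"] = ca_file
    if skip_hostname_check:
        options["tlsAllowInvalidHostnames"] = "true"
    return f"mongodb://{credentials}{host}:{port}/?{urlencode(options, safe='/')}"
```

For example, calling `build_docdb_uri("docdb.example.internal", username="crawler", password="secret", ca_file="/certs/global-bundle.pem")` (all placeholder values) yields a URI with SCRAM-SHA-1 authentication against admin, the rs0 replica set, and TLS enabled.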

Configure connection

To complete the Amazon DocumentDB connection configuration:

Provide a Connection Name that represents your source environment. For example, you might use values like production, development, gold, or analytics.
To change who can manage this connection, update the users or groups listed under Connection Admins. If you don't specify any user or group, no one can manage the connection, not even admins.
Navigate to the bottom of the screen and click Next to proceed.

Configure crawler

Before running the Amazon DocumentDB crawler, you can further configure it.

On the Metadata Filters page, you can control which databases and collections Atlan crawls and how schemas are inferred. If an asset appears in both the include and exclude filters, the exclude filter takes precedence.

To control which databases and collections are crawled, provide include and exclude filters. Each filter is a JSON object that maps a database-name regular expression to a list of collection-name regular expressions. An empty list matches all collections in the matched databases.
- For example, the following include filter crawls all collections in every database whose name starts with sales_:
```
{"^sales_.*$": []}
```
To set the number of documents to sample from each collection for field inference, adjust the value in the Sample size field. The default is 100. For details on how this parameter affects extraction performance and field inference accuracy, see What does sample size affect? in the FAQ.
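
Because DocumentDB collections have no fixed schema, fields are inferred from the sampled documents, so a field that appears only outside the sample is not cataloged. The sketch below illustrates that trade-off. It is a simplified model, not Atlan's implementation: `infer_fields` is a hypothetical function, it only looks at top-level fields, and taking the first `sample_size` documents is an assumption about how the sample is chosen.

```python
from collections import defaultdict


def infer_fields(documents, sample_size=100):
    """Map each top-level field name to the value types seen in the sample."""
    observed = defaultdict(set)
    # Assumption: the sample is simply the first `sample_size` documents.
    for doc in documents[:sample_size]:
        for name, value in doc.items():
            observed[name].add(type(value).__name__)
    return {name: sorted(types) for name, types in observed.items()}


docs = [
    {"_id": 1, "name": "widget"},
    {"_id": 2, "price": 9.5},
    {"_id": 3, "name": None, "discontinued": True},
]
small = infer_fields(docs, sample_size=2)  # the "discontinued" field is missed
full = infer_fields(docs, sample_size=3)   # every field and type is observed
```

A larger sample reads more documents per collection but is more likely to catch rarely populated fields and fields whose type varies between documents.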

For details on how database and collection filters are applied, see How do database and collection filters work? in the FAQ.
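
The filter rules described in this section can be sketched as a small matching function. This is a minimal illustration, not the crawler's actual code: `matches` and `is_crawled` are hypothetical names, the use of `re.search` is an assumption about how the regular expressions are applied, and treating a missing include filter as "crawl everything" is also an assumption.

```python
import re


def matches(filter_map, database, collection):
    """Return True if the database and collection match any filter entry."""
    for db_pattern, collection_patterns in filter_map.items():
        if not re.search(db_pattern, database):
            continue
        # An empty list matches all collections in the matched database.
        if not collection_patterns:
            return True
        if any(re.search(p, collection) for p in collection_patterns):
            return True
    return False


def is_crawled(database, collection, include=None, exclude=None):
    """Apply the exclude filter first, since it takes precedence."""
    if exclude and matches(exclude, database, collection):
        return False
    if include:
        return matches(include, database, collection)
    return True  # Assumption: with no include filter, everything is crawled.


include = {"^sales_.*$": []}
exclude = {"^sales_eu$": ["^tmp_"]}
```

With these example filters, `is_crawled("sales_eu", "orders", include, exclude)` is true, while `is_crawled("sales_eu", "tmp_load", include, exclude)` is false because the exclude filter wins.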

Run crawler

After completing the previous steps, run the Amazon DocumentDB crawler in one of the following ways:

To run the crawler once immediately, click the Run button at the bottom of the screen.
To schedule the crawler to run hourly, daily, weekly, or monthly, click the Schedule & Run button at the bottom of the screen.

Once the crawler completes running, you can see the assets on Atlan's asset page.
