Crawl Amazon DocumentDB
Create an Amazon DocumentDB crawler workflow to extract and catalog metadata from your DocumentDB databases, collections, and inferred field schemas in Atlan. Amazon DocumentDB is crawled only through Self-Deployed Runtime running in the same VPC as your cluster. This guide walks you through installing the runtime, configuring the connection, and running the crawler.
Prerequisites
Before you begin, make sure you have:
- Reviewed the order of operations for connecting data sources to Atlan.
- Set up Amazon DocumentDB and created a crawl user with appropriate permissions.
- The Amazon DocumentDB CA certificate (the global CA bundle global-bundle.pem), if your cluster requires TLS. For details, see How do I configure TLS CA certificates for DocumentDB? in the FAQ.
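Before supplying the CA bundle, it can help to confirm that the file is the complete bundle rather than a single extracted certificate, and to produce a base64 form for a secret store. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, and the assumption that the secret store expects the base64 encoding of the whole PEM file isn't confirmed by this guide; check the FAQ for the exact format.

```python
import base64

PEM_BEGIN = "-----BEGIN CERTIFICATE-----"
PEM_END = "-----END CERTIFICATE-----"


def count_certificates(pem_text: str) -> int:
    """Count the certificates in a PEM bundle by counting BEGIN/END markers."""
    begins = pem_text.count(PEM_BEGIN)
    ends = pem_text.count(PEM_END)
    if begins != ends:
        raise ValueError("PEM text has unbalanced BEGIN/END markers")
    return begins


def encode_bundle(pem_text: str) -> str:
    """Base64-encode the whole bundle (assumed format for a secret value)."""
    return base64.b64encode(pem_text.encode("utf-8")).decode("ascii")


# Example usage, assuming global-bundle.pem was already downloaded:
#   with open("global-bundle.pem", encoding="utf-8") as handle:
#       bundle = handle.read()
#   print(count_certificates(bundle))  # a full bundle holds many certificates
#   secret_value = encode_bundle(bundle)
```

A count of one suggests that only a single certificate was extracted, which is the case the guide warns against.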
Create crawler workflow
To create an Amazon DocumentDB crawler workflow:
- In the top navigation, click Marketplace.
- Search for AWS DocumentDB Assets and select it.
- Click Install.
- Once installation completes, click Setup Workflow on the same tile.
If you navigated away before installation completed, go to New > New Workflow and select AWS DocumentDB Assets to proceed.
Choose extraction method
Amazon DocumentDB is crawled only through Self-Deployed Runtime. Choose the Self-deployed runtime tab below to configure the connection.
- Direct
- Self-deployed runtime
Direct extraction—where Atlan Cloud connects to your cluster over the internet—isn't supported for Amazon DocumentDB. DocumentDB clusters have no public endpoint by design, so you must crawl them through Self-Deployed Runtime instead. Select the Self-deployed runtime tab to continue.
For why this is the only supported method, see Why is Amazon DocumentDB supported only through self-deployed runtime? in the FAQ.
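Because the cluster has no public endpoint, a quick way to confirm that a host sits on a network path to the cluster is a plain TCP reachability check run from that host. The sketch below is illustrative only: can_reach is a hypothetical helper, not part of Self-Deployed Runtime, and a successful TCP connection says nothing about TLS or authentication.

```python
import socket


def can_reach(host: str, port: int = 27017, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


# Example usage, with a placeholder for your own cluster endpoint:
#   can_reach("<your-cluster-endpoint>", 27017)
```

Run from a machine outside the VPC, this check is expected to fail, which is consistent with the requirement to run the runtime inside the VPC.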
In Self-deployed runtime extraction, the runtime executes metadata extraction within your organization's environment, inside the same VPC as your DocumentDB cluster.
- Install Self-Deployed Runtime in the same VPC as your DocumentDB cluster, if you haven't already.
- On the extraction method step, select the Agent tab and choose your Self-Deployed Runtime.
- Store sensitive information, such as the crawl user password or IAM credentials, in the secret store configured with your Self-Deployed Runtime, then reference those secrets in the corresponding fields. For more information, see Retrieve credentials.
- Provide the connection details:
  - For Cluster endpoint, enter the hostname of your DocumentDB cluster endpoint. This is the network address where your DocumentDB cluster accepts connections.
  - For Port, enter the port number on which DocumentDB is listening. The default port is 27017.
  - For Authentication type, select the method the runtime uses to authenticate with your cluster:
    - Basic: Username and password authentication using SCRAM-SHA-1.
    - IAM: AWS Identity and Access Management authentication using the MONGODB-AWS mechanism.
  - Provide the credentials for the selected authentication type:
    - For Basic authentication, enter the Username and Password of the crawl user you created for Atlan.
    - For IAM authentication, provide the AWS credentials for the IAM user or role mapped to a DocumentDB user.
  - For Authentication database, enter the name of the database where the user credentials are stored. Typically, this is admin, but it can be any database where the user was created.
  - For TLS CA Certificate, provide the Amazon DocumentDB CA certificate (the global CA bundle global-bundle.pem, available from https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem) using any of the input forms supported in SDR mode: upload it through the Atlan UI, reference a base64-encoded key in your secret store, or reference an objectstore:// path. Provide the complete bundle, not a single extracted certificate. For details, see How do I configure TLS CA certificates for DocumentDB? in the FAQ.
  - For Replica set, enter the name of the cluster's replica set. The default is rs0.
  - Select Skip TLS hostname verification if the runtime reaches your cluster through an SSH tunnel or another intermediate hop where the hostname presented in the TLS certificate doesn't match the address the runtime connects to.
  - Select Direct node connection to connect to a single node rather than discovering the replica set topology. This is the MongoDB directConnection driver option; use it when you want the runtime to connect to a specific node instead of performing replica-set discovery. It's a per-connection setting within Self-Deployed Runtime and is unrelated to how Atlan reaches your cluster.
- Navigate to the bottom of the screen and click Next.
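To make the connection fields above more concrete, the following sketch shows how they could map onto standard MongoDB connection-string options. It's an illustration, not a description of how Self-Deployed Runtime builds its connection. The function build_documentdb_uri and its parameter names are hypothetical. The sketch also makes some choices of its own:

- It enables TLS only when a CA file is given.
- It drops replicaSet when a direct connection is requested.
- It uses $external as the auth source for MONGODB-AWS, as MongoDB drivers require.
- It assumes IAM credentials come from the environment instead of the URI.

```python
from urllib.parse import quote_plus, urlencode


def build_documentdb_uri(
    host,
    port=27017,
    *,
    auth_type="basic",
    username=None,
    password=None,
    auth_database="admin",
    replica_set="rs0",
    ca_file=None,
    skip_hostname_verification=False,
    direct_connection=False,
):
    """Map the form fields to a MongoDB connection string (illustrative)."""
    options = {}
    credentials = ""

    if auth_type == "basic":
        if not username or not password:
            raise ValueError("Basic authentication needs a username and password")
        # Percent-encode credentials so characters like '@' or '/' are safe.
        credentials = f"{quote_plus(username)}:{quote_plus(password)}@"
        options["authMechanism"] = "SCRAM-SHA-1"
        options["authSource"] = auth_database
    elif auth_type == "iam":
        # Assumption: AWS credentials are resolved from the environment.
        options["authMechanism"] = "MONGODB-AWS"
        options["authSource"] = "$external"
    else:
        raise ValueError(f"Unknown authentication type: {auth_type}")

    if ca_file:
        options["tls"] = "true"
        options["tlsCAFile"] = ca_file
        if skip_hostname_verification:
            options["tlsAllowInvalidHostnames"] = "true"

    if direct_connection:
        # Sketch-level choice: skip replica-set discovery entirely.
        options["directConnection"] = "true"
    elif replica_set:
        options["replicaSet"] = replica_set

    # DocumentDB doesn't support retryable writes.
    options["retryWrites"] = "false"

    return f"mongodb://{credentials}{host}:{port}/?{urlencode(options)}"
```

A string built this way could be passed to a MongoDB driver from a host inside the VPC to test the same settings outside Atlan, for example with direct_connection=True and skip_hostname_verification=True when going through an SSH tunnel.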
Configure connection
To complete the Amazon DocumentDB connection configuration:
- Provide a Connection Name that represents your source environment. For example, you might use values like production, development, gold, or analytics.
- To change the users who are able to manage this connection, change the users or groups listed under Connection Admins. If you don't specify any user or group, no one can manage the connection, not even admins.
- Navigate to the bottom of the screen and click Next to proceed.
Configure crawler
Before running the Amazon DocumentDB crawler, you can further configure it.
On the Metadata Filters page, you can control which databases and collections Atlan crawls and how schemas are inferred. If an asset appears in both the include and exclude filters, the exclude filter takes precedence.
- To control which databases and collections are crawled, provide include and exclude filters. Each filter is a JSON object that maps a database-name regular expression to a list of collection-name regular expressions. An empty list matches all collections in the matched databases.
  - For example, the following include filter crawls all collections in every database whose name starts with sales_: {"^sales_.*$": []}
- To set the number of documents to sample from each collection for field inference, adjust the value in the Sample size field. The default is 100. For details on how this parameter affects extraction performance and field inference accuracy, see What does sample size affect? in the FAQ.
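To illustrate why the sample size matters, the sketch below infers field names and value types from a limited number of documents. It's a simplified illustration, not Atlan's inference algorithm. The function infer_fields is hypothetical, documents are plain Python dictionaries, and the sample is simply the first N documents, whereas a real crawler may sample differently.

```python
from collections import defaultdict


def _walk(document, prefix, fields):
    """Record the Python type name seen at each (dotted) field path."""
    for key, value in document.items():
        path = f"{prefix}.{key}" if prefix else key
        fields[path].add(type(value).__name__)
        if isinstance(value, dict):
            # Descend into embedded documents.
            _walk(value, path, fields)


def infer_fields(documents, sample_size=100):
    """Infer field paths and observed types from the first sample_size documents."""
    fields = defaultdict(set)
    for document in documents[:sample_size]:
        _walk(document, "", fields)
    return {path: sorted(types) for path, types in sorted(fields.items())}


# Example usage with made-up documents:
#   docs = [{"_id": 1, "name": "a"}, {"_id": 2, "discount": 0.1}]
#   infer_fields(docs, sample_size=1)  # never sees the "discount" field
```

A field that appears only in documents outside the sample is never seen, so a larger sample tends to capture more fields at the cost of reading more documents.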
For details on how database and collection filters are applied, see How do database and collection filters work? in the FAQ.
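The filter rules described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Atlan's implementation: is_crawled is a hypothetical helper, patterns are matched with re.search (the guide doesn't say whether matching is anchored), and an empty include filter is assumed to mean that everything is included.

```python
import re


def _matches(filter_map, database, collection):
    """Return True if any database pattern and its collection patterns match."""
    for db_pattern, collection_patterns in filter_map.items():
        if not re.search(db_pattern, database):
            continue
        # An empty list matches all collections in the matched database.
        if not collection_patterns:
            return True
        if any(re.search(p, collection) for p in collection_patterns):
            return True
    return False


def is_crawled(database, collection, include=None, exclude=None):
    """Apply include and exclude filters; exclude takes precedence."""
    include = include or {}
    exclude = exclude or {}
    if exclude and _matches(exclude, database, collection):
        return False
    if not include:
        # Assumption: no include filter means nothing is filtered out.
        return True
    return _matches(include, database, collection)


# Example usage with the include filter from this guide and a made-up exclude:
#   include = {"^sales_.*$": []}
#   exclude = {"^sales_eu$": ["^tmp_"]}
#   is_crawled("sales_eu", "orders", include, exclude)
```

With these example filters, sales_eu.orders is crawled, sales_eu.tmp_load is skipped because the exclude filter wins, and any database whose name doesn't start with sales_ is skipped.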
Run crawler
To run the Amazon DocumentDB crawler, after completing the previous steps:
- To run the crawler once immediately, click the Run button at the bottom of the screen.
- To schedule the crawler to run hourly, daily, weekly, or monthly, click the Schedule & Run button at the bottom of the screen.
Once the crawler completes running, you can see the assets on Atlan's asset page.
See also
- What does Atlan crawl from Amazon DocumentDB?: Learn about the DocumentDB assets and metadata that Atlan discovers and catalogs.
- Field extraction and schema inference: Find answers to questions about field extraction, schema inference, permissions, and configuration.
- How Atlan connects to Amazon DocumentDB: Connection protocols, ports, and security.