Field extraction and schema inference

Find answers to common questions about how the Amazon DocumentDB connector extracts field information and infers schemas, including permissions, sampling configuration, TLS certificate setup, filters, and nested field support.

How does field extraction work?

The Amazon DocumentDB connector extracts field information by sampling documents from each collection and running inference rules. Field names, types, and nested schemas are derived from this sample.

This approach is necessary because DocumentDB, like MongoDB, doesn't provide a mechanism to retrieve field-level schema information out of the box. The connector runs find() on each collection, samples documents, walks them recursively, and infers a unified field schema.

To enable field extraction, the crawl user must have the find privilege on collections. The connector uses this privilege to read documents and sample data to infer field types. For details on configuring user permissions, see Create crawl user.

Which privileges do custom DocumentDB roles require?

When creating a custom role for the Amazon DocumentDB connector, the role must grant the following privileges:

  • listDatabases - to list all existing databases in the cluster.
  • listCollections - to list collections in a database.
  • collStats - to retrieve collection metadata such as average document size, document count, index size, and more.
  • find - to read collection documents. Atlan requires this action to sample documents and infer field information.

For details on creating a custom role with these privileges, see Create crawl user in the setup guide.
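As a rough illustration of how these four privileges fit together, the sketch below builds a createRole command document in Python. The role name, database names, and helper function are hypothetical, and the resource scoping (listDatabases on the cluster, the other actions on each database) is an assumption based on how these actions are usually scoped; follow Create crawl user for the supported steps.

```python
from typing import Any, Dict, List


def build_create_role_command(role_name: str, databases: List[str]) -> Dict[str, Any]:
    """Build a createRole command document granting the crawl privileges.

    Hypothetical helper: the role name and database list are supplied by you.
    """
    privileges: List[Dict[str, Any]] = [
        # listDatabases is granted on the cluster resource.
        {"resource": {"cluster": True}, "actions": ["listDatabases"]},
    ]
    for db in databases:
        privileges.append(
            {
                # An empty collection name scopes the grant to every
                # collection in the database.
                "resource": {"db": db, "collection": ""},
                "actions": ["listCollections", "collStats", "find"],
            }
        )
    return {"createRole": role_name, "privileges": privileges, "roles": []}


command = build_create_role_command("atlan_crawler", ["sales", "ecommerce"])
```

The resulting dictionary has the shape that a driver's generic command method expects, so it can be reviewed or version-controlled before an administrator runs it.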

What happens when read permission on collections is missing?

If the crawl user doesn't have read permission (the find action) on collections, only basic metadata is cataloged in Atlan:

  • Databases: Database names and collection counts are cataloged.
  • Collections: Collection names and statistics (document count, size, index information) are cataloged.
  • Columns: Field information isn't available because the connector can't sample documents to infer fields.

To extract complete metadata, including column-level information, grant the crawl user the find privilege on the collections you want to crawl. For detailed steps, see Create crawl user.

What does sample size affect?

The sample size parameter determines how many documents are sampled from each collection during field inference. This parameter is configured in the crawler settings. Adjusting it affects both extraction performance and the completeness of the inferred schema:

  • Performance (job runtime): A smaller sample size results in quicker extraction times and lower resource utilization, making jobs complete faster and minimizing load on your DocumentDB cluster.
  • Accuracy of field inference: A larger sample size increases the likelihood of capturing all field names, types, and nested structures, resulting in more comprehensive metadata.

A small sample size may miss rarely occurring (sparse) fields, while a large sample size provides a better schema picture but can slow down extraction and increase load on the cluster.

Recommendation: Start with the default sample size (100). If the inferred schema is missing fields or structure (especially sparsely used fields), increase the sample size incrementally and re-run the workflow until the results are accurate without unnecessarily impacting performance.
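One way to reason about this trade-off is a back-of-the-envelope estimate. If a field appears in a fraction p of a collection's documents and documents are sampled independently, the chance that a sample of n documents contains the field at least once is 1 - (1 - p)^n. The sketch below assumes independent random sampling, which may not match how the connector actually selects documents, and both function names are hypothetical.

```python
import math


def capture_probability(field_frequency: float, sample_size: int) -> float:
    """Chance that at least one sampled document contains the field,
    assuming documents are sampled independently."""
    if not 0.0 <= field_frequency <= 1.0:
        raise ValueError("field_frequency must be between 0 and 1")
    if sample_size < 0:
        raise ValueError("sample_size must be non-negative")
    return 1.0 - (1.0 - field_frequency) ** sample_size


def sample_size_for(field_frequency: float, target_probability: float) -> int:
    """Smallest sample size whose capture probability reaches the target."""
    if not 0.0 < field_frequency < 1.0:
        raise ValueError("field_frequency must be strictly between 0 and 1")
    if not 0.0 < target_probability < 1.0:
        raise ValueError("target_probability must be strictly between 0 and 1")
    return math.ceil(
        math.log(1.0 - target_probability) / math.log(1.0 - field_frequency)
    )
```

Under these assumptions, a field present in 1% of documents has roughly a 63% chance of appearing in a sample of 100, and a sample of about 299 documents is needed to reach a 95% chance.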

How are BSON types mapped to Atlan data types?

The connector maps each inferred field's BSON type to an Atlan column data type. Common mappings include:

  • String and ObjectId → STRING
  • Integer → INTEGER, 64-bit integer → LONG
  • Double → DOUBLE, Decimal128 → DECIMAL
  • Boolean → BOOLEAN
  • Date and Timestamp → TIMESTAMP
  • Embedded document → STRUCT
  • Array → ARRAY
  • Binary → BINARY
  • Null → NULL
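The mappings above can be read as a simple lookup table. The sketch below keys the table by the standard BSON type aliases (such as "objectId" and "binData"); the function name is hypothetical, and returning None for unlisted types is an assumption, because this FAQ doesn't say how other BSON types are handled.

```python
from typing import Dict, Optional

# Keys are BSON type aliases; values are the Atlan data types listed above.
BSON_TO_ATLAN: Dict[str, str] = {
    "string": "STRING",
    "objectId": "STRING",
    "int": "INTEGER",
    "long": "LONG",
    "double": "DOUBLE",
    "decimal": "DECIMAL",
    "bool": "BOOLEAN",
    "date": "TIMESTAMP",
    "timestamp": "TIMESTAMP",
    "object": "STRUCT",
    "array": "ARRAY",
    "binData": "BINARY",
    "null": "NULL",
}


def atlan_type(bson_alias: str) -> Optional[str]:
    """Return the Atlan data type for a BSON alias, or None when the alias
    isn't in the list of common mappings (assumed fallback)."""
    return BSON_TO_ATLAN.get(bson_alias)
```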

How are arrays and polymorphic fields handled?

  • Arrays: A field whose values are arrays is mapped to the ARRAY data type. If the first element of the array is an object, the connector walks that object's nested fields and publishes them like any other nested field.
  • Polymorphic fields: When a field has different types across sampled documents, the connector resolves the field to the most commonly observed type in the sample.
  • Sparse fields: Fields that appear in only some documents are captured as long as they occur in the sampled documents. Increasing the sample size improves the chance of capturing sparse fields.
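The polymorphic rule amounts to a majority vote over the types observed for a field. The following is a minimal sketch of that idea, not the connector's implementation; the function name is hypothetical, and the tie-breaking behavior (the type seen first wins) is an assumption, since this FAQ doesn't describe how ties are resolved.

```python
from collections import Counter
from typing import Iterable


def resolve_type(observed_types: Iterable[str]) -> str:
    """Pick the most frequently observed type for a field."""
    counts = Counter(observed_types)
    if not counts:
        raise ValueError("at least one observed type is required")
    # most_common orders equal counts by first occurrence, so a tie
    # resolves to whichever type was seen first in the sample.
    return counts.most_common(1)[0][0]
```

For example, a field observed as STRING in two sampled documents and INTEGER in one resolves to STRING.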

Are nested fields supported?

Yes, the Amazon DocumentDB connector supports nested fields and infers them automatically. The connector publishes nested fields up to 3 levels deep. This limit is a deliberate design decision that balances performance and usability.

For example, consider a document with the following structure:

{
  "customer_id": "12345",
  "name": "John Doe",
  "address": {
    "street": "123 Main St",
    "city": "San Francisco",
    "country": {
      "code": "US",
      "name": "United States"
    }
  },
  "orders": [
    {
      "order_id": "ORD-001",
      "total": 99.99
    }
  ]
}

The connector extracts the following fields:

  • Level 1 (top-level fields): customer_id, name, address, orders
  • Level 2 (nested within address): address.street, address.city, address.country
  • Level 3 (nested within address.country): address.country.code, address.country.name
  • Level 2 (nested within the orders array): orders.order_id, orders.total

Fields nested deeper than 3 levels aren't published. The connector uses qualified names with dot notation to represent the hierarchy, making it easy to understand the relationship between parent and nested fields in Atlan.
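The depth limit and dot notation can be sketched as a short recursive walk. This is an illustration of the behavior described in this section, not the connector's code; the function name and the max_depth parameter are hypothetical.

```python
from typing import Any, Dict, List

MAX_DEPTH = 3  # publication depth stated in this section


def field_paths(document: Dict[str, Any], max_depth: int = MAX_DEPTH) -> List[str]:
    """List dot-notation field paths, stopping at max_depth levels."""
    paths: List[str] = []

    def walk(node: Dict[str, Any], prefix: str, depth: int) -> None:
        if depth > max_depth:
            return  # deeper fields aren't published
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.append(path)
            # For arrays, descend into the first element when it is an object.
            if isinstance(value, list) and value and isinstance(value[0], dict):
                value = value[0]
            if isinstance(value, dict):
                walk(value, path, depth + 1)

    walk(document, "", 1)
    return paths


example = {
    "customer_id": "12345",
    "name": "John Doe",
    "address": {
        "street": "123 Main St",
        "city": "San Francisco",
        "country": {"code": "US", "name": "United States"},
    },
    "orders": [{"order_id": "ORD-001", "total": 99.99}],
}

paths = field_paths(example)
```

For the example document, paths contains the eleven fields listed above. A hypothetical fourth-level field such as address.country.region.name would be skipped.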

How are column schemas consolidated?

Atlan consolidates the schemas inferred across all sampled documents in a collection into a unified view, capturing the fields observed with their nesting levels and data types. The connector performs a recursive traversal, converting each unique field path into a column entry while tracking parent-child relationships through qualified names. Because the schema is inferred from a sample, fields that don't appear in any sampled document aren't captured, and nesting beyond 3 levels isn't published.
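A minimal sketch of this consolidation step follows, assuming the behavior described in this FAQ: a recursive walk per sampled document, a union of the observed field paths, the most common type per path, and a 3-level depth limit. The function names and the coarse type labels are hypothetical stand-ins, not the connector's implementation.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable, Optional


def type_name(value: Any) -> str:
    """Coarse, illustrative type labels for plain Python values."""
    if isinstance(value, dict):
        return "STRUCT"
    if isinstance(value, list):
        return "ARRAY"
    return type(value).__name__  # for example "str", "int", "bool"


def consolidate(
    documents: Iterable[Dict[str, Any]], max_depth: int = 3
) -> Dict[str, Dict[str, Optional[str]]]:
    """Merge per-document field paths into one schema for the collection."""
    observed: Dict[str, Counter] = defaultdict(Counter)  # path -> type counts
    parents: Dict[str, Optional[str]] = {}  # path -> parent path

    def walk(node: Dict[str, Any], parent: Optional[str], depth: int) -> None:
        if depth > max_depth:
            return
        for key, value in node.items():
            path = f"{parent}.{key}" if parent else key
            observed[path][type_name(value)] += 1
            parents[path] = parent
            if isinstance(value, list) and value and isinstance(value[0], dict):
                value = value[0]
            if isinstance(value, dict):
                walk(value, path, depth + 1)

    for document in documents:
        walk(document, None, 1)

    return {
        path: {"type": counts.most_common(1)[0][0], "parent": parents[path]}
        for path, counts in observed.items()
    }
```

Each entry in the result corresponds to one column: the key is the qualified field path, and the parent value links a nested field back to the field that contains it.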

How do database and collection filters work?

The connector applies include and exclude filters to control which databases and collections are crawled. Each filter is a JSON object that maps a database-name regular expression to a list of collection-name regular expressions. An empty list matches all collections in the matched databases.

For example, the following include filter crawls all collections in every database whose name starts with sales_:

{"^sales_.*$": []}

To crawl only specific collections, list their name patterns instead of leaving the list empty. This include filter crawls only the orders and customers collections in the ecommerce database:

{"^ecommerce$": ["^orders$", "^customers$"]}

You can combine rules for multiple databases in a single filter. This include filter crawls every collection in the analytics database and only the collections whose name starts with fact_ in any database whose name starts with warehouse_:

{"^analytics$": [], "^warehouse_.*$": ["^fact_.*$"]}

Exclude filters use the same structure. This exclude filter skips temporary and archive collections across every database while still crawling everything else:

{"^.*$": ["^tmp_.*$", "^archive_.*$"]}

If an asset appears in both the include and exclude filters, the exclude filter takes precedence. System databases (admin, local, and config) are excluded from extraction. For details on configuring these filters, see Configure crawler.
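To check a filter before running a crawl, the matching rules above can be approximated locally. The sketch below is an illustration under stated assumptions, not the connector's logic: it assumes patterns are evaluated with a regular-expression search (the examples carry their own ^ and $ anchors), that an empty include filter includes everything, and that both function names are hypothetical.

```python
import re
from typing import Dict, List

Filter = Dict[str, List[str]]

# System databases that are excluded from extraction.
SYSTEM_DATABASES = {"admin", "local", "config"}


def matches(filter_: Filter, database: str, collection: str) -> bool:
    """True when any rule in the filter matches the database and collection."""
    for db_pattern, collection_patterns in filter_.items():
        if not re.search(db_pattern, database):
            continue
        # An empty list matches every collection in the matched database.
        if not collection_patterns:
            return True
        if any(re.search(p, collection) for p in collection_patterns):
            return True
    return False


def should_crawl(
    database: str, collection: str, include: Filter, exclude: Filter
) -> bool:
    """Apply include and exclude filters; exclude takes precedence."""
    if database in SYSTEM_DATABASES:
        return False
    # Assumption: an empty include filter means "include everything".
    if include and not matches(include, database, collection):
        return False
    return not matches(exclude, database, collection)
```

With the combined include filter and the exclude filter shown above, analytics.events and warehouse_eu.fact_sales would be crawled, while analytics.tmp_events and warehouse_eu.dim_date would be skipped.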

How do I configure TLS CA certificates for DocumentDB?

When TLS is enabled on your cluster (the default for Amazon DocumentDB), provide the Amazon DocumentDB CA certificate in the TLS CA Certificate field. AWS publishes this certificate as the global CA bundle, global-bundle.pem. Download it from https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem and provide the complete PEM bundle. It contains the full certificate chain that TLS verification requires, so don't extract a single certificate from it.

Because Amazon DocumentDB is crawled through Self-Deployed Runtime, the TLS CA Certificate field is a file input that accepts any of the input forms supported in SDR mode. You can provide global-bundle.pem in any of three ways, and you can pick a different form per file:

  • Upload through the Atlan UI. Click the upload icon and select the file. The file is stored on your Atlan tenant and delivered to the runtime at workflow run time. Best for small, one-off files.
  • Reference a key in your secret store. Enter the key name (for example, atlan/workflows/documentdb/tls-ca). The value at that key must be the base64-encoded contents of the PEM file. Most secret stores cap a single entry at around 64 KB, so use the object-store option for larger files.
  • Reference a path in object storage. Enter the path using the objectstore:// scheme (for example, objectstore://atlan-credentials/documentdb/global-bundle.pem). The runtime resolves it using the objectstore binding configured when you installed the runtime.

For details on each form, including how the field detects which one you've used, see Configure file inputs for workflow execution.
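For the secret-store form, the value must be the base64-encoded contents of the PEM file. The sketch below shows one way to produce that value and to check it against the approximate 64 KB cap mentioned above. The function name and the exact limit are assumptions; check the actual limit of your secret store.

```python
import base64

# Approximate per-entry cap mentioned above; your secret store may differ.
SECRET_SIZE_LIMIT = 64 * 1024


def encode_pem_for_secret(pem_bytes: bytes, limit: int = SECRET_SIZE_LIMIT) -> str:
    """Base64-encode PEM contents and reject values above the size limit."""
    encoded = base64.b64encode(pem_bytes).decode("ascii")
    if len(encoded) > limit:
        raise ValueError(
            f"encoded value is {len(encoded)} bytes, above the {limit}-byte "
            "limit; use the object-storage form instead"
        )
    return encoded


# Example usage (reads the bundle downloaded earlier):
#   from pathlib import Path
#   value = encode_pem_for_secret(Path("global-bundle.pem").read_bytes())
```

Base64 encoding grows the data by about a third, so a bundle can exceed the cap after encoding even if the raw file is below it.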

If the runtime reaches your cluster through an SSH tunnel or another intermediate hop where the hostname in the TLS certificate doesn't match the address the runtime connects to, select Skip TLS hostname verification. For details on these connection settings, see Choose extraction method in the crawl guide.