Field extraction and schema inference

Find answers to common questions about how the MongoDB (self-managed) connector extracts field information and infers schemas, including permissions, sampling configuration, inferred field handling, SSL/TLS setup, filters, and nested field support.

How does the connector extract field information?

The MongoDB connector extracts field information by sampling Collection data and running inference rules. Field names, types, and nested schemas are derived using this approach.

This approach is necessary because MongoDB (self-managed) doesn't provide a mechanism to retrieve field-level schema information out of the box. MongoDB Atlas (the cloud offering) offers a Data Federation layer that enables metadata and schema discovery through commands like sqlGetSchema, without reading the underlying data.

To enable field extraction, the database user must have the find privilege on Collections. The connector uses this privilege to sample collection documents, understand schema structure, and infer field types. For details on configuring user permissions, see Create database user.
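As an illustration only (not the connector's actual implementation), the following sketch shows how a client holding the find privilege can sample documents and infer field names and types. The host, credentials, database, and collection names are hypothetical.

from collections import defaultdict
from pymongo import MongoClient

# Hypothetical connection details; the user must hold the find privilege.
client = MongoClient("mongodb://mongodb.example.com:27017", username="atlan_user", password="<password>")
collection = client["sales"]["orders"]

# Sample documents and record the Python types observed for each top-level field.
field_types = defaultdict(set)
for doc in collection.aggregate([{"$sample": {"size": 1000}}]):
    for field, value in doc.items():
        field_types[field].add(type(value).__name__)

for field, types in sorted(field_types.items()):
    print(field, sorted(types))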

What privileges are required for a custom MongoDB role?

When creating a custom role for the MongoDB connector, the role must grant the following privileges:

  • listDatabases - to list all existing databases in the cluster.
  • listCollections - to list collections in a database.
  • collStats - to retrieve collection metadata such as average document size, document count, and more.
  • dbStats - to retrieve statistics for a given database.
  • find - to read collection data. Atlan requires this action to sample documents and derive field information, including schema structure and field types.

For details on creating a custom role with these privileges, see Create database user in the setup guide.
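As an illustration, a custom role with these privileges can be created with the createRole command. In the sketch below, the role name atlan_metadata_role, the admin credentials, and the host are hypothetical; adapt the resource scope to your environment.

from pymongo import MongoClient

# Hypothetical admin connection.
client = MongoClient("mongodb://mongodb.example.com:27017", username="admin", password="<password>")

# Create a custom role granting the privileges listed above.
client.admin.command(
    "createRole",
    "atlan_metadata_role",
    privileges=[
        # listDatabases is a cluster-level action.
        {"resource": {"cluster": True}, "actions": ["listDatabases"]},
        # Grant the remaining actions on all non-system collections in all databases.
        {"resource": {"db": "", "collection": ""},
         "actions": ["listCollections", "collStats", "dbStats", "find"]},
    ],
    roles=[],
)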

How do MongoDB roles determine user permissions?

The MongoDB connector requires the find privilege on Collections to extract field information by sampling documents and running inference rules. The permissions of the database user are determined by the role assigned to that user.

When you create a database user for the connector, you can assign either:

  • Built-in roles (such as readAnyDatabase and clusterMonitor) that include all required privileges
  • A custom role that you define with specific privileges for fine-grained access control

Both approaches must include the find privilege on Collections to enable field extraction. For details on how field extraction works, see How does the connector extract field information?. For steps on creating a database user with the appropriate role, see Create database user in the setup guide.
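For example, a database user could be created with either approach using the createUser command; the user name, password placeholder, and role names below are hypothetical.

from pymongo import MongoClient

client = MongoClient("mongodb://mongodb.example.com:27017", username="admin", password="<password>")

# Option 1: built-in roles that include the required privileges.
client.admin.command(
    "createUser",
    "atlan_user",
    pwd="<password>",
    roles=["readAnyDatabase", "clusterMonitor"],
)

# Option 2: assign a custom role instead (see the previous question):
# client.admin.command("createUser", "atlan_user", pwd="<password>", roles=["atlan_metadata_role"])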

What happens when read permission on Collections is missing?

If the database user doesn't have read permission (the find action) on Collections, only basic metadata is cataloged in Atlan:

  • Databases: Database names and basic statistics are cataloged
  • Collections: Collection names and basic statistics (document count, size, etc.) are cataloged
  • Columns: Column information isn't available because the connector can't access collection documents to derive field information

The connector requires the find privilege to read collection data and extract field information by sampling documents and running inference rules. Without read permissions, the connector can't access collection documents, so it can't extract field names, types, or nested schema information.

To get complete metadata extraction including column-level information, grant the database user the find privilege on the Collections you want to crawl. For detailed steps on configuring user permissions, see Create database user.
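To verify whether the connector's database user actually holds the find privilege, you can inspect the effective privileges of that user's connection. This is a generic MongoDB check, not an Atlan command; the credentials and host are hypothetical.

from pymongo import MongoClient

# Connect as the same user the connector uses.
client = MongoClient("mongodb://mongodb.example.com:27017", username="atlan_user", password="<password>")

# connectionStatus with showPrivileges lists the roles and privileges of the current user.
status = client.admin.command("connectionStatus", showPrivileges=True)
for privilege in status["authInfo"]["authenticatedUserPrivileges"]:
    if "find" in privilege["actions"]:
        print("find granted on:", privilege["resource"])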

What does the sampling size workflow setting affect?

The Sampling Size parameter determines how many documents are sampled from each Collection during metadata extraction and field inference. This parameter is configured in the crawler settings. Adjusting this parameter affects both extraction performance and the completeness of the inferred schema:

  • Performance (job runtime): Reducing the sampling size results in quicker extraction times and lower resource utilization, making jobs complete faster and minimizing load on your MongoDB server.
  • Accuracy of field inference: Larger sampling sizes increase the likelihood of capturing all possible field names, types, and nested structures, resulting in more comprehensive and accurate metadata.

A small sample size may miss rarely occurring fields, while a large sample size provides a better schema picture but can significantly slow down extraction and increase the load on the MongoDB system.

Recommendation: Start with the default sampling size (1000) provided by the connector. If you notice that the inferred schema is missing fields or structure (especially from sparsely-used fields), incrementally increase the sampling size and re-run the workflow until you achieve accurate results without unnecessarily impacting performance.
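If you want to gauge this trade-off against your own data before changing the workflow setting, a quick comparison of how many distinct fields different sample sizes discover can help. The sketch below is illustrative and not part of the connector; the host, credentials, database, and collection names are hypothetical.

from pymongo import MongoClient

client = MongoClient("mongodb://mongodb.example.com:27017", username="atlan_user", password="<password>")
collection = client["sales"]["orders"]

def discovered_fields(sample_size):
    # Collect the set of top-level field names seen in a random sample.
    fields = set()
    for doc in collection.aggregate([{"$sample": {"size": sample_size}}]):
        fields.update(doc.keys())
    return fields

small = discovered_fields(100)
larger = discovered_fields(1000)
print("Fields found only with the larger sample:", larger - small)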

How are inferred fields handled across workflow runs?

Fields inferred for a Collection are additive across workflow runs. This means that when you run the MongoDB connector multiple times, newly discovered fields are added to the existing field metadata, and previously discovered fields are preserved even if they aren't found in the current sampling.

This additive approach is designed to preserve any columns from past runs that may have been enriched with additional metadata, descriptions, or tags in Atlan. For example, if a field was discovered in a previous run and you've added custom descriptions or tags to it, those enrichments are maintained even if that field doesn't appear in the current document sample.

To clean previous results: If you need to remove previously inferred fields and start fresh, delete the relevant Collection assets in Atlan and rerun the workflow. This removes all existing field metadata and enables the connector to rebuild the schema from scratch based on the current document sampling.
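Conceptually, the merge behaves like a union of the previously inferred fields and the newly inferred ones, as in this simplified sketch (field names and types are illustrative):

# Simplified illustration of additive field handling across runs.
previous_run = {"customer_id": "string", "loyalty_tier": "string"}   # loyalty_tier was enriched in Atlan
current_run = {"customer_id": "string", "signup_date": "date"}       # loyalty_tier wasn't sampled this run

merged = {**previous_run, **current_run}
print(merged)
# {'customer_id': 'string', 'loyalty_tier': 'string', 'signup_date': 'date'}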

How do I configure SSL/TLS certificates for MongoDB?

When configuring SSL/TLS for your MongoDB connection, you need to copy and paste the raw contents of your certificate files into the connection form fields. This includes the CA certificate and, if required, the certificate key file for client authentication.

Current process:

  1. Open your certificate file (CA certificate or certificate key file) in a text editor.
  2. Copy the entire contents of the file, including the header and footer lines (for example, -----BEGIN CERTIFICATE----- and -----END CERTIFICATE-----).
  3. Paste the complete contents into the corresponding field in the MongoDB connection form:
    • CA certificate field: Paste the CA certificate contents
    • Certificate key file field: Paste the client certificate key file contents (if client authentication is required)

Note: This copy-paste approach is a short-term workaround. File upload functionality is planned for a future release to provide a more streamlined user experience for SSL/TLS certificate configuration.

For detailed steps on configuring SSL/TLS during connection setup, see Choose extraction method in the crawl guide.
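For reference, the two form fields correspond to the standard MongoDB client TLS options. The sketch below shows how a driver such as PyMongo consumes the same certificates; the host and file paths are hypothetical.

from pymongo import MongoClient

client = MongoClient(
    "mongodb://mongodb.example.com:27017",
    tls=True,
    tlsCAFile="/path/to/ca.pem",                  # contents go into the CA certificate field
    tlsCertificateKeyFile="/path/to/client.pem",  # contents go into the certificate key file field
)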

How do database and collection filters work?

The MongoDB connector applies database and collection filters at different stages of the metadata extraction process:

Database filters (Include/Exclude Metadata):

  • Pushed down to source: Include and Exclude database filters are applied directly at the MongoDB query level. This means only the databases that match your filter criteria are queried from your MongoDB instance. Filtered-out databases are never accessed, which reduces network traffic and improves extraction performance.

Collection regex filter (Exclude regex for collections):

  • Applied post-reading: The exclude collections regex filter is applied after Collections are read from the source system. All Collections in the included databases are first discovered and read from MongoDB, then the regex pattern is applied in Atlan to filter out matching Collection names. This means excluded Collections are still queried from MongoDB but filtered out before being cataloged in Atlan.

Performance implications:

  • Database filters reduce the workload on your MongoDB server since filtered databases aren't queried.
  • Collection regex filters don't reduce MongoDB queries but help reduce the number of assets cataloged in Atlan.

For details on configuring these filters, see Configure crawler.
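The difference between the two filter stages can be illustrated with plain MongoDB commands; the database names and regex pattern below are hypothetical, and this sketch isn't the connector's actual code.

import re
from pymongo import MongoClient

client = MongoClient("mongodb://mongodb.example.com:27017", username="atlan_user", password="<password>")

# Database filter: pushed down to the source via the listDatabases filter,
# so excluded databases are never queried.
included = client.admin.command(
    "listDatabases", filter={"name": {"$in": ["sales", "inventory"]}}
)["databases"]

# Collection regex filter: collections are first listed from the source,
# then matching names are excluded client-side.
exclude_pattern = re.compile(r"^tmp_.*")
for db_info in included:
    names = client[db_info["name"]].list_collection_names()
    kept = [name for name in names if not exclude_pattern.match(name)]
    print(db_info["name"], kept)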

Does the connector support nested fields?

Yes, the MongoDB connector supports nested fields and infers them automatically. By default, the connector extracts nested fields up to 3 levels deep.

What this means:

Consider a MongoDB document with the following structure:

{
  "customer_id": "12345",
  "name": "John Doe",
  "address": {
    "street": "123 Main St",
    "city": "San Francisco",
    "country": {
      "code": "US",
      "name": "United States"
    }
  },
  "orders": [
    {
      "order_id": "ORD-001",
      "total": 99.99
    }
  ]
}

The connector extracts the following fields:

  • Level 1 (top-level fields): customer_id, name, address, orders
  • Level 2 (nested within address): address.street, address.city, address.country
  • Level 3 (nested within address.country): address.country.code, address.country.name
  • Level 2 (nested within orders array): orders.order_id, orders.total

Fields nested deeper than 3 levels (for example, address.country.region.name if it existed) aren't extracted by default. The connector uses qualified names with dot notation to represent the hierarchy, making it easy to understand the relationship between parent and nested fields in Atlan.
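A minimal sketch of this kind of depth-limited flattening (not the connector's actual code) might look like the following, producing dot-notation paths up to three levels deep for the example document above:

def flatten(document, prefix="", level=1, max_depth=3):
    # Return dot-notation field paths up to max_depth levels deep.
    paths = []
    for key, value in document.items():
        path = f"{prefix}{key}"
        paths.append(path)
        if level < max_depth:
            if isinstance(value, dict):
                paths.extend(flatten(value, f"{path}.", level + 1, max_depth))
            elif isinstance(value, list) and value and isinstance(value[0], dict):
                # Arrays of sub-documents are flattened through their first element.
                paths.extend(flatten(value[0], f"{path}.", level + 1, max_depth))
    return paths

document = {
    "customer_id": "12345",
    "name": "John Doe",
    "address": {
        "street": "123 Main St",
        "city": "San Francisco",
        "country": {"code": "US", "name": "United States"},
    },
    "orders": [{"order_id": "ORD-001", "total": 99.99}],
}
print(flatten(document))
# ['customer_id', 'name', 'address', 'address.street', 'address.city',
#  'address.country', 'address.country.code', 'address.country.name',
#  'orders', 'orders.order_id', 'orders.total']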

How are column schemas consolidated and what are the limitations?

Atlan consolidates multiple document schemas into a unified view, capturing all possible fields with their nesting levels and data types. The system performs a depth-first traversal, converting each unique field path into column entries while tracking parent-child relationships and hierarchy through qualified names.

Why is a Column's data type shown as null?

A Column's data type appears as null when the sample documents used for column metadata inference have no value for that specific column.
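A simplified sketch of how this arises during consolidation: when a field is seen in the sample only without values, no data type can be inferred for it. The field names below are hypothetical.

from collections import defaultdict

# Observed types per field across sampled documents (illustrative).
samples = [
    {"customer_id": "12345", "discount": None},
    {"customer_id": "67890"},                     # "discount" absent in this document
    {"customer_id": "24680", "discount": None},
]

observed = defaultdict(set)
for doc in samples:
    for field, value in doc.items():
        if value is not None:
            observed[field].add(type(value).__name__)
        else:
            observed.setdefault(field, set())     # field seen, but no value to infer a type from

# A field whose sampled values are all missing or None has no observed type,
# which surfaces as a null data type on the Column.
for field in ("customer_id", "discount"):
    print(field, sorted(observed[field]) or None)
# customer_id ['str']
# discount None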