Crawl Apache Kafka
Extract metadata assets from your Apache Kafka cluster into Atlan.
Prerequisites
Before you begin, complete the following prerequisites:
- Apache Kafka setup: You've configured the Apache Kafka permissions needed for Atlan to connect to your cluster.
- Schema Registry setup (if crawling schemas): You've completed the Confluent Schema Registry setup and have the Schema Registry endpoint, API key, and API secret ready.
- Order of operations: Review the order of operations to understand the sequence of tasks for crawling metadata.
- Access to Atlan workspace: You have the required permissions in Atlan to create and manage a connection.
Create crawler workflow
- In Atlan, select New > New Workflow.
- Select Apache Kafka Assets and click Setup Workflow.
Configure extraction
Select your extraction method and provide the connection details for your Apache Kafka cluster.
- Direct
- Agent
- Offline
Atlan connects directly to your Apache Kafka cluster and crawls metadata over the network.
- For Bootstrap servers, enter one or more hostnames of your Apache Kafka brokers. For multiple hostnames, separate each entry with a comma (,) or semicolon (;).
- For Authentication, choose the method that matches your cluster configuration:
- No Auth -- select this if your cluster doesn't require authentication.
- Basic -- enter the username and password configured for Atlan using SASL/PLAIN.
- SCRAM -- enter the username and password and choose the SCRAM mechanism (SCRAM-SHA-256 or SCRAM-SHA-512).
- mTLS -- upload the client certificate and private key for mutual TLS authentication.
- For Security protocol, select Plaintext or SSL for No Auth, and SASL_PLAINTEXT or SASL_SSL for Basic and SCRAM authentication.
- To crawl Schema Registry subjects alongside Kafka, set Include Schema Registry to True and provide the following details:
  - For Schema registry host, enter the URL of your Schema Registry endpoint (for example, https://psrc-xxxxx.us-east-2.aws.confluent.cloud).
  - For API Key, enter the Schema Registry API key you created.
  - For API Secret, enter the Schema Registry API secret you created.
- Click Test Authentication to confirm connectivity, then click Next.
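The fields above correspond to standard Kafka client connection properties. As a minimal sketch of how they fit together, the hypothetical helper below (not Atlan's API; field names simply mirror the form) maps each form choice onto a client configuration:

```python
# Sketch: assemble Kafka client properties from the crawler form fields above.
# build_client_config is an illustrative helper, not part of Atlan.

def build_client_config(bootstrap_servers, auth="No Auth",
                        username=None, password=None, scram_mechanism=None):
    """Map the crawler form fields onto standard Kafka client properties."""
    # Commas and semicolons are both accepted as broker separators.
    servers = [s.strip() for s in bootstrap_servers.replace(";", ",").split(",")
               if s.strip()]
    config = {"bootstrap.servers": ",".join(servers)}
    if auth == "No Auth":
        config["security.protocol"] = "PLAINTEXT"  # or SSL
    elif auth == "Basic":
        config.update({
            "security.protocol": "SASL_SSL",   # or SASL_PLAINTEXT
            "sasl.mechanism": "PLAIN",
            "sasl.username": username,
            "sasl.password": password,
        })
    elif auth == "SCRAM":
        config.update({
            "security.protocol": "SASL_SSL",   # or SASL_PLAINTEXT
            "sasl.mechanism": scram_mechanism,  # SCRAM-SHA-256 or SCRAM-SHA-512
            "sasl.username": username,
            "sasl.password": password,
        })
    return config

cfg = build_client_config("broker1:9092; broker2:9092", auth="SCRAM",
                          username="atlan", password="s3cret",
                          scram_mechanism="SCRAM-SHA-512")
```

Note how the Authentication choice drives both the SASL mechanism and the valid Security protocol values, which is why the form pairs them.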
Self-Deployed Runtime executes metadata extraction within your organization's environment, keeping all connections inside your network perimeter.
- Install Self-Deployed Runtime if you haven't already.
- Confirm the runtime can reach your Apache Kafka cluster over your local network and that network security is configured.
- Under Secure Agent Configuration, select your deployed agent from the Agent dropdown and the secret store from the Secret Store dropdown.
- For Bootstrap servers, enter one or more hostnames of your Apache Kafka brokers as reachable from within your network.
- For Authentication, choose the method that matches your cluster configuration:
- No Auth -- select this if your cluster doesn't require authentication.
- Basic -- reference the secret store path for the username and password configured for Atlan using SASL/PLAIN.
- SCRAM -- reference the secret store path for the username and password and choose the SCRAM mechanism (SCRAM-SHA-256 or SCRAM-SHA-512).
- mTLS -- reference the secret store paths for the client certificate and private key.
- For Security protocol, select Plaintext or SSL for No Auth, and SASL_PLAINTEXT or SASL_SSL for Basic and SCRAM authentication.
- To crawl Schema Registry subjects alongside Kafka, set Include Schema Registry to True and provide the following details:
  - For Schema registry host, enter the URL of your Schema Registry endpoint (for example, https://psrc-xxxxx.us-east-2.aws.confluent.cloud).
  - For API Key, reference the secret store path where the Schema Registry API key is stored.
  - For API Secret, reference the secret store path where the Schema Registry API secret is stored.
- Store sensitive credential values in your secret store and reference them in the corresponding fields. For more information, see Configure secrets for workflow execution.
- Click Next after completing the configuration.
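With agent-based extraction, the form fields hold secret-store references rather than plain values, and the agent resolves them inside your network at run time. A rough sketch of that pattern, assuming a hypothetical resolve_secret helper and an in-memory stand-in for your secret store:

```python
# Sketch: resolving secret-store references inside your network perimeter.
# SECRET_STORE and resolve_secret are illustrative; Atlan's actual secret
# store integration is configured in the workflow UI.

SECRET_STORE = {  # stands in for your real secret store
    "kafka/atlan/username": "atlan-crawler",
    "kafka/atlan/password": "s3cret",
}

def resolve_secret(reference):
    """Look up a secret-store path and return the stored value."""
    if reference not in SECRET_STORE:
        raise KeyError(f"secret not found: {reference}")
    return SECRET_STORE[reference]

config = {
    "bootstrap.servers": "broker.internal:9092",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "SCRAM-SHA-512",
    # The form stores references; the credentials never leave your network.
    "sasl.username": resolve_secret("kafka/atlan/username"),
    "sasl.password": resolve_secret("kafka/atlan/password"),
}
```

The design point: because only the references travel to Atlan, the actual credentials stay within your perimeter alongside the runtime.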
Atlan ingests metadata you extract and upload to object storage using the offline extraction method. First extract the metadata yourself, then make it available in object storage.
- For Bucket name, enter the name of your S3 or GCS bucket.
- For Bucket prefix, enter the prefix under which all metadata files exist. These include topics.json and topic-configs.json.
- Based on your cloud platform, enter the following details:
- AWS -- for Role ARN, enter the ARN of the AWS role to assume when copying files from S3.
- Azure -- enter your Azure Storage Account name and the SAS token for Blob SAS Token.
- Google Cloud -- grant Atlan's tenant service account the Storage Object Viewer role on your GCS bucket. Contact Atlan support to get your tenant's service account name.
- At the bottom of the screen, click Next.
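For orientation, the layout Atlan reads from object storage puts both files directly under the configured bucket prefix. A local staging sketch (the prefix path and JSON contents here are placeholders; follow the offline extraction method for the actual shape Atlan expects):

```python
# Sketch: staging offline extraction output before uploading it to S3/GCS/Blob.
# The file names come from the step above; the JSON bodies are placeholders.
import json
import pathlib
import tempfile

bucket_root = pathlib.Path(tempfile.mkdtemp())  # stands in for the bucket
prefix = bucket_root / "kafka" / "prod"         # your Bucket prefix
prefix.mkdir(parents=True)

# Both metadata files live directly under the configured prefix.
(prefix / "topics.json").write_text(json.dumps({"topics": []}))
(prefix / "topic-configs.json").write_text(json.dumps({"configs": []}))

staged = sorted(p.name for p in prefix.iterdir())
```

After uploading this layout, the Bucket name and Bucket prefix fields above point Atlan at exactly these files.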
Configure connection
Set up the connection identity and access controls for your Apache Kafka source.
- Provide a Connection Name that represents your source environment -- for example, production, development, gold, or analytics.
- Under Connection Admins, add the users or groups that can manage this connection. If you leave this empty, no one can manage the connection, including admins.
- At the bottom of the screen, click Next.
Configure crawling options
On the Metadata page, you can override the defaults for any of these options. If an asset appears in both include and exclude filters, the exclude filter takes precedence. When Schema Registry credentials are provided, the topic include/exclude regex also applies to schema subjects. Subjects are matched using their base topic name (stripping the -key or -value suffix).
- For Skip internal topics, keep the default Yes to skip internal Apache Kafka topics, or select No to crawl them.
- Click Exclude topics regex to exclude specific topics. Defaults to no exclusions if none are specified.
- Click Include topics regex to limit crawling to specific topics. Defaults to all topics if none are specified.
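The filtering rules above can be sketched as follows. This is an illustration of the described behavior only (exclude wins over include; subjects are matched on their base topic name with a trailing -key or -value stripped); whether Atlan anchors the regex match is an assumption here, modeled as full-string matching:

```python
# Sketch of the include/exclude semantics described above. Helper names
# are illustrative, not Atlan's implementation.
import re

def base_topic(subject):
    """Strip a trailing -key or -value from a Schema Registry subject name."""
    return re.sub(r"-(key|value)$", "", subject)

def should_crawl(name, include=None, exclude=None, is_subject=False):
    """Apply the topic filters; the exclude filter always takes precedence."""
    if is_subject:
        name = base_topic(name)  # subjects match on their base topic name
    if exclude and re.fullmatch(exclude, name):
        return False             # excluded even if include also matches
    if include:
        return re.fullmatch(include, name) is not None
    return True                  # no include regex: all topics by default
```

For example, with include orders.* and exclude orders-internal, the topic orders-internal is skipped even though both patterns match it, and the subject orders-value is crawled because it resolves to the base topic orders.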
Run crawler
After configuring all options, run or schedule the crawler.
- For Direct extraction, click Preflight checks to validate permissions and configuration before running. For Agent and Offline extraction, skip this step.
- Click Run to run the crawler once immediately, or click Schedule & Run to schedule the crawler to run hourly, daily, weekly, or monthly.
Once the crawler completes, the assets appear on Atlan's asset page.
See also
- What does Atlan crawl from Apache Kafka: Assets and metadata discovered during crawling
- Preflight checks for Apache Kafka: Validation checks for permissions and configuration