Crawl Google Cloud Knowledge Catalog
Configure and run the crawler to extract metadata from Google Cloud Knowledge Catalog. The crawler discovers Knowledge Catalog entries and their associated Aspect metadata.
Prerequisites
Before you begin, make sure you have:
- Set up Google Cloud Knowledge Catalog authentication with a service account and the required permissions
- Either a Google Cloud Service Account JSON key file, or Workload Identity Federation (WIF) credentials configured
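If you use a service account key, it can save a failed Test Authentication round-trip to confirm the downloaded key file is structurally complete before uploading it. The sketch below is a hypothetical helper (not part of Atlan or Google's tooling) that checks for the fields Google includes in every service-account JSON key:

```python
import json

# Fields present in every Google Cloud service-account JSON key file.
REQUIRED_FIELDS = {"type", "project_id", "private_key", "client_email", "token_uri"}

def check_service_account_key(path: str) -> list[str]:
    """Return a sorted list of required fields missing from the key file.

    An empty list means the file is structurally complete; it does NOT
    prove the key is valid or has the right permissions.
    """
    with open(path) as f:
        key = json.load(f)
    return sorted(REQUIRED_FIELDS - key.keys())
```

For example, `check_service_account_key("atlan-kc-key.json")` returns `[]` for an intact key file, or the names of any missing fields otherwise.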
Create crawler workflow
To crawl metadata from Google Cloud Knowledge Catalog, review the order of operations and then complete the following steps.
- In the top navigation, click Marketplace.
- Search for Google Knowledge Catalog and select it.
- Click Install.
- Once installation completes, click Setup Workflow on the same tile.
If you navigated away before installation completed, go to New > New Workflow and select Google Knowledge Catalog to proceed.
Configure authentication
- For Connectivity, choose how you want Atlan to connect to Google Knowledge Catalog:
  - Public Endpoint: Connect using Google's public Knowledge Catalog API endpoint.
  - Private Service Connect: Connect through a private endpoint. Contact Atlan support to request the DNS name of the Private Service Connect endpoint, then enter that DNS name in PSC Hostname.
- Choose an authentication method: Service account key or Workload Identity Federation (WIF).

  For Service account key, provide:
  - Service Account JSON: Select the Google Cloud service account credential with Knowledge Catalog permissions that you created during setup.
  - Project ID: Enter the Google Cloud project ID associated with your service account.

  For Workload Identity Federation (WIF), provide:
  - Project ID: Enter the Google Cloud project ID associated with your service account.
  - Service Account Email: Enter the email address of the service account to impersonate (for example, `atlan-kc@<project-id>.iam.gserviceaccount.com`).
  - WIF Pool Provider ID: Enter the WIF pool provider resource name in this format: `//iam.googleapis.com/projects/<project-number>/locations/global/workloadIdentityPools/<pool-id>/providers/<provider-id>`
  - Atlan OAuth Client ID: Enter the OAuth client ID from Atlan that was used when configuring the WIF provider.
  - Atlan OAuth Client Secret: Enter the corresponding OAuth client secret.

- After entering the authentication details, click Test Authentication to verify your configuration. If the test is successful, click Next to proceed.
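A malformed WIF Pool Provider ID is a common cause of authentication test failures. As an illustration of the expected resource-name shape (this regex and helper are assumptions for this sketch, not Atlan's actual validation), you could check the value before entering it:

```python
import re

# Shape of the WIF Pool Provider ID field:
# //iam.googleapis.com/projects/<project-number>/locations/global
#   /workloadIdentityPools/<pool-id>/providers/<provider-id>
WIF_PROVIDER_RE = re.compile(
    r"^//iam\.googleapis\.com/projects/(\d+)"
    r"/locations/global/workloadIdentityPools/([a-z0-9-]+)"
    r"/providers/([a-z0-9-]+)$"
)

def parse_wif_provider(resource_name: str) -> tuple[str, str, str]:
    """Return (project_number, pool_id, provider_id), or raise ValueError."""
    m = WIF_PROVIDER_RE.match(resource_name)
    if not m:
        raise ValueError(f"Not a valid WIF provider resource name: {resource_name!r}")
    return m.groups()
```

Note that the resource name uses the numeric project number, not the project ID, and starts with a double slash.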
Configure connection
Set up the connection name and access controls for your Google Cloud Knowledge Catalog data source in Atlan.
- Provide a Connection Name that represents your source environment. For example, you might use values like `production`, `development`, or `knowledge-catalog`.
- To change the users able to manage this connection, update the users or groups listed under Connection Admins. If you don't specify any user or group, nobody can manage the connection (not even admins).
- At the bottom of the screen, click Next to proceed.
Configure crawler
Configure which Knowledge Catalog entries to extract and which optional features to enable.
- Connection: Select the BigQuery connection whose assets the Knowledge Catalog entries are linked to. This is required, since Knowledge Catalog Aspects and scan results are attached to BigQuery assets in Atlan.
- Include Projects: (Optional) Enter one or more GCP project IDs to restrict the crawl to those projects. If not specified, all projects accessible to the service account are ingested.
- Exclude Projects: (Optional) Enter one or more GCP project IDs to skip during the crawl.
- Include Aspect Types: (Optional) Select specific Aspect Types to include. If specified, only entries using these Aspect Types are extracted. Leave empty to extract all Aspects.
- Exclude Aspect Types: (Optional) Select Aspect Types to exclude. Entries using these Aspect Types are skipped during extraction.
- Ingest Data Quality: (Optional) Enable to extract Data Quality scan results and attach them to the corresponding BigQuery assets. When enabled, the last 7 run results per scan are ingested. Requires the `dataplex.datascans.list`, `dataplex.datascans.get`, and `dataplex.datascans.getData` permissions.
- Ingest Data Profiling: (Optional) Enable to extract Data Profiling scan results and attach them to the corresponding BigQuery assets. Requires the `dataplex.datascans.list`, `dataplex.datascans.get`, and `dataplex.datascans.getData` permissions.
- Enable Aspects Reverse Sync: (Optional) Enable to allow Aspect field values edited in Atlan to be written back to Knowledge Catalog. Reverse sync targets BigQuery Tables, Views, Materialized Views, Columns, Schemas, and Routines. Disabled by default.
- At the bottom of the screen, click Next to proceed.
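The include/exclude options above follow a common filtering convention: an empty include list means "everything", and an exclusion always wins over an inclusion. The sketch below illustrates that convention (it is an assumption about precedence for illustration, not Atlan's internal implementation):

```python
def should_crawl(item: str, include: set[str], exclude: set[str]) -> bool:
    """Decide whether a project or Aspect Type is crawled.

    Convention assumed here: excludes take precedence, and an empty
    include set means no restriction (crawl everything not excluded).
    """
    if item in exclude:
        return False
    return not include or item in include
```

Under this convention, listing a project in both Include Projects and Exclude Projects would skip it.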
Run crawler
- Click Preflight checks to validate permissions and configuration before running the crawler.
- After the preflight checks pass, you can either:
- Click Run to run the crawler once immediately.
- Click Schedule Run to schedule the crawler to run hourly, daily, weekly, or monthly.
Once the crawler has completed running, you can view the crawled assets on Atlan's assets page. Monitor progress in the Workflows section and check the Logs tab for detailed execution information.
See also
- What does Atlan crawl from Knowledge Catalog: Learn what assets, metadata, and lineage Atlan crawls from Knowledge Catalog.