Crawl Google BigQuery
Extract metadata assets from your Google BigQuery data warehouse into Atlan.
Prerequisites
Before you begin, verify you have:
- Completed the Set up Google BigQuery guide or Set up Workload Identity Federation
- Access to your Google Cloud project and BigQuery
- Reviewed the order of operations
Create crawler workflow
Create a new workflow and select Google BigQuery as your connector source.
- In the top-right corner of any screen, select New > New Workflow.
- From the list of packages, select BigQuery Assets > Setup Workflow.
Configure extraction
When setting up metadata extraction from your Google BigQuery data, choose how Atlan connects and extracts metadata. Select the extraction method that best fits your organization's security and network requirements:
- Direct
- Agent
Direct: Atlan SaaS connects directly to Google BigQuery (typically via the public BigQuery API or Private Service Connect). This method supports multiple authentication options and lets you test the connection before proceeding.
- For Connectivity, choose how you want Atlan to connect to Google BigQuery:
  - Public Network: Connect using the public BigQuery API endpoint from Google.
  - Private Network Link: Connect through a private endpoint. Contact Atlan support to request the DNS name of the Private Service Connect endpoint. For Host, enter the DNS name in the format https://bigquery-<privateserver>.p.googleapis.com. Replace <privateserver> with the DNS name. For Port, 443 is the default.
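As a rough illustration of the Host format above, the following Python sketch builds the endpoint URL from the name that Atlan support provides. The function name private_host and the value example-psc are hypothetical, and the character check is an assumption for illustration, not an Atlan rule.

```python
import re

DEFAULT_PORT = 443  # default port stated in the Private Network Link option


def private_host(privateserver: str) -> str:
    """Build the Host value for a Private Network Link connection."""
    # Assumption: the name contains only letters, digits, and hyphens.
    if not re.fullmatch(r"[A-Za-z0-9-]+", privateserver):
        raise ValueError(f"unexpected characters in name: {privateserver!r}")
    return f"https://bigquery-{privateserver}.p.googleapis.com"


# "example-psc" is a placeholder, not a real endpoint name.
print(private_host("example-psc"), DEFAULT_PORT)
```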
- Choose an authentication method for your direct connection:
  - Service account
  - Workload Identity Federation
Service account: Use a service account key for authentication.
- Project Id: Enter the value of project_id from the JSON for the service account you created. This project ID is used to authenticate the connection. You can configure the crawler to extract more than the specified project.
- Service Account Json: Paste the entire JSON for the service account you created.
- Service Account Email: Enter the value of client_email from the JSON for the service account you created.
After entering the authentication details, click Test Authentication to verify your configuration. If the test is successful, click Next to proceed with the connection configuration.
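To show where the three form values come from, here is a minimal Python sketch that reads project_id and client_email out of a service account key. The helper name fields_from_key and all sample values are hypothetical; a real key file contains additional fields, including the private key, which are omitted here.

```python
import json


def fields_from_key(key_json: str) -> dict:
    """Return the values to enter in the form, derived from the key JSON."""
    key = json.loads(key_json)
    missing = [name for name in ("project_id", "client_email") if name not in key]
    if missing:
        raise KeyError(f"service account JSON is missing: {missing}")
    return {
        "Project Id": key["project_id"],
        "Service Account Email": key["client_email"],
        "Service Account Json": key_json,  # pasted whole, unchanged
    }


# Placeholder key with made-up values.
sample = json.dumps({
    "type": "service_account",
    "project_id": "my-project",
    "client_email": "crawler@my-project.iam.gserviceaccount.com",
})

for label, value in fields_from_key(sample).items():
    print(f"{label}: {value}")
```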
Workload Identity Federation: Authenticate using Workload Identity Federation (WIF).
- Project Id: Enter your Google Cloud project ID. This project ID is used to authenticate the connection. You can configure the crawler to extract more than the specified project.
- Service Account Email: Enter the email of the service account that has BigQuery permissions and is configured for WIF impersonation.
- WIF Pool Provider Id: Enter the full resource name of your WIF provider in the following format:
  //iam.googleapis.com/projects/<project-number>/locations/global/workloadIdentityPools/<pool-id>/providers/<provider-id>
- Atlan OAuth Client Id: Enter the OAuth Client ID you created in Atlan during WIF setup.
- Atlan OAuth Client Secret: Enter the OAuth Client Secret you created in Atlan.
After entering the authentication details, click Test Authentication to verify your configuration. If the test is successful, click Next to proceed with the connection configuration.
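Before pasting the WIF Pool Provider Id, it can help to check that the value matches the full resource name format shown above. The sketch below is one way to do that in Python. The function name parse_wif_provider and the sample IDs are hypothetical, and the pattern assumes the project number is numeric and that pool and provider IDs contain no slashes.

```python
import re

# Mirrors the format shown above; the character classes are assumptions.
WIF_PROVIDER = re.compile(
    r"//iam\.googleapis\.com/projects/(?P<project_number>\d+)"
    r"/locations/global/workloadIdentityPools/(?P<pool_id>[^/]+)"
    r"/providers/(?P<provider_id>[^/]+)"
)


def parse_wif_provider(resource_name: str) -> dict:
    """Split a full WIF provider resource name into its parts."""
    match = WIF_PROVIDER.fullmatch(resource_name)
    if match is None:
        raise ValueError("not a full WIF provider resource name")
    return match.groupdict()


# Placeholder values for illustration only.
example = (
    "//iam.googleapis.com/projects/123456789012/locations/global/"
    "workloadIdentityPools/atlan-pool/providers/atlan-provider"
)
print(parse_wif_provider(example))
```

A value that omits the //iam.googleapis.com/ prefix fails this check, which is a common way to end up with a partial resource name.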
Agent: Atlan's Secure Agent application is deployed within your organization and connects to Google BigQuery. This method provides additional security by keeping connections within your network perimeter.
- Install Self-Deployed Runtime if you haven't already.
- For Connectivity, choose how you want Atlan to connect to Google BigQuery:
  - Public Network: Connect using the public BigQuery API endpoint from Google.
  - Private Network Link: Connect through a private endpoint. Contact Atlan support to request the DNS name of the Private Service Connect endpoint. For Host, enter the DNS name in the format https://bigquery-<privateserver>.p.googleapis.com. Replace <privateserver> with the DNS name. For Port, 443 is the default.
- Choose an authentication method for your agent-based connection:
  - Service account
  - Workload Identity Federation
  - Workload Identity Federation for GKE
Service account: Use a service account key for authentication.
- Project Id: Enter the value of project_id from the JSON for the service account you created.
- Service Account Json: Paste the entire JSON for the service account you created.
- Service Account Email: Enter the value of client_email from the JSON for the service account you created.
Click Next to proceed with the connection configuration.
Workload Identity Federation: Authenticate using Workload Identity Federation (WIF) from within your agent environment.
- Project Id: Enter your Google Cloud project ID.
- Service Account Email: Enter the email of the service account that has BigQuery permissions and is configured for WIF impersonation.
- WIF Pool Provider Id: Enter the full resource name of your WIF provider in the following format:
  //iam.googleapis.com/projects/<project-number>/locations/global/workloadIdentityPools/<pool-id>/providers/<provider-id>
- Atlan OAuth Client Id: Enter the OAuth Client ID you created in Atlan during WIF setup.
- Atlan OAuth Client Secret: Enter the OAuth Client Secret you created in Atlan.
Click Next to proceed with the connection configuration.
Workload Identity Federation for GKE: This method is available when the Self-Deployed Runtime is deployed on a Google Kubernetes Engine (GKE) cluster. The Kubernetes Service Account that runs the Secure Agent application impersonates an IAM service account that has the required permissions on Google BigQuery.
For more information, see Workload Identity Federation for GKE.
- Configure Workload Identity Federation: Set up the federation between your GKE cluster and Google Cloud IAM. For detailed instructions, see Workload Identity Federation with Kubernetes: use service account impersonation.
- Enter the following details:
  - IAM service account name: Enter the IAM service account name that the Kubernetes Service Account impersonates. Remove the .gserviceaccount.com suffix if present.
  - Project Id: Enter your Google Cloud project ID.
  - Connectivity: Choose Public Network or Private Network Link as needed.
- Click Next to proceed with the connection configuration.
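The suffix rule for the IAM service account name can be read literally as a string operation. The Python sketch below applies that literal reading. The function name iam_account_name and the sample address are hypothetical, and the exact value Atlan expects should be confirmed against your own setup.

```python
SUFFIX = ".gserviceaccount.com"


def iam_account_name(value: str) -> str:
    """Drop the .gserviceaccount.com suffix if the value ends with it."""
    value = value.strip()
    if value.endswith(SUFFIX):
        return value[: -len(SUFFIX)]
    return value


# Placeholder address for illustration only.
print(iam_account_name("crawler@my-project.iam.gserviceaccount.com"))
```

Values that do not carry the suffix pass through unchanged, so the helper is safe to apply twice.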
Configure connection
Set up the connection name and access controls for your Google BigQuery data source in Atlan.
- Provide a Connection Name that represents your source environment. For example, you might use values like production, development, gold, or analytics.
- To change the users able to manage this connection, update the users or groups listed under Connection Admins. If you don't specify any user or group, nobody can manage the connection (not even admins).
- To prevent users from querying Google BigQuery data, set Allow SQL Query to No.
- To prevent users from previewing Google BigQuery data, set Allow Data Preview to No.
- At the bottom of the screen, click Next to proceed.
Configure crawler
Before running the crawler, you can configure which assets to include or exclude and other crawler options. Include and exclude metadata filters are only available when using the direct extraction method. If an asset appears in both the include and exclude filters, the exclude filter takes precedence.
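The precedence rule above can be sketched in a few lines of Python. This is only an illustration: the function name is_crawled and the flat asset names are hypothetical, and Atlan's actual filter matching may work differently. The sketch also encodes the two defaults: an empty include filter means all assets, and an empty exclude filter means no assets.

```python
def is_crawled(asset: str, include: set, exclude: set) -> bool:
    """Decide whether an asset is crawled under the stated filter rules."""
    # Exclude takes precedence, even if the asset is also included.
    if asset in exclude:
        return False
    # An empty include filter means every remaining asset is crawled.
    return not include or asset in include


# Placeholder dataset names for illustration only.
include = {"sales", "finance"}
exclude = {"finance"}
for name in ("sales", "finance", "marketing"):
    print(name, is_crawled(name, include, exclude))
```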
- For Filter Sharded Tables, keep No for the default configuration or click Yes to enable Atlan to catalog and display sharded tables with the same naming prefix as a single table in asset discovery and the lineage graph.
- To exclude specific assets from crawling, select Exclude Metadata. This defaults to no assets if none are specified.
- To include specific assets in crawling, select Include Metadata. This defaults to all assets if none are specified.
- To ignore tables and views based on a naming convention, specify a regular expression in the Exclude regex for tables & views field.
- To import existing tags from Google BigQuery to Atlan, for Import Tags, click Yes.
- For Advanced Config, keep Default for the default configuration or click Custom if Atlan support has provided you with a custom control configuration:
  - Enter the configuration into the Custom Config box. You can also enter {"ignore-all-case": true} to enable crawling assets with case-sensitive identifiers.
  - For Hidden Assets, keep No for the default configuration or click Yes to crawl metadata from your hidden datasets in Google BigQuery.
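Two of the options above, Filter Sharded Tables and the exclude regex, are easier to picture with a small example. The Python sketch below assumes that shards follow BigQuery's date-sharded naming convention (a prefix followed by _YYYYMMDD) and that the exclude regex matches anywhere in the table name. Both are assumptions made for illustration, as are the function names and table names, and they may not match Atlan's implementation.

```python
import re
from collections import defaultdict

# Assumption: a shard is named <prefix>_YYYYMMDD.
SHARD = re.compile(r"(?P<prefix>.+)_(?P<date>\d{8})")


def group_shards(tables):
    """Group shards under their shared prefix; other tables stand alone."""
    groups = defaultdict(list)
    for name in tables:
        match = SHARD.fullmatch(name)
        groups[match.group("prefix") if match else name].append(name)
    return dict(groups)


def apply_exclude_regex(tables, pattern):
    """Drop tables and views whose names match the pattern."""
    # Assumption: search semantics, so anchor the pattern with ^ or $ as needed.
    regex = re.compile(pattern)
    return [name for name in tables if not regex.search(name)]


# Placeholder table names for illustration only.
tables = ["events_20240101", "events_20240102", "users", "tmp_scratch"]
print(group_shards(tables))
print(apply_exclude_regex(tables, r"^tmp_"))
```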
Run crawler
Direct:
- Click Preflight checks to validate permissions and configuration before running the crawler. This helps identify potential issues early.
- After the preflight checks pass, you can either:
  - Click Run to run the crawler once immediately.
  - Click Schedule Run to schedule the crawler to run hourly, daily, weekly, or monthly.
Agent:
You can either:
- Click Run to run the crawler once immediately.
- Click Schedule Run to schedule the crawler to run hourly, daily, weekly, or monthly.
Once the crawler has finished running, you can view the assets on Atlan's asset page.
See also
- How Atlan connects to Google BigQuery: Connectivity, authentication, and data access patterns
- What does Atlan crawl from Google BigQuery: Learn about the Google BigQuery assets and metadata that Atlan discovers and catalogs
- Preflight checks for Google BigQuery: Verify permissions and configuration before crawling
- Troubleshooting Google BigQuery connectivity: Resolve common connection issues