Connect BigQuery to Lakehouse
BigQuery accesses Atlan's Lakehouse through external Iceberg tables you create using a provided Python script from the Lakehouse Solutions repository. The script discovers your Lakehouse namespaces and maps them to BigQuery datasets. Setup requires a BigQuery connection in GCP and a one-time storage access grant from Atlan Support. The connection type and region naming depend on whether your Atlan tenant's Lakehouse is backed by GCS or by S3:
- GCS-backed tenants use a Cloud Resource connection and a standard GCP region name (for example, us-east1).
- S3-backed tenants use a BigLake on AWS connection (BigQuery Omni) and an AWS-prefixed location (for example, aws-us-east-1).
If you don't know which type your tenant uses, contact Atlan Support.
Prerequisites
Before you begin, make sure that:
- You have installed the required Python dependencies:
  pip install "pyiceberg[pyarrow]" google-cloud-bigquery
- All four resources are in the same region: the storage bucket, BigQuery connection, BigQuery dataset, and query execution location. Region mismatches cause location-related errors. For S3-backed tenants, the BigQuery connection, dataset, and query execution location must use the AWS-prefixed form (for example, aws-us-east-1).
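The region rule above can be checked before anything is created. The following is a minimal sketch, not part of the provided script; the function names and the "gcs"/"s3" backend labels are illustrative assumptions.

```python
def expected_bigquery_location(backend: str, storage_region: str) -> str:
    """Return the BigQuery location implied by the storage region."""
    region = storage_region.strip().lower()
    if backend == "gcs":
        return region
    if backend == "s3":
        # S3-backed tenants use the AWS-prefixed form, e.g. aws-us-east-1.
        return region if region.startswith("aws-") else f"aws-{region}"
    raise ValueError(f"unknown backend: {backend!r}")


def find_location_mismatches(backend, storage_region, **bigquery_locations):
    """Return {resource_name: location} for every resource in the wrong location."""
    expected = expected_bigquery_location(backend, storage_region)
    return {
        name: location
        for name, location in bigquery_locations.items()
        if location.strip().lower() != expected
    }


mismatches = find_location_mismatches(
    "s3",
    "us-east-1",
    connection="aws-us-east-1",
    dataset="aws-us-east-1",
    query="us-east-1",  # missing the aws- prefix
)
print(mismatches)  # {'query': 'us-east-1'}
```

An empty result means the connection, dataset, and query location all agree with the storage region.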
Set up external tables in BigQuery
Before you begin, in your Atlan workspace, navigate to Workflow > Marketplace > Atlan Lakehouse > View connection details and note your catalog URI, catalog name, storage region, OAuth client ID, and OAuth client secret. If you don't see them, contact Atlan Support.
GCP (GCS)
These steps apply to GCS-backed Lakehouse tenants.
- In your GCP project, create a Cloud Resource connection in the same region as the GCS data provided by Atlan.
  BigQuery UI:
  - In the BigQuery console, navigate to Explorer > your project > + Add data.
  - Search for cloud resource, then select Vertex AI > BigQuery federation.
  - Enter a connection name (for example, atlan-mdlh-conn) and select the region provided by Atlan.
  - On the connection info page, note the Service Account ID.
  CLI:
  Run the following command to create the connection:
    bq mk --connection \
      --project_id=<PROJECT_ID> \
      --location=<REGION> \
      --connection_type=CLOUD_RESOURCE \
      atlan-mdlh-conn
  Then retrieve the Service Account ID:
    bq show --connection --location=<REGION> <CONNECTION_ID>
- Send the Service Account ID to Atlan Support. Atlan grants the service account the Storage Object Viewer role (roles/storage.objectViewer) on the Lakehouse GCS bucket. This role includes storage.objects.list, which BigQuery requires for file pattern expansion on external tables, and storage.objects.get, which it needs to read the data. You'll receive a confirmation email once access is enabled. At that point, continue with Configure and run script.
  Note: If the role hasn't been granted yet, BigQuery queries on Lakehouse tables can fail with:
    Permission 'storage.objects.list' denied on resource (or it may not exist).
  To set up multiple connections for different environments (for example, separate BQ_CONNECTION_ID values for dev and prod), include all service account IDs in a single support ticket. All connections access the same Lakehouse snapshot with no isolation between environments.
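If you retrieve the Service Account ID from the CLI, a small helper can pull it out of JSON output. This sketch assumes that running bq show with --format=json prints the connection resource with a cloudResource.serviceAccountId field, as in the BigQuery Connection API; the sample payload and the function name are made up for illustration, so check them against your own output.

```python
import json


def service_account_id(connection_json: str) -> str:
    """Pull the service account ID out of a connection resource in JSON form."""
    connection = json.loads(connection_json)
    try:
        return connection["cloudResource"]["serviceAccountId"]
    except KeyError as exc:
        raise ValueError(
            "not a Cloud Resource connection, or unexpected output shape"
        ) from exc


# Hypothetical output, for example captured with:
#   bq show --format=json --connection --location=<REGION> <CONNECTION_ID>
sample = """
{
  "name": "projects/123456789/locations/us-east1/connections/atlan-mdlh-conn",
  "cloudResource": {
    "serviceAccountId": "bqcx-123456789-abcd@gcp-sa-bigquery-condel.iam.gserviceaccount.com"
  }
}
"""
print(service_account_id(sample))
```

When you set up several connections, running the helper once per connection gives you the list of IDs to include in a single support ticket.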
AWS (S3)
These steps apply to S3-backed Lakehouse tenants.
Before proceeding, you must have received the IAM Role ARN and AWS region from Atlan. If you haven't received them, open a ticket with Atlan Support, specify your Atlan tenant URL (for example, yourcompany.atlan.com), and request BigQuery on Lakehouse access for an AWS-backed tenant.
- In your GCP project, enable the BigQuery Connection API and the BigLake API. Both are required for BigLake on AWS (BigQuery Omni) connections.
- Create a BigLake on AWS connection in the same aws-<region> location as the S3 data provided by Atlan, using the IAM Role ARN provided by Atlan.
  BigQuery UI:
  - In the BigQuery console, navigate to Explorer > your project > + Add data.
  - Search for Amazon S3, then select BigQuery Omni federation.
  - Enter a connection name (for example, atlan-mdlh-conn).
  - For Connection location type, select AWS and choose the location provided by Atlan (for example, aws-us-east-1).
  - Paste the IAM Role ARN provided by Atlan.
  - On the connection info page, note the BigQuery Google identity (a numeric subject value).
  CLI:
  Run the following command to create the connection. Use the AWS-prefixed location (for example, aws-us-east-1) provided by Atlan:
    bq mk --connection \
      --project_id=<PROJECT_ID> \
      --location=aws-<REGION> \
      --connection_type=AWS \
      --properties='{"crossAccountRole": "<ATLAN_IAM_ROLE_ARN>"}' \
      atlan-mdlh-conn
  Then retrieve the Google identity of the connection:
    bq show --connection --location=aws-<REGION> <PROJECT_ID>.aws-<REGION>.<CONNECTION_ID>
  Look for the identity field in the output. It's a numeric subject value.
- Send the BigQuery Google identity value, along with your Atlan tenant URL, to Atlan Support. Atlan adds this identity to the IAM role's trust policy, allowing BigQuery to assume the role and read the Lakehouse S3 bucket. You'll receive a confirmation email once access is enabled. At that point, continue with Configure and run script.
  Note: If the trust policy hasn't been updated yet, BigQuery queries on Lakehouse tables can fail with:
    Connection ... isn't authorized to assume into your AWS IAM Role.
  To set up multiple connections for different environments (for example, separate BQ_CONNECTION_ID values for dev and prod), include all Google identity values in a single support ticket. All connections access the same Lakehouse snapshot with no isolation between environments.
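Shell quoting around the --properties JSON is easy to get wrong, so one option is to assemble the command in Python and let json.dumps produce the payload. This sketch only mirrors the flags shown in the CLI steps above; the function name, project ID, and ARN are placeholders, and nothing is executed.

```python
import json
import shlex


def aws_connection_command(project_id, aws_region, role_arn,
                           connection_id="atlan-mdlh-conn"):
    """Build the argv list for the bq mk command shown in the CLI steps."""
    # Accept either "us-east-1" or "aws-us-east-1" and normalize to the latter.
    location = aws_region if aws_region.startswith("aws-") else f"aws-{aws_region}"
    properties = json.dumps({"crossAccountRole": role_arn})
    return [
        "bq", "mk", "--connection",
        f"--project_id={project_id}",
        f"--location={location}",
        "--connection_type=AWS",
        f"--properties={properties}",
        connection_id,
    ]


cmd = aws_connection_command(
    "my-project",                                   # placeholder project ID
    "us-east-1",
    "arn:aws:iam::123456789012:role/example-role",  # placeholder ARN
)
print(shlex.join(cmd))  # a copy-pasteable, correctly quoted command line
```

Passing the list to subprocess.run(cmd, check=True) would run it without going through a shell, which avoids the quoting problem entirely.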
Configure and run script
Once Atlan has confirmed that data sharing is enabled:
- Download bq_external_iceberg_tables_create_refresh.py from the Lakehouse Solutions repository. Set the following values as environment variables or edit them directly in the script:
  - BQ_PROJECT_ID: GCP project ID where the connection was created
  - BQ_LOCATION: Storage/BigQuery region (must match across all resources). For S3-backed tenants, use the AWS-prefixed form (for example, aws-us-east-1).
  - BQ_CONNECTION_ID: Connection name created in the previous section
  - POLARIS_CATALOG_URI: Catalog URI provided by Atlan
  - CLIENT_ID: OAuth Client ID provided by Atlan
  - CLIENT_SECRET: OAuth Client Secret provided by Atlan
  - ENABLE_HISTORY_NAMESPACE_SYNC: Set to true to include the atlan-history namespace (default: false)
- In your terminal, run the script:
  python bq_external_iceberg_tables_create_refresh.py
  The script autodetects the Polaris warehouse, discovers all namespaces and tables, and creates BigQuery datasets per namespace.
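Before running the script, it can help to confirm that every setting from the first step is present. The sketch below is an illustration only and doesn't reflect how the provided script reads its configuration; load_config and the example values are assumptions.

```python
import os

REQUIRED = (
    "BQ_PROJECT_ID", "BQ_LOCATION", "BQ_CONNECTION_ID",
    "POLARIS_CATALOG_URI", "CLIENT_ID", "CLIENT_SECRET",
)


def load_config(env=None):
    """Read the settings from a mapping, failing fast on missing values."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing settings: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED}
    # Optional flag: anything other than "true" (case-insensitive) counts as false.
    raw = env.get("ENABLE_HISTORY_NAMESPACE_SYNC", "false")
    config["ENABLE_HISTORY_NAMESPACE_SYNC"] = raw.strip().lower() == "true"
    return config


# Placeholder values throughout; substitute the details from your Atlan workspace.
example_env = {
    "BQ_PROJECT_ID": "my-project",
    "BQ_LOCATION": "aws-us-east-1",
    "BQ_CONNECTION_ID": "atlan-mdlh-conn",
    "POLARIS_CATALOG_URI": "https://catalog.example.invalid",
    "CLIENT_ID": "example-client-id",
    "CLIENT_SECRET": "example-client-secret",
}
config = load_config(example_env)
print(config["ENABLE_HISTORY_NAMESPACE_SYNC"])  # False
```

Calling load_config() with no argument reads from the process environment instead of the example mapping.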
Example: To verify the setup and query metadata for assets registered in Atlan, run:
  SELECT *
  FROM `<PROJECT_ID>.gold.assets`
  LIMIT 10;
Dataset names follow BigQuery naming rules, so hyphens are converted to underscores (for example, atlan-ns becomes atlan_ns). External Iceberg tables are created or replaced in each dataset using CREATE OR REPLACE EXTERNAL TABLE, so the script is safe to re-run for both initial setup and ongoing refresh.
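To make the naming rule and the re-runnable DDL concrete, here is a rough sketch of how a statement for one table might be rendered. It isn't taken from the provided script: the function names, bucket, and metadata path are hypothetical, and the metadata URI is assumed to come from the catalog (in PyIceberg, a loaded table exposes it as metadata_location).

```python
def dataset_name(namespace: str) -> str:
    """Map a Lakehouse namespace to a BigQuery dataset name."""
    return namespace.replace("-", "_")  # hyphens become underscores


def external_table_ddl(project_id, location, connection_id,
                       namespace, table, metadata_uri):
    """Render a CREATE OR REPLACE EXTERNAL TABLE statement for one Iceberg table."""
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE "
        f"`{project_id}.{dataset_name(namespace)}.{table}`\n"
        f"WITH CONNECTION `{project_id}.{location}.{connection_id}`\n"
        f"OPTIONS (format = 'ICEBERG', uris = ['{metadata_uri}']);"
    )


ddl = external_table_ddl(
    "my-project",        # placeholder project ID
    "us-east1",
    "atlan-mdlh-conn",
    "atlan-ns",
    "assets",
    "gs://example-bucket/atlan-ns/assets/metadata/00001.metadata.json",  # made-up path
)
print(ddl)
```

Because the statement replaces the table definition each time, running it again simply repoints the table at the newest metadata file.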
Refresh external tables
The external tables don't sync automatically. Re-run the script periodically to keep them up to date with the latest Lakehouse data.
- In your terminal, run the script:
  python bq_external_iceberg_tables_create_refresh.py
- Schedule the script to run on a recurring basis:
  - Maximum frequency: No more than once every 30 minutes
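If you drive the refresh from your own scheduler, a small guard can enforce the 30-minute floor. This is one possible sketch with hypothetical names; a cron entry or workflow scheduler configured for the same interval would work equally well.

```python
from datetime import datetime, timedelta, timezone

MIN_INTERVAL = timedelta(minutes=30)


def refresh_due(last_run, now=None, interval=MIN_INTERVAL):
    """Return True when enough time has passed since the previous refresh."""
    if interval < MIN_INTERVAL:
        raise ValueError("refresh interval must be at least 30 minutes")
    if last_run is None:
        return True  # never refreshed before
    now = now or datetime.now(timezone.utc)
    return now - last_run >= interval


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(refresh_due(start, now=start + timedelta(minutes=10)))  # False
print(refresh_due(start, now=start + timedelta(minutes=30)))  # True
```

When refresh_due returns True, the scheduler would invoke the script and record the new run time.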
Troubleshooting
If you have any issues configuring or querying external Iceberg tables in BigQuery, see Troubleshooting BigQuery errors.
Next steps
Now that BigQuery is connected to Lakehouse, you can:
- Query Atlan metadata from BigQuery: See the available metadata tables in Entity metadata reference.
- Use cases: Explore popular patterns such as metadata enrichment tracking, lineage impact analysis, and glossary alignment in Use cases.