Mine queries through S3
Once you have crawled assets from a supported connector, you can mine query history.
Supported connectors include the following:
- Amazon Redshift
- Google BigQuery
- Hive
- Microsoft Azure Synapse Analytics
- Microsoft SQL Server
- Snowflake
- Teradata
For each of the supported connectors, Atlan supports mining query history via S3. This is useful when you have files that hold query history beyond what your source retains.
To mine lineage from these sources via S3, complete the following steps.
Structure query files
To make query history files available for Atlan, verify that the files:
- Use a `.json` extension.
- Are present in a single S3 bucket and prefix (directory).
To structure the contents of the files for Atlan, verify that:
- Each line is a single JSON value. (The JSON object can't be pretty-formatted or span multiple lines.)
- Each SQL query is on its own line.
- Commas aren't used to separate lines.
You can also provide a default database and schema, and session IDs in the JSON.
- If a SQL query has only the name of the table or view it queries, Atlan uses the default database and schema to generate lineage for the query.
- Including the session ID speeds up lineage processing. If provided, verify that all queries belonging to the same session are next to each other in the file.
Here's an example of the JSON format. (It's split across multiple lines here for readability, but remember it must all be on a single line in the file!)
{
"QUERY_TEXT": "insert into NETFLIX_DB.PUBLIC.MOVIES_FILTERED as select m.* from MOVIES m where m.RATING > 5;",
"DATABASE_NAME": "NETFLIX_DB",
"SCHEMA_NAME": "PUBLIC",
"SESSION_ID": "5c2f0a41-5d02-46f1-b9bd-ef80ad571013"
}
The names of the keys (or properties) in the JSON can be configured while setting up the miner package. In the example here, the default database (DATABASE_NAME) and schema (SCHEMA_NAME) are used to qualify the query for the table MOVIES as NETFLIX_DB.PUBLIC.MOVIES.
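Putting this together, here's a sketch of how a two-query file would actually look: each record on its own single line, with records from the same session adjacent. (The second query and the MOVIES_TOP table are made up for illustration.)

{"QUERY_TEXT": "insert into NETFLIX_DB.PUBLIC.MOVIES_FILTERED select m.* from MOVIES m where m.RATING > 5;", "DATABASE_NAME": "NETFLIX_DB", "SCHEMA_NAME": "PUBLIC", "SESSION_ID": "5c2f0a41-5d02-46f1-b9bd-ef80ad571013"}
{"QUERY_TEXT": "insert into NETFLIX_DB.PUBLIC.MOVIES_TOP select f.* from MOVIES_FILTERED f where f.RATING > 8;", "DATABASE_NAME": "NETFLIX_DB", "SCHEMA_NAME": "PUBLIC", "SESSION_ID": "5c2f0a41-5d02-46f1-b9bd-ef80ad571013"}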
Set up S3 bucket
Query files must be available in an S3 bucket. You can either upload these files to the Atlan deployment bucket or use your own S3 bucket.
Option 1: Use Atlan S3 bucket
To avoid access issues, it's recommended to upload the required files to the same S3 bucket as Atlan. Raise a support request to get the details of your Atlan bucket, and include the ARN of the IAM user or IAM role that access should be provisioned for.
To configure access, add the following IAM policy to the default EC2 instance role used by the Atlan EKS cluster.
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:ListBucket",
"s3:GetObject"
],
"Resource": [
"arn:aws:s3:::<bucket-name>",
"arn:aws:s3:::<bucket-name>/<prefix>/*"
]
}
]
}
- Replace `<bucket-name>` with the bucket where the data is uploaded.
- Replace `<prefix>` with the prefix (directory) where all the files have been uploaded.
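For example, with a hypothetical bucket named atlan-mining and a prefix query-history, the two Resource entries would read:

"Resource": [
  "arn:aws:s3:::atlan-mining",
  "arn:aws:s3:::atlan-mining/query-history/*"
]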
Option 2: Use your own S3 bucket
If you instead opt to use your own S3 bucket, you'll need to complete the following steps.
S3 buckets with VPC endpoints currently don't support cross-region requests. This may result in workflows not picking up objects from your bucket.
You'll first need to create a cross-account bucket policy giving Atlan's IAM role access to your bucket. A cross-account bucket policy is required since your Atlan tenant and S3 bucket may not always be deployed in the same AWS account. The permissions required for S3 bucket access include GetBucketLocation, ListBucket, and GetObject.
To create a cross-account bucket policy:
- Raise a support ticket to get the ARN of the Node Instance Role for your Atlan EKS cluster.
- Create a new policy to permit access by this ARN and update your bucket policy with the following:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Principal": {
        "AWS": "<role-arn>"
      },
      "Action": [
        "s3:GetBucketLocation",
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket-name>",
        "arn:aws:s3:::<bucket-name>/<prefix>/*"
      ]
    }
  ]
}

  - Replace `<role-arn>` with the role ARN of Atlan's node instance role.
  - Replace `<bucket-name>` with the name of the bucket you are creating.
  - Replace `<prefix>` with the name of the prefix (directory) within that bucket where you'll upload the files.
- Once the new policy has been set up, notify the support team. Your request must include the S3 bucket name and prefix. This must be done prior to setting up the workflow so that the support team can create and attach an IAM policy for your bucket to Atlan's IAM role.
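For example, if support returns a (hypothetical) role ARN such as arn:aws:iam::123456789012:role/atlan-node-instance-role, the Principal block would read:

"Principal": {
  "AWS": "arn:aws:iam::123456789012:role/atlan-node-instance-role"
}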
Update encryption policy
If your S3 bucket is encrypted with KMS, you'll need to update your KMS policy. This permits Atlan to decrypt objects in your S3 bucket.
- Provide the KMS key ARN and KMS key alias ARN to the Atlan support team. The KMS key that you provide must be a customer managed KMS key. (This is because you can only change the key policy for a customer managed KMS key, not for an AWS managed KMS key. Refer to AWS documentation to learn more.)
- To whitelist the ARN of Atlan's node instance role, update the KMS policy with the following:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Decrypt Cross Account",
      "Effect": "Allow",
      "Principal": {
        "AWS": "<role-arn>"
      },
      "Action": [
        "kms:Decrypt",
        "kms:DescribeKey"
      ],
      "Resource": "*"
    }
  ]
}

  - Replace `<role-arn>` with the role ARN of Atlan's node instance role.
Select S3 miner
To select the S3 miner:
- In the top right of any screen, navigate to New and then click New Workflow.
- From the filters along the top, click Miner.
- From the list of packages, select the miner for your source and click Setup Workflow.
Configure S3 miner
To configure the S3 miner:
- For Connection, select the connection to mine. (To select a connection, a crawler must have already run for that source.)
- For Miner extraction method, select S3.
- Enter details for your files:
  - For Bucket Name, enter the name of your S3 bucket or Atlan's bucket, including `s3://`.
  - For Bucket Prefix, enter the S3 prefix (directory) within the bucket where the files are located.
  - (Optional) For Bucket Region, enter the name of the S3 region in which the bucket exists.
  - For SQL Json key, enter the JSON key containing the SQL query value. (In the example here, this was `QUERY_TEXT`.)
  - For Default Database Json Key, enter the JSON key containing the name of the default database. (In the example here, this was `DATABASE_NAME`.)
  - For Default Schema Json Key, enter the JSON key containing the name of the default schema. (In the example here, this was `SCHEMA_NAME`.)
  - For Session ID Json Key, enter the JSON key containing the session ID under which the query ran. (In the example here, this was `SESSION_ID`.)
- (Optional) For Control Config, if Atlan support has provided you a custom control configuration, select Custom and enter the configuration into the Custom Config box. For example, you can enter `{"ignore-all-case": true}` to enable crawling assets with case-sensitive identifiers.
Run S3 miner
To run the S3 miner, after completing the steps above:
- To run the miner once, immediately, click Run at the bottom of the screen.
- To schedule the miner to run hourly, daily, weekly, or monthly, click Schedule & Run at the bottom of the screen.
Once the miner has completed running, you'll see lineage for your source's assets created by the queries in S3! 🎉
If your tenant runs on Azure and you store query files in Atlan's Azure bucket, no additional IAM or storage permissions are required.
Frequently asked questions
If I remove queries from S3 and run my miner, does it remove lineage?
No. Atlan doesn't remove lineage from older queries that are no longer in the bucket.
Does the miner reprocess files in the S3 prefix?
Yes. Atlan processes all files in the S3 prefix and publishes any new lineage generated. It's recommended to remove older files when updating the files in the S3 prefix.
Can I convert my existing miner to mine directly from source instead of S3?
Yes, just edit the workflow configuration. Alternatively, you can set up another miner for the same connection.
Are database and schema name parameters always required?
The DATABASE_NAME and SCHEMA_NAME fields can be set to null if that data is already available in the query. These properties are used as a fallback for queries run in the context of a certain database or schema.
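For example, a record for a fully qualified query could set both keys to null. (Shown pretty-printed here for readability; it must be a single line in the file.)

{
  "QUERY_TEXT": "insert into NETFLIX_DB.PUBLIC.MOVIES_FILTERED select m.* from NETFLIX_DB.PUBLIC.MOVIES m where m.RATING > 5;",
  "DATABASE_NAME": null,
  "SCHEMA_NAME": null,
  "SESSION_ID": "5c2f0a41-5d02-46f1-b9bd-ef80ad571013"
}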
Which SQL statements should I add for mining lineage?
You'll need to add DDL and DML statements to the S3 miner JSON file for mining lineage. SELECT isn't required since it's a DQL statement. Both UPDATE and DELETE can be based on values from another table, so include these statements to generate that lineage.
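For instance, a hypothetical UPDATE that reads from another table (the RATINGS table and the UPDATE ... FROM syntax here are made up for illustration) would produce lineage between the two tables:

{"QUERY_TEXT": "update NETFLIX_DB.PUBLIC.MOVIES_FILTERED f set RATING = r.RATING from NETFLIX_DB.PUBLIC.RATINGS r where f.ID = r.MOVIE_ID;", "DATABASE_NAME": null, "SCHEMA_NAME": null, "SESSION_ID": null}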