Generate lineage from S3 queries App
Atlan enables you to extract lineage from SQL query history stored in Amazon S3 by using the Generic Miner app. This lets you generate table-level and column-level lineage for an existing connection when the source system no longer exposes the complete query history or when query logs are retained externally for long-term storage.
Prerequisites
Before you begin, make sure you have:
- Access to the Generic Miner app. You can verify this by searching for Generic Miner in the Atlan marketplace. If you don't have access, contact Atlan support and request that the app be added to your tenant.
- A connection in Atlan for the warehouse or database whose queries you want to mine (for example, Snowflake, Postgres, or another supported engine).
- Query history files available in an S3 bucket and prefix. Follow the steps in Mine queries through S3 to prepare the JSON files and configure S3 access.
- The JSON structure for each query, including the SQL text and, optionally, default database, default schema, and session ID fields. See Mine queries through S3 for an example JSON structure.
Configure Miner workflow
-
In your Atlan workspace, go to the homepage and click New workflow in the top navigation bar.
-
From the workflow list, search for Generic Miner and then click Set up workflow.
-
In the Workflow name field, enter a descriptive name such as:
snowflake-prod-generic-miner -
For Connection qualified name, paste the qualified name of the connection you want to mine. You can copy this from the connection’s qualifiedName field in Atlan.
-
For Parsing language, select the SQL dialect that matches the source system that generated the queries, such as Amazon Athena, MySQL, Oracle, PostgreSQL, Snowflake SQL, or Trino. This helps Generic Miner parse statements accurately.
Configure S3 for mining
Once you configure the workflow basics, point Generic Miner to the S3 bucket and prefix that contain your query JSON files.
-
For Miner extraction method, select S3 (offline). Generic Miner always runs in offline mode and reads queries from files stored in S3 rather than connecting directly to the source.
-
In Bucket name, enter the name of the S3 bucket that contains your query files. This can be either Atlan’s managed bucket or your own bucket, depending on how you set up Mine queries through S3.
-
Under Bucket prefix, specify the prefix (folder path) under which all query files are stored. Use the same prefix that you configured when exporting queries to S3.
-
For Bucket region, provide the AWS region of your S3 bucket (for example,
us-east-1) if you use your own bucket. If you are using Atlan’s bucket, you can usually leave this field blank.
Map JSON keys for queries
Configure how Generic Miner reads individual fields from each JSON object so it can correctly interpret SQL statements and their execution context.
-
For SQL JSON key, enter the name of the JSON key that contains the SQL text for each query. The default value is
QUERY_TEXT.Use
QUERY_TEXTif your JSON follows the structure described in Mine queries through S3. If your JSON uses a different field name (for example,statementorquery), enter that key instead. The value must exactly match the key that contains the SQL string in your JSON. -
Under Default database JSON key, specify the JSON key that contains the default database used when object names in the SQL aren't fully qualified. The default value is
DATABASE_NAME.If your JSON stores the database name under a different key (for example,
default_db), enter that key instead. This value helps Generic Miner correctly resolve table and view references when object names aren't fully qualified. -
In Default schema JSON key, provide the JSON key that specifies the default schema name. A common value is
SCHEMA_NAME. This value helps Generic Miner qualify object names when the schema isn't explicitly specified in the SQL. -
Use Session ID JSON key to point to the JSON key that contains the session identifier for the query. A common value is
SESSION_ID.Grouping queries by session can improve lineage accuracy for related or sequential statements. This example shows a JSON object that uses the default key names.
{
"QUERY_TEXT": "insert into NETFLIX_DB.PUBLIC.MOVIES_FILTERED as select m.* from MOVIES m where m.RATING > 5;",
"DATABASE_NAME": "NETFLIX_DB",
"SCHEMA_NAME": "PUBLIC",
"SESSION_ID": "5c2f0a41-5d02-46f1-b9bd-ef80ad571013"
}
For more detailed guidance on preparing query files and S3 access, see Mine queries through S3.
Run and verify workflow
-
After completing the configuration, click Run to start the Generic Miner workflow. Wait for the workflow to finish. You can monitor the workflow status and review logs from the workflow run page.
-
To verify lineage, open one of the tables or views mined by the workflow in Atlan and navigate to the Lineage tab. Confirm that the upstream and downstream relationships include the queries mined from S3.
See also
- Mine queries through S3: Prepare query history files and configure S3 access.
- View lineage: Learn how to explore the lineage created by Generic Miner.