
AWS Athena + Apache Hive

Guide to deploying your AWS Athena query engine with an Apache Hive metastore on Atlan

πŸ’­ This article outlines the set-up, configuration, and credential access requirements for AWS Athena with Apache Hive.

Feel free to reach out to the Atlan team for guided sessions to help with your deployment.

πŸ“œ Prerequisite: Apache Hive

Please make sure you have the following information for your existing Hive metastore. You'll need it to set up the connection.

  • Details on any authentication setup on the Hive metastore

  • VPC details (see below)

There are two ways to establish a connection to the Atlan VPC:

  1. Private connection via VPC peering: the Atlan team will need the VPC CIDR range where Hive is deployed, to ensure it does not overlap with the Atlan VPC CIDR range.

  2. Public connection via whitelisting the Atlan deployment IP: the Atlan team will need the public IP of your Hive metastore. You will also need to whitelist the Atlan NAT gateway IP after deployment (see the sketch below).

Please get in touch with the Atlan team if you need any help with these details.
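For the public-connection option, whitelisting the Atlan NAT gateway IP usually means adding an inbound rule to the security group in front of your Hive metastore. Below is a minimal boto3 sketch of that rule; the security group ID, NAT gateway IP, and region are placeholders, and 9083 is only the default Hive metastore Thrift port, so adjust it if your deployment listens elsewhere.

import boto3

ec2 = boto3.client("ec2", region_name="<region>")

# Allow the Atlan NAT gateway IP (placeholder) to reach the Hive metastore port.
# 9083 is the default Hive metastore Thrift port; adjust if your setup differs.
ec2.authorize_security_group_ingress(
    GroupId="<hive_metastore_security_group_id>",
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 9083,
            "ToPort": 9083,
            "IpRanges": [
                {"CidrIp": "<atlan_nat_gateway_ip>/32", "Description": "Atlan NAT gateway"}
            ],
        }
    ],
)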

πŸš€ Deploying AWS Athena + Apache Hive on Atlan

You can use the Amazon Athena data connector for external Hive metastore to query data sets in Amazon S3 that use an Apache Hive metastore.

Atlan provides a CF template to connect Athena with a Hive metastore. You can use this template to either create a new Hive EMR instance or connect an existing one to Athena.

OR

In the Athena management console, you can configure a Lambda function to communicate with the Hive metastore that is in your private VPC, and then connect it to the metastore. The connection from Lambda to your Hive metastore is secured by a private Amazon VPC channel and does not use the public internet.

Read Amazon's guide for more information: Using Athena Data Connector for External Hive Metastore
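Whichever route you take, the end result is an Athena data catalog of type HIVE that points at the connector Lambda function. If you prefer scripting over the console, the boto3 sketch below shows that registration step, assuming the connector Lambda already exists; the catalog name and Lambda ARN are placeholders matching the ones used in the IAM policy later in this guide.

import boto3

athena = boto3.client("athena", region_name="<region>")

# Register the external Hive metastore connector as an Athena data catalog.
# <athena_catalog_name> and the Lambda ARN are placeholders for your own values.
athena.create_data_catalog(
    Name="<athena_catalog_name>",
    Type="HIVE",
    Description="External Hive metastore exposed through the Lambda connector",
    Parameters={
        "metadata-function": "arn:aws:lambda:<region>:<account_id>:function:<lambda_fn_name>"
    },
)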

Follow the steps below to establish a connection between your AWS Athena and Apache Hive.

STEP 1: Setting up a Lambda function

  • Ensure they are on the same VPC.

  • The Lambda function should be set up in the same Security group and subnet as EMR Hive.

Follow the steps in Amazon's guide: Connecting Athena to an Apache Hive Metastore
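Once the Lambda function has been created, you can confirm that its VPC settings line up with the EMR Hive cluster. A small boto3 sketch, with the function name as a placeholder:

import boto3

lambda_client = boto3.client("lambda", region_name="<region>")

# Inspect the connector function's VPC settings; the subnets and security groups
# listed here should match the ones used by the EMR Hive cluster.
config = lambda_client.get_function_configuration(FunctionName="<lambda_fn_name>")
vpc_config = config.get("VpcConfig", {})
print("Subnets:", vpc_config.get("SubnetIds"))
print("Security groups:", vpc_config.get("SecurityGroupIds"))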

STEP 2: Create Atlan <> Athena connection

πŸ”‘ IAM user policy

These are the permissions the credential holder needs so that Atlan can query data through Athena.

Resources to know beforehand:

  1. S3 buckets where your data resides: <data_bucket>. If you are unsure, add * in the Resource section of the IAM policy.

  2. Name of the Lambda function created in the previous step: <lambda_fn_name>.

  3. Athena spill location bucket: <athena_spill_bucket>. This can be any S3 bucket where Athena will store intermediate query results; the bucket is created if it does not exist.

  4. Athena Data Catalog name for the Hive connection: <athena_catalog_name>.

Don't forget to replace <account_id> and <region> as well.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "athena:GetTableMetadata",
        "athena:StartQueryExecution",
        "athena:GetQueryResults",
        "athena:GetDatabase",
        "athena:GetDataCatalog",
        "athena:GetNamedQuery",
        "athena:ListQueryExecutions",
        "athena:StopQueryExecution",
        "athena:GetQueryResultsStream",
        "lambda:InvokeFunction",
        "lambda:GetFunction",
        "athena:ListDatabases",
        "athena:GetQueryExecution",
        "athena:BatchGetNamedQuery",
        "athena:BatchGetQueryExecution"
      ],
      "Resource": [
        "arn:aws:lambda:<region>:<account_id>:function:<lambda_fn_name>",
        "arn:aws:athena:*:<account_id>:workgroup/*",
        "arn:aws:athena:<region>:<account_id>:datacatalog/<athena_catalog_name>"
      ]
    },
    {
      "Sid": "VisualEditor3",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketLocation",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:ListMultipartUploadParts",
        "s3:AbortMultipartUpload",
        "s3:CreateBucket",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::<athena_spill_bucket>",
        "arn:aws:s3:::<athena_spill_bucket>/*"
      ]
    },
    {
      "Sid": "VisualEditor1",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::<data_bucket>",
        "arn:aws:s3:::<data_bucket>/*"
      ]
    },
    {
      "Sid": "VisualEditor2",
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricAlarm",
        "sns:GetTopicAttributes",
        "cloudwatch:DeleteAlarms",
        "cloudwatch:DescribeAlarms",
        "sns:ListTopics"
      ],
      "Resource": "*"
    }
  ]
}
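One way to apply this policy, assuming you have saved the JSON above to a local file, is to create a managed policy and attach it to the IAM user whose access keys you will enter in Atlan. The file name, policy name, and user name below are illustrative placeholders.

import json
import boto3

iam = boto3.client("iam")

# Load the policy document saved from the JSON above (placeholder file name).
with open("atlan-athena-hive-policy.json") as f:
    policy_document = json.load(f)

# Create a managed policy and attach it to the IAM user whose credentials Atlan will use.
response = iam.create_policy(
    PolicyName="AtlanAthenaHiveAccess",
    PolicyDocument=json.dumps(policy_document),
)
iam.attach_user_policy(
    UserName="<atlan_iam_user>",
    PolicyArn=response["Policy"]["Arn"],
)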

STEP 3: Selecting the query engine

  1. Go to the crawler page for the respective integration.

  2. Click on the "Configure" button to configure the query engine.

  3. There are two options for the query engine: AWS Athena and Presto. Select "AWS Athena".


STEP 4: Providing credentials

  1. You will see an option to either select a preconfigured credential or create a credential. To set up a new connection, click on the "Create Credential" button.

  2. You will be required to fill in your Athena credentials.

  3. Once you have filled in those details, click on "Next".

  4. Your connection is now created.

Congratulations! You have now integrated Atlan with your Athena πŸŽ‰