Crawl Hive
Extract metadata assets from your Hive database into Atlan.
Prerequisites
Before you begin, verify you have:
- Configured the Hive permissions (and set up a private network link if using private network)
- Access to your Hive instance and credentials
- Reviewed the order of operations
Create crawler workflow
Create a new workflow and select Hive as your connector source.
- In the top right of any screen in Atlan, navigate to New > New Workflow.
- From the Marketplace page, click Hive Assets > Setup Workflow.
Configure extraction
When setting up metadata extraction from your Hive database, choose how Atlan connects and extracts metadata. Select the extraction method that best fits your organization's security and network requirements:
- Direct
- Agent
- Offline
Atlan SaaS connects directly to your Hive database (HiveServer2). This method supports Basic and Kerberos authentication and lets you test the connection before proceeding.
- Choose an authentication type that matches your Hive configuration.
- Basic authentication
- Kerberos authentication
- Select the Basic authentication type.
- For Host Name, enter the host name (or PrivateLink endpoint) for your Hive instance.
- For Port, enter the port number for your Hive instance (default: 10000).
- For Username, enter the username you created for that instance.
- For Password, enter the password for the username.
- For Default Schema, enter the default schema name for your Hive instance.
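Taken together, these fields map onto the pieces of a standard HiveServer2 JDBC URL, which can help you cross-check what you enter. A sketch with hypothetical host and schema values (the username and password are supplied separately):

```
jdbc:hive2://hive.internal.example.com:10000/default
```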
- Select the Kerberos authentication type.
- For Host Name, enter the host name (or PrivateLink endpoint) for your Hive instance.
- For Port, enter the port number for your Hive instance (default: 10000).
- For Kerberos Principal, enter your user principal in the format username@REALM (for example, [email protected]).
- For Service Name, enter the Hive service principal name (typically hive).
- For Keytab File, upload the keytab file you generated during setup.
- For krb5.conf File, upload your Kerberos configuration file.
- For Default Schema, enter the default schema name for your Hive instance.
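For reference, the krb5.conf file you upload typically looks like the following minimal sketch. The realm and KDC host below are hypothetical placeholders, assuming a single-realm setup; use the values from your own Kerberos environment:

```
[libdefaults]
    default_realm = EXAMPLE.COM

[realms]
    EXAMPLE.COM = {
        kdc = kdc.example.com
    }

[domain_realm]
    .example.com = EXAMPLE.COM
```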
- Select the security type:
- Default (No TLS)
- TLS
- mTLS
No additional configuration is required. The connection isn't encrypted. Use only in trusted internal networks.
- Select TLS as the security type.
- For CA Certificate File, upload the CA certificate that signed your HiveServer2's SSL certificate (PEM, CRT, or .zip archive).
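Before uploading, you can sanity-check that the certificate file is a parseable X.509 certificate. A minimal sketch using openssl; the self-signed certificate generated here is only a stand-in for your real CA file, and the file names are hypothetical:

```shell
# Generate a throwaway self-signed certificate as a stand-in (hypothetical names).
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=demo-ca" -keyout demo-ca.key -out demo-ca.pem

# A valid PEM/CRT certificate parses cleanly and prints its subject.
openssl x509 -in demo-ca.pem -noout -subject
```

If the second command errors instead of printing a subject line, the file is not in a format the upload will accept.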
- Select mTLS as the security type.
- For CA Certificate File, upload the CA certificate that signed your HiveServer2's SSL certificate.
- For Client Certificate File, upload your client certificate.
- For Client Key File, upload your client private key.
- For Client Key Passphrase, enter the passphrase if your client key is encrypted (optional).
- All certificates must be in PEM or CRT format, or uploaded as .zip archives.
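A common mTLS misconfiguration is uploading a client certificate and a private key that don't actually pair. One way to check, sketched here with openssl and hypothetical file names (the generated files are stand-ins for your real client certificate and key):

```shell
# Generate a throwaway client certificate and key as stand-ins (hypothetical names).
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -subj "/CN=demo-client" -keyout client.key -out client.pem

# The certificate and private key pair only if their public keys match.
cert_pub=$(openssl x509 -in client.pem -noout -pubkey)
key_pub=$(openssl pkey -in client.key -pubout)
[ "$cert_pub" = "$key_pub" ] && echo "certificate and key match"
```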
- Click Test Authentication to confirm connectivity to Hive. When successful, click Next to proceed with the connection configuration.
Self-Deployed Runtime runs within your organization and connects to your Hive database. This method keeps connections inside your network perimeter.
- Install Self-Deployed Runtime if you haven't already.
- Select the Agent tab and configure the Hive data source by adding the secret keys for your secret store.
- Choose an authentication type that matches your Hive configuration:
- Basic authentication
- Kerberos authentication
- Select the Basic authentication type.
- For Host Name, enter the host name (or PrivateLink endpoint) for your Hive instance.
- For Port, enter the port number for your Hive instance (default: 10000).
- For Username, enter the username you created for that instance (or reference the secret key where it's stored).
- For Password, enter the password for the username (or reference the secret key where it's stored).
- For Default Schema, enter the default schema name for your Hive instance.
- Select the Kerberos authentication type.
- For Host Name, enter the host name (or PrivateLink endpoint) for your Hive instance.
- For Port, enter the port number for your Hive instance (default: 10000).
- For Kerberos Principal, enter your user principal in the format username@REALM (for example, [email protected]).
- For Service Name, enter the Hive service principal name (typically hive).
- For Keytab File, upload the keytab file you generated during setup.
- For krb5.conf File, upload your Kerberos configuration file.
- For Default Schema, enter the default schema name for your Hive instance.
- Select the security type:
- Default (No TLS)
- TLS
- mTLS
No additional configuration is required. The connection isn't encrypted. Use only in trusted internal networks.
- Select TLS as the security type.
- For CA Certificate File, upload the CA certificate that signed your HiveServer2's SSL certificate (PEM, CRT, or .zip archive).
- Select mTLS as the security type.
- For CA Certificate File, upload the CA certificate that signed your HiveServer2's SSL certificate.
- For Client Certificate File, upload your client certificate.
- For Client Key File, upload your client private key.
- For Client Key Passphrase, enter the passphrase if your client key is encrypted, or reference the secret key (optional).
- All certificates must be in PEM or CRT format, or uploaded as .zip archives.
- Complete the configuration by following How to configure Self-Deployed Runtime for workflow execution.
- Click Next after completing the configuration.
Atlan can ingest metadata that you extract and upload to S3 using the offline extraction method. First extract the metadata yourself, then make it available in S3.
- For Bucket name, enter the name of your S3 bucket or Atlan's bucket.
- For Bucket prefix, enter the S3 prefix under which all the metadata files exist (for example, databases.json, columns-<database>.json).
- For Bucket region, enter the S3 region name.
- Click Next at the bottom of the screen.
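The bucket fields above imply an S3 layout like the following sketch. The bucket name and prefix are placeholders, and the file names are the ones from the prefix example; your extraction may produce additional files:

```
s3://<bucket-name>/<bucket-prefix>/
├── databases.json
└── columns-<database>.json
```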
Configure connection
Set up the connection name and access controls for your Hive data source in Atlan.
- Provide a Connection Name that represents your source environment. For example, you might use values like production, development, gold, or analytics.
- To change the users able to manage this connection, update the users or groups listed under Connection Admins. If you don't specify any user or group, nobody can manage the connection (not even admins).
- At the bottom of the screen, click Next to proceed.
Configure crawler
Before running the crawler, you can configure which assets to include or exclude. On the Metadata page:
- To exclude specific assets from crawling, click Exclude Metadata. If you don't specify any assets, none are excluded.
- To include specific assets in crawling, click Include Metadata. If you don't specify any assets, all assets are included.
If an asset appears in both the include and exclude filters, the exclude filter takes precedence.
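The precedence rule above can be sketched as a small function. This is a hypothetical illustration, not Atlan's actual implementation: exclude wins over include, and an empty include filter means everything is crawled.

```shell
# Hypothetical sketch of filter precedence (not Atlan's actual logic).
is_crawled() {
  asset="$1"; include="$2"; exclude="$3"   # include/exclude: space-separated lists
  case " $exclude " in *" $asset "*) echo no; return ;; esac   # exclude wins
  [ -z "$include" ] && { echo yes; return; }                   # empty include = all
  case " $include " in *" $asset "*) echo yes ;; *) echo no ;; esac
}

is_crawled sales_db "" "hr_db"       # -> yes (not excluded; include empty)
is_crawled hr_db    "sales_db" ""    # -> no  (include set, hr_db not in it)
is_crawled hr_db    "hr_db" "hr_db"  # -> no  (exclude takes precedence)
```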
Run crawler
- Direct
- Agent
- Offline
- Click Preflight checks to validate permissions and configuration before running the crawler. This helps identify any potential issues early.
- After the preflight checks pass, you can either:
- Click Run to run the crawler once immediately.
- Click Schedule Run to schedule the crawler to run hourly, daily, weekly, or monthly.
You can either:
- Click Run to run the crawler once immediately.
- Click Schedule Run to schedule the crawler to run hourly, daily, weekly, or monthly.
You can either:
- Click Run to run the crawler once immediately.
- Click Schedule Run to schedule the crawler to run hourly, daily, weekly, or monthly.
Once the crawler has completed running, you can see the assets on Atlan's asset page.
Once you have crawled assets from Hive, you can run the Hive Miner to mine query history through S3.
See also
- How Atlan connects to Hive: Connectivity, authentication, and data access patterns
- What does Atlan crawl from Hive: Metadata and assets discovered during crawling
- Preflight checks for Hive: Verify prerequisites before crawling
- Troubleshooting Hive connectivity: Resolve common connection issues