How Atlan connects to Hive
Atlan connects to your Hive database to extract technical metadata while maintaining network security and compliance. You can choose between Direct connectivity for databases reachable from the internet or Self-deployed runtime for databases that must remain behind your firewall.
Connect via direct network connection
Atlan's Hive workflow establishes a direct network connection to your database from the Atlan SaaS tenant. This approach works when your Hive database can accept connections from the internet.
Key characteristics of Direct connectivity:
- Atlan connects to your Hive database from the Atlan SaaS tenant over port 10000 (default HiveServer2 port)
- You provide connection details (hostname, port, credentials, certificates) when creating a crawler workflow
- For Kerberos authentication, Atlan must reach the KDC on port 88 (TCP/UDP) in addition to HiveServer2
- Atlan executes read-only SQL queries to discover your database structure
- Your Hive database accepts inbound network connections from Atlan's IP addresses
- All credentials, keytabs, and certificates are stored encrypted in Atlan Cloud
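Before creating the crawler workflow, you can verify that the required ports are open from the network where the connection will originate. The sketch below mirrors the ports listed above (HiveServer2 on 10000, the KDC on 88 for Kerberos); the hostnames are placeholders, and the helper functions are illustrative, not part of Atlan.

```python
import socket

def required_endpoints(hive_host, use_kerberos=False, kdc_host=None):
    """Return the (host, port) pairs that must be reachable, per the
    connectivity requirements above. Hostnames are placeholders."""
    endpoints = [(hive_host, 10000)]          # HiveServer2 default port
    if use_kerberos and kdc_host:
        endpoints.append((kdc_host, 88))      # Kerberos KDC (TCP; UDP is also used)
    return endpoints

def is_reachable(host, port, timeout=3.0):
    """Simple TCP connect test; returns True if the port accepts connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # List the endpoints to verify; run is_reachable() against each
    # from a host with the same network path as the connection source.
    for host, port in required_endpoints("hive.example.com",
                                         use_kerberos=True,
                                         kdc_host="kdc.example.com"):
        print(f"check {host}:{port}")
```

A `False` result from `is_reachable` usually points to a firewall rule or allowlist that still needs Atlan's IP addresses.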
Connect via self-deployed runtime
A runtime service deployed within your network acts as a secure bridge between Atlan Cloud and your Hive database. This approach works when your Hive database must remain fully isolated behind your firewall.
Key characteristics of Self-deployed runtime:
- A runtime service sits within your network perimeter, deployed on Docker Compose or Kubernetes
- The runtime maintains an outbound HTTPS connection to Atlan Cloud (port 443) and a local network connection to HiveServer2 (port 10000)
- When you create a crawler workflow, Atlan Cloud sends metadata extraction requests to the runtime
- The runtime translates requests into SQL queries, executes them on HiveServer2, and returns results to Atlan Cloud
- Your Hive database never exposes ports to the internet—all connections are initiated from within your network
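The request/response loop described above can be sketched with in-memory stand-ins for the two connections. `CloudChannel` and `run_readonly_query` are illustrative names, not Atlan's actual API; the point is the direction of traffic: the runtime polls outbound and nothing dials in.

```python
import queue

class CloudChannel:
    """Stand-in for the outbound HTTPS (port 443) channel to Atlan Cloud.
    The runtime initiates this connection; no inbound connection exists."""
    def __init__(self):
        self.requests = queue.Queue()
        self.results = []

    def next_request(self):
        return None if self.requests.empty() else self.requests.get_nowait()

    def send_result(self, result):
        self.results.append(result)

def run_readonly_query(sql):
    """Stand-in for the local connection to HiveServer2 on port 10000."""
    return {"sql": sql, "rows": [("db1",), ("db2",)]}  # canned metadata rows

def runtime_loop(channel):
    """Poll for extraction requests, execute them locally, return results.
    All traffic is initiated from inside the network perimeter."""
    while (req := channel.next_request()) is not None:
        channel.send_result(run_readonly_query(req["sql"]))

channel = CloudChannel()
channel.requests.put({"sql": "SHOW DATABASES"})
runtime_loop(channel)
print(channel.results[0]["rows"])  # metadata returns over the same outbound channel
```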
Connection details
Atlan's Hive connector uses a native Python driver to connect to HiveServer2—it doesn't use JDBC or require a Java runtime.
Driver and protocol
The connector communicates with HiveServer2 using the Impyla driver through SQLAlchemy. Impyla speaks the HiveServer2 Thrift protocol natively, supporting both binary and HTTP transport modes. This provides a direct, efficient connection without JDBC translation layers.
Connection pooling and sessions
During a metadata extraction workflow, the connector manages database connections with controlled concurrency:
- Connections: Each extraction task maintains a single dedicated connection to HiveServer2
- Concurrent sessions: Up to 15 extraction tasks run in parallel, each with its own connection
- Query pattern: Within each task, queries run one at a time (sequential DESCRIBE EXTENDED per table). Multiple tasks may query HiveServer2 at the same time.
- Schema processing: The connector processes one schema at a time, but fans out table extraction within each schema across parallel tasks
This design keeps HiveServer2 load predictable: each connection runs at most one query at a time.
Extraction fan-out architecture
The connector uses a parent-child workflow pattern to extract metadata efficiently while controlling the number of concurrent HiveServer2 connections.
Example: 10 schemas, 50 tables each
Consider a Hive instance with 10 schemas, each containing 50 tables. With the default batch size of 50 tables and a cap of 15 concurrent extraction tasks:
- The parent workflow starts with schema 1 and creates a child workflow for it.
- The child workflow groups all 50 tables into 1 batch (50 tables per batch).
- That batch activity opens 1 connection to HiveServer2 and runs DESCRIBE EXTENDED for each of the 50 tables sequentially.
- Once schema 1 completes, the parent moves to schema 2, and repeats the same process.
- This continues through all 10 schemas, one at a time.
In this scenario, the connector uses only one HiveServer2 connection at a time, because each schema fits in a single batch.
For larger schemas (for example, 500 tables per schema), the child workflow splits tables into 10 batches of 50. The Temporal worker dispatches up to 15 of these batches in parallel, each with its own connection. This means up to 10 concurrent connections for that schema (capped at 15 by the worker).
| Scenario | Schemas | Tables per schema | Batches per schema | Peak connections |
|---|---|---|---|---|
| Small | 10 | 50 | 1 | 1 |
| Medium | 10 | 200 | 4 | 4 |
| Large | 10 | 500 | 10 | 10 |
| Very large | 10 | 1,000 | 20 | 15 (capped) |
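The peak-connections column in the table follows from a simple formula: one connection per batch, capped by the worker's concurrency limit. A minimal sketch, using the default batch size and cap from above:

```python
import math

BATCH_SIZE = 50        # default tables per batch
MAX_CONCURRENT = 15    # worker concurrency cap

def peak_connections(tables_per_schema, batch_size=BATCH_SIZE,
                     max_concurrent=MAX_CONCURRENT):
    """Peak HiveServer2 connections while one schema is processed:
    one connection per batch, capped by the concurrency limit."""
    batches = math.ceil(tables_per_schema / batch_size)
    return min(batches, max_concurrent)

for tables in (50, 200, 500, 1000):
    print(tables, peak_connections(tables))
# 50 → 1, 200 → 4, 500 → 10, 1000 → 15 (capped)
```

Since schemas are processed one at a time, the cap applies per schema, not across the whole instance.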
Authentication methods
The connector supports multiple authentication mechanisms, all through the same Impyla driver:
- Basic authentication: Username and password, transmitted using the SASL PLAIN mechanism
- Kerberos (default): Keytab-based authentication using GSSAPI. The connector obtains a Kerberos ticket via kinit before connecting.
- Kerberos with TLS: Kerberos authentication with SSL/TLS encryption for the HiveServer2 connection
- Kerberos with mutual TLS (MTLS): Kerberos authentication with both server and client certificate verification
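As a sketch of how these mechanisms map onto driver options, the helper below builds keyword arguments in the style of Impyla's `connect()` (`auth_mechanism`, `kerberos_service_name`, `use_ssl`, `ca_cert` are real Impyla parameters). The helper itself is illustrative; Atlan's internal wiring differs, and the mutual-TLS case is omitted because client-certificate handling depends on the deployment.

```python
def impyla_connect_kwargs(auth, host, port=10000, user=None,
                          password=None, ca_cert=None):
    """Map the auth methods above to connect()-style keyword arguments.
    Illustrative only; not Atlan's actual configuration code."""
    kwargs = {"host": host, "port": port}
    if auth == "basic":
        # SASL PLAIN: username and password sent over the SASL layer
        kwargs.update(auth_mechanism="PLAIN", user=user, password=password)
    elif auth == "kerberos":
        # GSSAPI: assumes a valid ticket (e.g. kinit against a keytab)
        kwargs.update(auth_mechanism="GSSAPI", kerberos_service_name="hive")
    elif auth == "kerberos_tls":
        # Kerberos plus TLS: server certificate verified against a CA bundle
        kwargs.update(auth_mechanism="GSSAPI", kerberos_service_name="hive",
                      use_ssl=True, ca_cert=ca_cert)
    else:
        raise ValueError(f"unsupported auth method: {auth}")
    return kwargs

print(impyla_connect_kwargs("kerberos", "hive.example.com"))
```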
How it protects your data
Hive databases contain critical business data and operational information. Atlan's connection architecture protects your environment through multiple security layers.
Metadata extraction, not data replication
Atlan extracts only structural metadata—schemas, databases, tables, views, materialized views, columns, and their relationships. The actual business data in your tables remains in your Hive database.
For example, if you have a CUSTOMERS table with customer records, Atlan discovers:
- The table structure (table name, database, schema)
- Column definitions (column names, data types, nullability)
- Relationships (foreign keys, if configured)
Atlan never queries or stores the customer records themselves.
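To make "structural metadata only" concrete, here is a sketch of turning rows in the shape Hive's DESCRIBE output uses (column name, data type, comment) into column metadata. The sample rows are fabricated for illustration; no row of business data appears anywhere in the process.

```python
# Rows in the (col_name, data_type, comment) shape of Hive's DESCRIBE
# output; these sample rows are made up for illustration.
describe_rows = [
    ("customer_id", "bigint", "primary identifier"),
    ("email", "string", None),
    ("created_at", "timestamp", None),
]

def to_column_metadata(rows):
    """Keep only structure: column names, types, and comments.
    The table's actual records are never read or stored."""
    return [{"name": n, "type": t, "comment": c} for n, t, c in rows]

columns = to_column_metadata(describe_rows)
print([c["name"] for c in columns])  # ['customer_id', 'email', 'created_at']
```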
Read-only operations
All database queries are read-only statements such as SELECT, SHOW, and DESCRIBE. The connector can't:
- Modify data (INSERT, UPDATE, DELETE)
- Create or drop database objects
- Change any configuration
- Execute stored procedures or functions
- Grant or revoke permissions
The Hive user permissions you grant control exactly what the connector can access.
Credential encryption
Hive connection credentials are encrypted at rest and in transit:
Direct connectivity:
- Credentials are encrypted before storage in Atlan Cloud
- Encryption keys are managed by Atlan's key management system
- Credentials are decrypted only when establishing connections
Self-deployed runtime:
- Basic authentication credentials never leave your network perimeter
- The runtime retrieves credentials from your enterprise-managed secret vaults only when needed
- Kerberos keytabs and certificates are encrypted in Atlan Cloud storage
- File downloads from Atlan Cloud storage use encrypted channels
Network isolation with Self-deployed runtime
Your Hive database gains complete network isolation from the internet:
- The database only accepts connections from the runtime within your local network
- The runtime itself only makes outbound HTTPS connections to Atlan Cloud
- No inbound connections to your network are required
- Your network team can control runtime connectivity through firewall rules
Authentication security
With Kerberos authentication:
- No passwords are transmitted over the network
- Tickets are time-limited and automatically expire
- Keytabs provide secure, non-interactive authentication
- Mutual authentication verifies both client and server identity
With TLS/MTLS:
- All traffic is encrypted to prevent eavesdropping
- Server identity is verified through certificate validation
- Client identity is verified in MTLS configurations
- Certificate expiration enforces regular security reviews
See also
- Set up Hive: Configure authentication and permissions
- Crawl Hive: Configure and run extraction (including Agent deployment)
- Self-Deployed Runtime architecture: Core components and data flow
- Self-Deployed Runtime security: Security architecture, authentication, and encryption