How Atlan connects to Hive
Atlan connects to your Hive database to extract technical metadata while maintaining network security and compliance. You can choose between Direct connectivity for databases reachable from the internet or Self-deployed runtime for databases that must remain behind your firewall.
Connect via direct network connection
Atlan's Hive workflow establishes a direct network connection to your database from the Atlan SaaS tenant. This approach works when your Hive database can accept connections from the internet.
Key characteristics of Direct connectivity:
- Atlan connects to your Hive database from the Atlan SaaS tenant over port 10000 (default HiveServer2 port)
- You provide connection details (hostname, port, credentials, certificates) when creating a crawler workflow
- For Kerberos authentication, Atlan must reach the KDC on port 88 (TCP/UDP) in addition to HiveServer2
- Atlan executes read-only SQL queries to discover your database structure
- Your Hive database accepts inbound network connections from Atlan's IP addresses
- All credentials, keytabs, and certificates are stored encrypted in Atlan Cloud
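Before creating the crawler workflow, you can verify that the required ports are open from the network where the connection will originate. The sketch below mirrors the ports listed above (HiveServer2 on 10000, the KDC on 88 for Kerberos); the hostnames are placeholders, and the helper functions are illustrative, not part of Atlan.

```python
import socket

def required_endpoints(hive_host, use_kerberos=False, kdc_host=None):
    """Return the (host, port) pairs that must be reachable, per the
    connectivity requirements above. Hostnames are placeholders."""
    endpoints = [(hive_host, 10000)]          # HiveServer2 default port
    if use_kerberos and kdc_host:
        endpoints.append((kdc_host, 88))      # Kerberos KDC (TCP; UDP is also used)
    return endpoints

def is_reachable(host, port, timeout=3.0):
    """Simple TCP connect test; returns True if the port accepts connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # List the endpoints to verify; run is_reachable() against each
    # from a host with the same network path as the connection source.
    for host, port in required_endpoints("hive.example.com",
                                         use_kerberos=True,
                                         kdc_host="kdc.example.com"):
        print(f"check {host}:{port}")
```

A `False` result from `is_reachable` usually points to a firewall rule or allowlist that still needs Atlan's IP addresses.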
Connect via self-deployed runtime
A runtime service deployed within your network acts as a secure bridge between Atlan Cloud and your Hive database. This approach works when your Hive database must remain fully isolated behind your firewall.
Key characteristics of Self-deployed runtime:
- A runtime service sits within your network perimeter, deployed on Docker Compose or Kubernetes
- The runtime maintains an outbound HTTPS connection to Atlan Cloud (port 443) and a local network connection to HiveServer2 (port 10000)
- When you create a crawler workflow, Atlan Cloud sends metadata extraction requests to the runtime
- The runtime translates requests into SQL queries, executes them on HiveServer2, and returns results to Atlan Cloud
- Your Hive database never exposes ports to the internet—all connections are initiated from within your network
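The request/response loop described above can be sketched with in-memory stand-ins for the two connections. `CloudChannel` and `run_readonly_query` are illustrative names, not Atlan's actual API; the point is the direction of traffic: the runtime polls outbound and nothing dials in.

```python
import queue

class CloudChannel:
    """Stand-in for the outbound HTTPS (port 443) channel to Atlan Cloud.
    The runtime initiates this connection; no inbound connection exists."""
    def __init__(self):
        self.requests = queue.Queue()
        self.results = []

    def next_request(self):
        return None if self.requests.empty() else self.requests.get_nowait()

    def send_result(self, result):
        self.results.append(result)

def run_readonly_query(sql):
    """Stand-in for the local connection to HiveServer2 on port 10000."""
    return {"sql": sql, "rows": [("db1",), ("db2",)]}  # canned metadata rows

def runtime_loop(channel):
    """Poll for extraction requests, execute them locally, return results.
    All traffic is initiated from inside the network perimeter."""
    while (req := channel.next_request()) is not None:
        channel.send_result(run_readonly_query(req["sql"]))

channel = CloudChannel()
channel.requests.put({"sql": "SHOW DATABASES"})
runtime_loop(channel)
print(channel.results[0]["rows"])  # metadata returns over the same outbound channel
```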
Connection details
Atlan's Hive connector uses a native Python driver to connect to HiveServer2—it doesn't use JDBC or require a Java runtime.
Driver and protocol
The connector communicates with HiveServer2 using the Impyla driver through SQLAlchemy. Impyla speaks the HiveServer2 Thrift protocol natively, supporting both binary and HTTP transport modes. This provides a direct, efficient connection without JDBC translation layers.
Connection pooling and sessions
During a metadata extraction workflow, the connector manages database connections with controlled concurrency:
- Connections: Each extraction task maintains a single dedicated connection to HiveServer2
- Concurrent sessions: Up to 15 extraction tasks run in parallel, each with its own connection
- Query pattern: Within each task, queries run one at a time (sequential DESCRIBE EXTENDED per table). Multiple tasks may query HiveServer2 at the same time.
- Schema processing: The connector processes one schema at a time, but fans out table extraction within each schema across parallel tasks
This design keeps HiveServer2 load predictable: each connection runs at most one query at a time.
Extraction fan-out architecture
The connector uses a parent-child workflow pattern to extract metadata efficiently while controlling the number of concurrent HiveServer2 connections.
Example: 10 schemas, 50 tables each
Consider a Hive instance with 10 schemas, each containing 50 tables. With the default batch size of 50 tables and a cap of 15 concurrent extraction tasks:
- The parent workflow starts with schema 1 and creates a child workflow for it.
- The child workflow groups all 50 tables into 1 batch (50 tables per batch).
- That batch activity opens 1 connection to HiveServer2 and runs DESCRIBE EXTENDED for each of the 50 tables sequentially.
- Once schema 1 completes, the parent moves to schema 2, and repeats the same process.
- This continues through all 10 schemas, one at a time.
In this scenario, the connector uses only one HiveServer2 connection at a time, because each schema fits in a single batch.
For larger schemas (for example, 500 tables per schema), the child workflow splits tables into 10 batches of 50. The Temporal worker dispatches up to 15 of these batches in parallel, each with its own connection. This means up to 10 concurrent connections for that schema (capped at 15 by the worker).
| Scenario | Schemas | Tables per schema | Batches per schema | Peak connections |
|---|---|---|---|---|
| Small | 10 | 50 | 1 | 1 |
| Medium | 10 | 200 | 4 | 4 |
| Large | 10 | 500 | 10 | 10 |
| Very large | 10 | 1,000 | 20 | 15 (capped) |
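The peak-connections column in the table follows from a simple formula: one connection per batch, capped by the worker's concurrency limit. A minimal sketch, using the default batch size and cap from above:

```python
import math

BATCH_SIZE = 50        # default tables per batch
MAX_CONCURRENT = 15    # worker concurrency cap

def peak_connections(tables_per_schema, batch_size=BATCH_SIZE,
                     max_concurrent=MAX_CONCURRENT):
    """Peak HiveServer2 connections while one schema is processed:
    one connection per batch, capped by the concurrency limit."""
    batches = math.ceil(tables_per_schema / batch_size)
    return min(batches, max_concurrent)

for tables in (50, 200, 500, 1000):
    print(tables, peak_connections(tables))
# 50 → 1, 200 → 4, 500 → 10, 1000 → 15 (capped)
```

Since schemas are processed one at a time, the cap applies per schema, not across the whole instance.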
Authentication methods
The connector supports multiple authentication mechanisms, all through the same Impyla driver:
- Basic authentication: Username and password, transmitted using the SASL PLAIN mechanism
- Kerberos (default): Keytab-based authentication using GSSAPI. The connector obtains a Kerberos ticket via kinit before connecting.
- Kerberos with TLS: Kerberos authentication with SSL/TLS encryption for the HiveServer2 connection
- Kerberos with mutual TLS (MTLS): Kerberos authentication with both server and client certificate verification
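As a sketch of how these mechanisms map onto driver options, the helper below builds keyword arguments in the style of Impyla's `connect()` (`auth_mechanism`, `kerberos_service_name`, `use_ssl`, `ca_cert` are real Impyla parameters). The helper itself is illustrative; Atlan's internal wiring differs, and the mutual-TLS case is omitted because client-certificate handling depends on the deployment.

```python
def impyla_connect_kwargs(auth, host, port=10000, user=None,
                          password=None, ca_cert=None):
    """Map the auth methods above to connect()-style keyword arguments.
    Illustrative only; not Atlan's actual configuration code."""
    kwargs = {"host": host, "port": port}
    if auth == "basic":
        # SASL PLAIN: username and password sent over the SASL layer
        kwargs.update(auth_mechanism="PLAIN", user=user, password=password)
    elif auth == "kerberos":
        # GSSAPI: assumes a valid ticket (e.g. kinit against a keytab)
        kwargs.update(auth_mechanism="GSSAPI", kerberos_service_name="hive")
    elif auth == "kerberos_tls":
        # Kerberos plus TLS: server certificate verified against a CA bundle
        kwargs.update(auth_mechanism="GSSAPI", kerberos_service_name="hive",
                      use_ssl=True, ca_cert=ca_cert)
    else:
        raise ValueError(f"unsupported auth method: {auth}")
    return kwargs

print(impyla_connect_kwargs("kerberos", "hive.example.com"))
```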
How it protects your data
Hive databases contain critical business data and operational information. Atlan's connection architecture protects your environment through multiple security layers.
Metadata extraction, not data replication
Atlan extracts only structural metadata—schemas, databases, tables, views, materialized views, columns, and their relationships. The actual business data in your tables remains in your Hive database.
For example, if you have a CUSTOMERS table with customer records, Atlan discovers:
- The table structure (table name, database, schema)
- Column definitions (column names, data types, nullability)
- Relationships (foreign keys, if configured)
Atlan never queries or stores the customer records themselves.
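To make "structural metadata only" concrete, here is a sketch of turning rows in the shape Hive's DESCRIBE output uses (column name, data type, comment) into column metadata. The sample rows are fabricated for illustration; no row of business data appears anywhere in the process.

```python
# Rows in the (col_name, data_type, comment) shape of Hive's DESCRIBE
# output; these sample rows are made up for illustration.
describe_rows = [
    ("customer_id", "bigint", "primary identifier"),
    ("email", "string", None),
    ("created_at", "timestamp", None),
]

def to_column_metadata(rows):
    """Keep only structure: column names, types, and comments.
    The table's actual records are never read or stored."""
    return [{"name": n, "type": t, "comment": c} for n, t, c in rows]

columns = to_column_metadata(describe_rows)
print([c["name"] for c in columns])  # ['customer_id', 'email', 'created_at']
```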
Read-only operations
All database queries are read-only statements such as SELECT, SHOW, and DESCRIBE. The connector can't:
- Modify data (INSERT, UPDATE, DELETE)
- Create or drop database objects
- Change any configuration
- Execute stored procedures or functions
- Grant or revoke permissions
The Hive user permissions you grant control exactly what the connector can access.
Credential encryption
Hive connection credentials are encrypted at rest and in transit:
Direct connectivity:
- Credentials are encrypted before storage in Atlan Cloud
- Encryption keys are managed by Atlan's key management system
- Credentials are decrypted only when establishing connections
Self-deployed runtime:
- Basic authentication credentials never leave your network perimeter
- The runtime retrieves credentials from your enterprise-managed secret vaults only when needed
- Kerberos keytabs and certificates are encrypted in Atlan Cloud storage
- File downloads from Atlan Cloud storage use encrypted channels
Network isolation with Self-deployed runtime
Your Hive database gains complete network isolation from the internet:
- The database only accepts connections from the runtime within your local network
- The runtime itself only makes outbound HTTPS connections to Atlan Cloud
- No inbound connections to your network are required
- Your network team can control runtime connectivity through firewall rules
Authentication security
With Kerberos authentication:
- No passwords are transmitted over the network
- Tickets are time-limited and automatically expire
- Keytabs provide secure, non-interactive authentication
- Mutual authentication verifies both client and server identity
With TLS/MTLS:
- All traffic is encrypted to prevent eavesdropping
- Server identity is verified through certificate validation
- Client identity is verified in MTLS configurations
- Certificate expiration enforces regular security reviews
See also
- Set up Hive: Configure authentication and permissions
- Crawl Hive: Configure and run extraction (including Agent deployment)
- Self-Deployed Runtime architecture: Core components and data flow
- Self-Deployed Runtime security: Security architecture, authentication, and encryption