Query Engine

Options to enable querying on data lake integrations

Why do you need a query engine?

As the name suggests, data lakes are a way to store huge amounts of data cheaply. They provide a storage layer for your data, and tables in a data lake are nothing but prefixes/folders in the storage system. That's why it's important to have a query engine: an external system for searching the data scattered across the data lake.

To start querying a data lake, you need to configure a metastore and query engine.

  • A metastore stores all the metadata of the table, e.g. the table name, schema of the table, location of data, etc.

  • A query engine connects to the metastore and uses worker instances to search and fetch data from the data lake.
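
The division of labor between the two can be sketched in a few lines of Python. This is only a toy illustration, not how Atlan or any real engine is implemented: the in-memory `OBJECT_STORE`, the `METASTORE` dictionary, and the `scan_table` and `query` helpers are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class TableMetadata:
    """What a metastore keeps for one table (hypothetical structure)."""
    name: str
    schema: dict      # column name -> column type
    location: str     # storage prefix/folder that holds the table's files


# Stand-in for the storage layer: object key -> rows stored in that object.
OBJECT_STORE = {
    "lake/sales/orders/part-0.json": [{"order_id": 1, "amount": 40}],
    "lake/sales/orders/part-1.json": [{"order_id": 2, "amount": 75}],
    "lake/sales/customers/part-0.json": [{"customer_id": 9}],
}

# Stand-in for the metastore: table name -> metadata.
METASTORE = {
    "orders": TableMetadata(
        name="orders",
        schema={"order_id": "int", "amount": "int"},
        location="lake/sales/orders/",
    ),
}


def scan_table(table_name):
    """Resolve the table to its prefix via the metastore, then read every object under it."""
    meta = METASTORE[table_name]
    rows = []
    for key in sorted(OBJECT_STORE):
        if key.startswith(meta.location):
            rows.extend(OBJECT_STORE[key])
    return rows


def query(table_name, predicate):
    """Toy query engine: scan the table and keep the rows matching the predicate."""
    return [row for row in scan_table(table_name) if predicate(row)]


large_orders = query("orders", lambda row: row["amount"] > 50)
```

The storage layer never learns that `orders` is a table. Only the metastore entry ties the table name to the `lake/sales/orders/` prefix, which is why the query engine has to consult it first.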

Here are some examples of storage layers that need an external query engine to query the data:

  • AWS S3: Amazon Simple Storage Service

  • ADLS: Azure Data Lake Storage (Gen1 and Gen2)

  • Hive: Hive is an open-source metastore engine. It is generally used to store metadata related to your tables. Most production use cases require an external query engine to query the underlying data.

  • AWS Glue: AWS Glue is an AWS-managed, Hive-compatible metastore that contains table information for the data stored in your S3 account. This again requires an external query engine to query the actual data.

👀 Note: Hive can be used to store data and create native Hive tables, but this is not recommended.

Hive can also act as a query engine with its own workers. However, this is not effective, and there are better alternatives in the current data ecosystem.

Query engine support on Atlan

With Atlan, there are two options for connecting a query engine to your data lake: AWS Athena and Starburst Presto.

No matter which query engine you choose, you'll need a metastore. You also have two options for your metastore: AWS Glue and Hive.

Which query engine should you choose?

Atlan supports Delta Lake tables from the AWS S3 and ADLS Gen2 storage layers. It also supports discovery via Hive and Glue metastores.

Check out this support matrix to see the limitations and compatibility of different tools that Atlan supports.

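Source          | AWS Athena | Starburst Presto | Metastore configuration required? | AWS Glue | Hive
----------------|------------|------------------|-----------------------------------|----------|-----
Delta Lake S3   | ✅         | ✅               | ✅                                | ❌       | ✅
Delta Lake ADLS | ❌         | ✅               | ✅                                | ❌       | ✅
AWS Glue        | ✅         | ✅               | ❌                                | NA       | NA
Hive            | ✅         | ✅               | ❌                                | NA       | NA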

👀 Note: AWS Athena and AWS Glue only support S3 lake sources, so they cannot be combined to query Delta Lake on ADLS.

AWS Glue and Hive are themselves metastore engines. Hence, there is no need to connect an external metastore for the query engines.
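
If you want to check a planned setup programmatically, the support matrix can be encoded as a small lookup. The sketch below is not an Atlan API: `SUPPORT_MATRIX` and `is_supported` are hypothetical names, and the values are copied cell by cell from the matrix above, with `None` standing in for "NA".

```python
# Support matrix copied from the table above.
# True = supported/required, False = not supported/not required, None = NA.
SUPPORT_MATRIX = {
    "Delta Lake S3": {
        "AWS Athena": True, "Starburst Presto": True,
        "metastore_required": True, "AWS Glue": False, "Hive": True,
    },
    "Delta Lake ADLS": {
        "AWS Athena": False, "Starburst Presto": True,
        "metastore_required": True, "AWS Glue": False, "Hive": True,
    },
    "AWS Glue": {
        "AWS Athena": True, "Starburst Presto": True,
        "metastore_required": False, "AWS Glue": None, "Hive": None,
    },
    "Hive": {
        "AWS Athena": True, "Starburst Presto": True,
        "metastore_required": False, "AWS Glue": None, "Hive": None,
    },
}

QUERY_ENGINES = ("AWS Athena", "Starburst Presto")
METASTORES = ("AWS Glue", "Hive")


def is_supported(source, query_engine, metastore=None):
    """Return True if the matrix allows this source/engine/metastore combination."""
    if query_engine not in QUERY_ENGINES:
        raise ValueError(f"unknown query engine: {query_engine}")
    row = SUPPORT_MATRIX[source]
    if not row[query_engine]:
        return False
    if not row["metastore_required"]:
        # Glue and Hive sources are metastores themselves, so none is needed.
        return True
    if metastore not in METASTORES:
        return False
    return bool(row[metastore])
```

For example, `is_supported("Delta Lake ADLS", "AWS Athena", "Hive")` is `False`, because Athena only supports S3 sources. `is_supported("Hive", "Starburst Presto")` is `True` without naming a metastore.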