Options to enable querying on data lake integrations
Why do you need a query engine?
As the name suggests, data lakes are a way to store huge amounts of data cheaply. They provide a storage layer for your data, and tables in a data lake are nothing but prefixes/folders in the storage system. The storage layer only holds files; it does not search them. That's why you need a query engine: an external system for searching the data scattered across the data lake.
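To make the "tables are just prefixes" idea concrete, here is a minimal Python sketch. The object keys, the `warehouse/` root, and the `group_by_table` helper are all hypothetical and only illustrate the idea; they are not part of Atlan or of any storage system's API.

```python
from collections import defaultdict

# Hypothetical object keys, as a storage system might list them.
object_keys = [
    "warehouse/orders/date=2024-01-01/part-0000.parquet",
    "warehouse/orders/date=2024-01-02/part-0000.parquet",
    "warehouse/customers/part-0000.parquet",
]


def group_by_table(keys, root="warehouse/"):
    """Group object keys by the first folder under `root`.

    Assumption for this sketch: each folder directly under `root`
    corresponds to one table.
    """
    tables = defaultdict(list)
    for key in keys:
        if not key.startswith(root):
            continue  # not part of this (hypothetical) warehouse
        table_name = key[len(root):].split("/", 1)[0]
        tables[table_name].append(key)
    return dict(tables)


tables = group_by_table(object_keys)
for name, files in sorted(tables.items()):
    print(name, len(files))
```

Nothing in the storage layer itself knows that `orders` is a table with columns and types. That information has to live somewhere else, which is the gap the metastore and query engine fill.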
To start querying a data lake, you need to configure a metastore and a query engine.
A metastore stores the metadata of each table, e.g. the table name, the schema of the table, the location of the data, etc.
A query engine connects to the metastore and uses worker instances to search and fetch data from the data lake.
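The division of labor between the two components can be sketched in a few lines of Python. This is a toy, in-memory illustration and not how any real metastore or query engine is implemented: the `TableMetadata`, `Metastore`, and `QueryEngine` classes, the `object_store` dictionary, and the sample rows are all assumptions made for the example, and a single loop stands in for the worker instances.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableMetadata:
    """What a metastore keeps per table: name, schema, data location."""
    name: str
    columns: tuple
    location: str  # prefix/folder in the storage layer


class Metastore:
    """Hypothetical metastore: holds metadata only, never the data."""

    def __init__(self):
        self._tables = {}

    def register(self, table):
        self._tables[table.name] = table

    def get_table(self, name):
        return self._tables[name]


class QueryEngine:
    """Hypothetical engine: asks the metastore where a table lives,
    then scans the files under that prefix."""

    def __init__(self, metastore, object_store):
        self.metastore = metastore
        self.object_store = object_store

    def select(self, table_name, where=None):
        table = self.metastore.get_table(table_name)
        rows = []
        # A real engine would spread this scan across worker instances.
        for key in sorted(self.object_store):
            if not key.startswith(table.location):
                continue
            for raw in self.object_store[key]:
                row = dict(zip(table.columns, raw))
                if where is None or where(row):
                    rows.append(row)
        return rows


# Stand-in for the data lake: file path -> rows stored in that file.
object_store = {
    "warehouse/orders/part-0000.csv": [("1", "books", "12.50"), ("2", "games", "40.00")],
    "warehouse/orders/part-0001.csv": [("3", "books", "7.25")],
    "warehouse/customers/part-0000.csv": [("c1", "Ada")],
}

metastore = Metastore()
metastore.register(
    TableMetadata("orders", ("order_id", "category", "amount"), "warehouse/orders/")
)

engine = QueryEngine(metastore, object_store)
book_orders = engine.select("orders", where=lambda r: r["category"] == "books")
print([r["order_id"] for r in book_orders])  # prints ['1', '3']
```

Note that the metastore never touches the rows and the storage layer never sees the schema. The query engine is the only piece that combines the two, which is why both have to be configured before a data lake can be queried.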
Here are some examples of metastores that need an external query engine to query the underlying data:
Hive: Hive is an open-source metastore engine. It is generally used to store metadata related to your tables. Most production use cases require an external query engine to query the underlying data.
AWS Glue: AWS Glue is an AWS-managed, Hive-compatible metastore that holds table information for the data stored in your S3 account. It also requires an external query engine to query the actual data.
👀 Note: Hive can be used to store data and create native Hive tables, but this is not recommended.
Hive can also act as a query engine with its own workers. However, this is not efficient, and there are better alternatives in the current data ecosystem.
Query engine support on Atlan
With Atlan, there are two options for connecting a query engine to your data lake.