Data Profile

Let's examine the data available from your source

📋 What is data profiling?

As Ralph Kimball puts it, “Data profiling is the systematic analysis of the content of a data source...”

Data profiling is the process of examining, analyzing, and reviewing the data available in a source by collecting statistical information about the dataset's quality and hygiene. It is also called data archaeology, data assessment, data discovery, or data quality analysis.

Data profiling helps in determining the accuracy, completeness, structure, and quality of your data.

Data Profile Overview

The process of data profiling in Atlan involves 👇

  • Collecting descriptive statistics like minimum, maximum, mean, median, and standard deviation

  • Collecting data types, along with the minimum and maximum length

  • Determining the percentages of distinct or missing data

  • Tagging data with classification, descriptions, or glossary terms

  • Identifying frequency distributions and significant values

🚀 Superpower

Atlan gives you the freedom to customize your data quality report as needed. You can generate the profile for a selected percentage of the data (a sample) or for a filtered subset. You can also choose which metrics to check for each column in the table.

✨ Spotlight: You can also save the data quality (DQ) configuration for different cuts of the data and reuse it.

For example, suppose a table holds data from several brands and you want a quality report for just one of them. You can filter the data using the configuration options, save the configuration, and generate the report for that brand.

🏛️ Anatomy of a Data Profile

The Data Profile on Atlan can be viewed in two formats: Profile and Report. The Profile is a tabular version in a spreadsheet format, whereas the Report is a documented version that is easier to download and share.

The Data Profile has various components:

Frequency distribution chart

This shows how many times each distinct value appears in the data.
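A frequency distribution is simple to picture in code. The snippet below is a minimal sketch in plain Python, not Atlan's implementation; the function name `frequency_distribution` and the sample `brands` column are made up for illustration, and missing values are assumed to be represented as `None`.

```python
from collections import Counter

def frequency_distribution(values):
    """Count how many times each distinct non-missing value appears."""
    counts = Counter(v for v in values if v is not None)
    # most_common() returns (value, count) pairs, highest count first
    return counts.most_common()

# Hypothetical column with one missing value
brands = ["acme", "zen", "acme", None, "acme", "zen", "orbit"]
print(frequency_distribution(brands))
# [('acme', 3), ('zen', 2), ('orbit', 1)]
```

Each pair corresponds to one bar in the chart: the distinct value and the number of rows that contain it.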

Data types

The Data Profile highlights the data type of each variable in the data source. Each variable is assigned one of the following categories:

  1. Boolean

  2. Categorical

  3. Date

  4. Decimal

  5. Integer

  6. String

  7. Unknown

You can refer here for more information about data types.
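To get a feel for how a column might land in one of these seven categories, here is a rough sketch in plain Python. The rules are assumptions made for illustration only (this page does not describe how Atlan infers types): values arrive as strings, dates are in ISO format, and a column counts as Categorical when it has repeated values and at most `categorical_max_distinct` distinct ones. The names `infer_type` and `_parses` are hypothetical.

```python
from datetime import date

def _parses(converter, text):
    """Return True if converter(text) succeeds without a ValueError."""
    try:
        converter(text)
        return True
    except ValueError:
        return False

def infer_type(values, categorical_max_distinct=5):
    """Guess one of the seven categories for a column of raw string values."""
    # Ignore missing values (None or blank strings)
    present = [v.strip() for v in values if v is not None and v.strip()]
    if not present:
        return "Unknown"
    if all(v.lower() in {"true", "false"} for v in present):
        return "Boolean"
    if all(_parses(int, v) for v in present):
        return "Integer"
    if all(_parses(float, v) for v in present):
        return "Decimal"
    if all(_parses(date.fromisoformat, v) for v in present):
        return "Date"
    # Assumed rule: few distinct values, at least one of them repeated
    distinct = len(set(present))
    if distinct < len(present) and distinct <= categorical_max_distinct:
        return "Categorical"
    return "String"

print(infer_type(["1", "2", "3"]))              # Integer
print(infer_type(["1.5", "2"]))                 # Decimal
print(infer_type(["red", "blue", "red"]))       # Categorical
```

The order of the checks matters: every integer also parses as a decimal, so the stricter test runs first.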

Statistical metrics

The Data Profile is a collection of statistical metrics to help you understand and determine the accuracy, completeness, structure, and quality of your data. There are options to generate and review the statistical metrics for either the entire dataset, a sample of the source data, or just specific variables.

  1. Min

  2. Max

  3. Mean

  4. Median

  5. Standard deviation

  6. Minimum length

  7. Maximum length

  8. Distinct (%)

  9. Missing (%)
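The nine metrics above can be sketched for a single column with Python's standard library. This is an illustration, not Atlan's implementation, and it rests on a few assumptions: missing values are `None`, the standard deviation is the population form, lengths are measured on the string form of each value, and the distinct percentage is computed over all rows. The name `profile_column` is hypothetical.

```python
import statistics

def profile_column(values):
    """Compute the nine metrics for one column; None marks a missing value."""
    total = len(values)
    present = [v for v in values if v is not None]
    # Numeric metrics only use int/float values (bool is excluded on purpose)
    numbers = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    lengths = [len(str(v)) for v in present]
    return {
        "min": min(numbers) if numbers else None,
        "max": max(numbers) if numbers else None,
        "mean": statistics.mean(numbers) if numbers else None,
        "median": statistics.median(numbers) if numbers else None,
        "std_dev": statistics.pstdev(numbers) if numbers else None,
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        # Assumption: both percentages use the total row count as denominator
        "distinct_pct": 100 * len(set(present)) / total if total else None,
        "missing_pct": 100 * (total - len(present)) / total if total else None,
    }

# Hypothetical column: five rows, one of them missing
ages = [34, 41, None, 34, 29]
profile = profile_column(ages)
print(profile["mean"], profile["distinct_pct"], profile["missing_pct"])
# 34.5 60.0 20.0
```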

Classification

The Data Profile mentions the classification 🔖 tagged to each variable.

Terms

The Data Profile also contains the Glossary terms 🏷️ associated with each variable.

Run versions

The top toolbar shows the list of versions of the Data Profile generated so far. The different versions of the Data Profile report help you analyze and compare the changes.

Run details

This section showcases the run 🏃 and configuration details for the selected run version. This includes the rows and columns analyzed, data sample size, compute configuration, run duration, and status.

🛠️ Configuring and scheduling a Data Profile run

Atlan gives you the flexibility to configure the Data Profile run based on your needs.

While configuring the run, you have multiple options for generating a report. Let's go through each option and understand how they can be used.

🥃 STEP 1: Sampling and filtering

The first step while configuring a run is to decide the sample size or filter the data to be profiled. You have three options:

  1. Run on the full dataset: You can choose to generate the Data Profile on the entire dataset. Remember that the time and cost depend on the size of your data, so be mindful of that before profiling an entire dataset.

  2. Run on a sample: Another option is to generate the Data Profile for a specific sample size. You can choose the sample size and sampling method. Approximation algorithms like HyperLogLog++ and KLL Sketch are also used on metrics like distinct and median to get faster approximate results.

  3. Run based on filters: Filtering the data is another option. This will let you filter the data as needed and generate a specific report on the filtered dataset.
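The difference between sampling and filtering can be sketched in a few lines of Python. This is only an illustration under stated assumptions: rows are plain dictionaries, the sample is a simple random sample, and the helper names `sample_rows` and `filter_rows` are hypothetical. It does not implement the sampling methods Atlan offers or the approximation algorithms mentioned above.

```python
import random

def sample_rows(rows, fraction, seed=0):
    """Return a simple random sample containing roughly `fraction` of the rows."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be greater than 0 and at most 1")
    # Keep at least one row whenever the input is non-empty
    k = max(1, round(len(rows) * fraction)) if rows else 0
    return random.Random(seed).sample(rows, k)

def filter_rows(rows, **equals):
    """Keep only the rows whose fields equal every given keyword value."""
    return [r for r in rows
            if all(r.get(field) == value for field, value in equals.items())]

# Hypothetical table: ten rows alternating between two brands
rows = [{"brand": "acme" if i % 2 == 0 else "zen", "price": i} for i in range(10)]

sampled = sample_rows(rows, 0.3)            # 3 of the 10 rows, chosen at random
acme_only = filter_rows(rows, brand="acme")  # the 5 rows for one brand
print(len(sampled), len(acme_only))
# 3 5
```

A sample trades accuracy for speed across the whole table, while a filter gives exact metrics for the slice you care about, such as a single brand.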

✔️ STEP 2: Selecting columns and metrics

Once the sample size and the filter have been selected, it's time to select the columns and metrics. You can pick from the list of metrics; if you don't, all metrics are selected by default.

🛒 STEP 3: Select the resource pool

After selecting the sample size and columns, the next step is to specify the resources you want to allocate to the Data Profile run. The more resources you allocate, the quicker the run and the higher the cost.

There are 3 predefined resource tiers to pick from, or you can define the exact resources you want to allocate.

| Resource Tier | Bronze | Silver | Gold |
| --- | --- | --- | --- |
| Driver cores | 1 | 2 | 3 |
| Driver memory (in MB) | 1024 | 4096 | 8192 |
| Executor cores | 1 | 2 | 3 |
| Executor memory (in MB) | 1024 | 6144 | 11264 |
| Executor instances | 2 | 4 | 6 |

What are Drivers and Executors?

Driver: A Driver schedules tasks for the Executors to execute. Assign Driver resources based on the number of metrics to be generated.

Executor: An Executor is where metrics are actually generated. Assign Executor resources based on the data in the table.
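To compare what each predefined tier allocates in total, the figures from the resource tier table can be written down as data. The sketch below is illustrative only: the `ResourceTier` class and its method names are hypothetical, and the totals assume every Executor instance receives the per-Executor cores and memory shown in the table.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResourceTier:
    driver_cores: int
    driver_memory_mb: int
    executor_cores: int
    executor_memory_mb: int
    executor_instances: int

    def total_cores(self):
        """Driver cores plus the cores of all Executor instances."""
        return self.driver_cores + self.executor_cores * self.executor_instances

    def total_memory_mb(self):
        """Driver memory plus the memory of all Executor instances, in MB."""
        return self.driver_memory_mb + self.executor_memory_mb * self.executor_instances

# Values copied from the resource tier table
TIERS = {
    "Bronze": ResourceTier(1, 1024, 1, 1024, 2),
    "Silver": ResourceTier(2, 4096, 2, 6144, 4),
    "Gold": ResourceTier(3, 8192, 3, 11264, 6),
}

for name, tier in TIERS.items():
    print(name, tier.total_cores(), tier.total_memory_mb())
# Bronze 3 3072
# Silver 10 28672
# Gold 21 75776
```

Under these assumptions, Gold allocates seven times the cores of Bronze and roughly 25 times the memory, which is worth keeping in mind when weighing run time against cost.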

📅 STEP 4: Scheduling the Data Profile run

While assigning the resources to your Data Profile run, you can also schedule when the run will happen. You can choose the frequency, run time ⏱️ and timezone.

There are three options for frequency:

  1. Daily

  2. Weekly

  3. Monthly

Scheduling a Data Profile run ensures you always have an updated Data Profile report when you need it.

Once you've completed the configuration, click "Run Now" and your Data Profile run will begin. You can monitor the progress of the run from the "Run details" section.