Data Profile
Let's examine the data available from your source.

📋 What is data profiling?

As Ralph Kimball puts it, “Data profiling is the systematic analysis of the content of a data source...”
Data profiling is the process of examining, analyzing, and reviewing the data available in a source by collecting statistical information about the data set's quality and hygiene. This process is also called data archaeology, data assessment, data discovery, or data quality analysis.
Data profiling helps in determining the accuracy, completeness, structure, and quality of your data.
Data Profile Overview

The process of data profiling in Atlan involves 👇

    Collecting descriptive statistics like minimum, maximum, mean, median, and standard deviation.
    Collecting data types, along with the minimum and maximum length.
    Determining the percentages of distinct or missing data.
    Tagging data with classification, descriptions, or glossary terms.
    Identifying frequency distributions and significant values.

🚀 Data profiling superpowers

Atlan gives you the freedom to customize your data quality reports. You can generate the profile for a sample (a selected percentage of the data) or for a filtered subset. Atlan also lets you choose which metrics to check for each column in the table.
✨ Spotlight: You can also save and reuse the data quality configuration at different cuts (e.g. using different aggregations and filters, such as monthly, at a state level, or only for a selected category).
For example, say a table holds data from several categories, and you want a quality report for just one of them. You can filter the data using the configuration options, save the configuration, and generate the profile for the required category.

πŸ›οΈ Anatomy of a Data Profile

On selecting a configuration, you can view the latest generated metrics for all selected columns in Atlan's Data Profile. You can click and expand any metric to dive deep into its trend over up to the last 10 runs of the selected configuration.
The Data Profile has various components:

Frequency distribution chart

This shows how many times each distinct value appears in the data.
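
To make the idea concrete, here is a minimal sketch of a frequency distribution in plain Python. It is only an illustration, not Atlan's implementation; the function name `frequency_distribution` and the sample column are hypothetical, and skipping missing values is an assumption.

```python
from collections import Counter

def frequency_distribution(values):
    """Count how many times each distinct non-missing value appears."""
    # Assumption: missing values (None) are left out of the distribution
    counts = Counter(v for v in values if v is not None)
    # most_common() orders values from most to least frequent
    return counts.most_common()

# Hypothetical column of state codes with one missing entry
states = ["CA", "NY", "CA", None, "TX", "CA", "NY"]
distribution = frequency_distribution(states)

for value, count in distribution:
    print(f"{value}: {count}")
```

Sorting by descending count puts the most frequent values first, which is the order a frequency chart would typically use.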

Data types

The Data Profile highlights the data type of each variable in the data source. Below are the data type categories:
    1. Boolean
    2. Categorical
    3. Date
    4. Decimal
    5. Integer
    6. String
    7. Unknown
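
As a rough illustration of how a column of raw values could be sorted into these categories, here is a small sketch. The rules are assumptions made for the example, not Atlan's actual detection logic: the helper names, the ISO date format check, and the `categorical_max_distinct` threshold are all hypothetical.

```python
from datetime import date

def _parses(parser, value):
    """Return True if parser(value) succeeds without raising ValueError."""
    try:
        parser(value)
        return True
    except ValueError:
        return False

def infer_type(values, categorical_max_distinct=10):
    """Guess a data type category for a column of raw string values."""
    # Ignore missing entries (None or blank strings) when inferring the type
    present = [v.strip() for v in values if v is not None and v.strip()]
    if not present:
        return "Unknown"
    if all(v.lower() in {"true", "false"} for v in present):
        return "Boolean"
    if all(_parses(int, v) for v in present):
        return "Integer"
    if all(_parses(float, v) for v in present):
        return "Decimal"
    # Assumption: dates are written in ISO format, e.g. 2021-01-05
    if all(_parses(date.fromisoformat, v) for v in present):
        return "Date"
    distinct = len(set(present))
    # Assumed heuristic: repeated values drawn from a small set look categorical
    if distinct <= categorical_max_distinct and distinct < len(present):
        return "Categorical"
    return "String"

print(infer_type(["12", "7", None]))
print(infer_type(["red", "blue", "red"]))
```

The checks run from the strictest category to the loosest, so a column of whole numbers is reported as Integer rather than Decimal or String.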

Statistical metrics

The Data Profile is a collection of statistical metrics to help you determine the accuracy, completeness, structure, and quality of your data. You can generate and review the statistical metrics for the entire data set, a sample of the source data, or just specific variables.
    1. Min
    2. Max
    3. Mean
    4. Median
    5. Standard deviation
    6. Minimum length
    7. Maximum length
    8. Distinct (%)
    9. Missing (%)
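
The metrics above can be sketched with Python's standard library. This is an illustration under stated assumptions rather than the way Atlan computes them: the function names are hypothetical, the standard deviation is the sample standard deviation, and both percentages are taken relative to the total row count, missing rows included.

```python
import statistics

def profile_numeric(values):
    """Compute basic statistical metrics for a numeric column."""
    present = [v for v in values if v is not None]
    total = len(values)
    return {
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        # Assumption: sample (n - 1) standard deviation
        "stddev": statistics.stdev(present),
        # Assumption: percentages use the total row count as the denominator
        "distinct_pct": 100 * len(set(present)) / total,
        "missing_pct": 100 * (total - len(present)) / total,
    }

def profile_lengths(values):
    """Compute minimum and maximum length for a string column."""
    lengths = [len(v) for v in values if v is not None]
    return {"min_length": min(lengths), "max_length": max(lengths)}

# Hypothetical columns, each with one missing value
metrics = profile_numeric([10, 20, 20, None, 50])
lengths = profile_lengths(["Ann", "Roberto", None, "Li"])
print(metrics)
print(lengths)
```

Both functions expect at least one non-missing value (two for the standard deviation); a fuller version would need to handle empty columns.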

Classification

The Data Profile mentions the classification 🔖 tagged to each variable.

Terms

The Data Profile also contains the Glossary terms 🏷️ associated with each variable.

Run status

The top toolbar shows the status of the 5 latest runs for the selected Data Profile configuration. The schedule of each config is also clearly displayed when applicable.

Configuration details

This section showcases the configuration details for the selected configuration. This includes the rows and columns analyzed, data sample size, compute configuration, run duration, and status.

🛠️ Configuring and scheduling a Data Profile run

Atlan gives you the flexibility to configure a Data Profile run exactly as you need.
While configuring the run, you have multiple options for generating a report. Let's go through each option and understand how it can be used.

βš™οΈ STEP 0: Choosing an existing configuration

The list of your existing configurations for this asset will appear here, along with small details for additional context. You can select an existing config from the list and proceed to Run, Edit or Delete the configuration. Clicking on New Configuration will take you to the Config creation flow.

🥃 STEP 1: Sampling and filtering

The first step while configuring a run is deciding the sample size or filtering the data to be profiled. You have three options:
    1. Run on the full data set: You can choose to generate the Data Profile on the entire data set. The time and cost of this depend on the size of your data, so please keep that in mind before generating a report for an entire data set.
    2. Run on a sample: Another option is to generate the Data Profile for a specific sample. You can choose the sample size and sampling method. Approximation algorithms like HyperLogLog++ and KLL Sketch are also used on metrics like distinct and median to get faster approximate results.
    3. Run based on filters: The third option is to filter the data as needed and generate a specific report on the filtered data set.
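
The difference between running on a sample and running on filtered data can be sketched as follows. This is a simplified illustration on an in-memory list of rows, not how Atlan samples your source; `filter_rows`, `sample_rows`, and the row layout are hypothetical names, and simple random sampling stands in for whichever sampling method you choose.

```python
import random

def filter_rows(rows, **conditions):
    """Keep only rows whose fields equal every given condition."""
    return [r for r in rows if all(r.get(k) == v for k, v in conditions.items())]

def sample_rows(rows, percent, seed=None):
    """Draw a simple random sample covering roughly `percent` of the rows."""
    if not rows:
        return []
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    k = max(1, round(len(rows) * percent / 100))
    return rng.sample(rows, k)

# Hypothetical table: 100 rows, every fourth row in the "books" category
rows = [{"id": i, "category": "books" if i % 4 == 0 else "toys"} for i in range(100)]

books = filter_rows(rows, category="books")   # run based on filters
sample = sample_rows(rows, percent=20, seed=42)  # run on a sample

print(len(rows), len(books), len(sample))
```

A filter narrows the profile to rows you care about, while a sample trades some accuracy for a cheaper, faster run over the same kind of data.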

βœ”οΈ STEP 2: Selecting columns and metrics

Once the sample size and the filter have been selected, it's time to select the columns and metrics. You can pick from the list of metrics; if you don't, all metrics are selected by default.

🛒 STEP 3: Selecting the resource pool and scheduling the run

After selecting the sample size and columns, the next step is to specify the resources you want to allocate to the Data Profile run. For Integrations that support Great Expectations, Atlan will auto-recommend it to you, as it is faster and less resource intensive. However, we currently don't support calculating MinLength and MaxLength when using Great Expectations.
For integrations that only support Spark, you can pick and choose the resources you want to allocate - the more resources you allocate, the quicker the run and the higher the cost.
You can pick from 3 predefined resource tiers or define the exact resources you want to allocate.
| Resource Tier | Bronze | Silver | Gold |
| --- | --- | --- | --- |
| Driver cores | 1 | 2 | 3 |
| Driver memory (in MB) | 1024 | 4096 | 8192 |
| Executor cores | 1 | 2 | 3 |
| Executor memory (in MB) | 1024 | 6144 | 11264 |
| Executor instances | 2 | 4 | 6 |
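
To compare the tiers, it can help to add up what each one requests in total. The sketch below encodes the tier values above and sums them. The mapping to Spark properties such as `spark.executor.memory` is an assumption for illustration (these are standard Spark configuration keys, but this page does not say how Atlan passes resources to Spark), and the function names are hypothetical.

```python
# Tier values copied from the resource tier table
RESOURCE_TIERS = {
    "Bronze": {"driver_cores": 1, "driver_memory_mb": 1024,
               "executor_cores": 1, "executor_memory_mb": 1024, "executor_instances": 2},
    "Silver": {"driver_cores": 2, "driver_memory_mb": 4096,
               "executor_cores": 2, "executor_memory_mb": 6144, "executor_instances": 4},
    "Gold": {"driver_cores": 3, "driver_memory_mb": 8192,
             "executor_cores": 3, "executor_memory_mb": 11264, "executor_instances": 6},
}

def total_footprint(tier):
    """Sum driver and executor resources requested by a tier."""
    t = RESOURCE_TIERS[tier]
    return {
        "cores": t["driver_cores"] + t["executor_cores"] * t["executor_instances"],
        "memory_mb": t["driver_memory_mb"] + t["executor_memory_mb"] * t["executor_instances"],
    }

def to_spark_conf(tier):
    """Assumed mapping of a tier onto standard Spark configuration properties."""
    t = RESOURCE_TIERS[tier]
    return {
        "spark.driver.cores": str(t["driver_cores"]),
        "spark.driver.memory": f"{t['driver_memory_mb']}m",
        "spark.executor.cores": str(t["executor_cores"]),
        "spark.executor.memory": f"{t['executor_memory_mb']}m",
        "spark.executor.instances": str(t["executor_instances"]),
    }

for name in RESOURCE_TIERS:
    print(name, total_footprint(name))
```

For example, Gold requests 3 + 3 × 6 = 21 cores and 8192 + 11264 × 6 = 75776 MB of memory in total, compared with 3 cores and 3072 MB for Bronze.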
👀 Note: What are Drivers and Executors?
Driver: A Driver schedules tasks for the Executors to execute. Assign Driver resources based on the number of metrics to be generated.
Executor: An Executor is where metrics are actually generated. Assign Executor resources based on the data in the table.
While assigning the resources to your Data Profile run, you can also schedule when the run will happen. You can choose the frequency, run time ⏱️, and time zone.
There are three options for frequency:
    1. Daily
    2. Weekly
    3. Monthly
Scheduling a Data Profile run ensures that you will always have an updated Data Profile report when you need it.
Once you've completed your configuration, name your configuration and click on "Save and Run Now" or "Save and Close". Your Data Profile run will begin immediately or as scheduled based on the option you selected. You can now see the configuration you created in the configuration dropdown, and monitor the progress of the run.

🚛 Bulk data profiling

Configuring data quality runs table by table can be inefficient. The Bulk Profile feature in Atlan lets you generate the profile for multiple assets in a database or schema in one go. You can access this option from the Integrations profile, in the Bulk Profile sub-menu.
The configuration steps are the same for bulk and individual tables. Select the database, define the sample percentage or complete data, choose the metrics to calculate, and set a frequency (e.g. daily, weekly, monthly).
👀 Note: Atlan does not support column-level customization when bulk profiling, i.e. you can't choose which metrics to calculate for which column. The selected metrics will be calculated for all columns in the asset. Similarly, filtering a data set is not available at a bulk level.
When you click "Run Config" or when the scheduled time arrives, Atlan will start generating data profiles for all the selected tables. You can see the run status from the "Runs" tab.
After the bulk profiling has successfully run, select the "Independent Run" category to see the results at the asset level.