As Ralph Kimball puts it, “Data profiling is the systematic analysis of the content of a data source...”
Data profiling is the process of examining, analyzing, and reviewing the data available in a source by collecting statistical information about the dataset's quality and hygiene. This process is also called data archaeology, data assessment, data discovery, or data quality analysis.
Data profiling helps in determining the accuracy, completeness, structure, and quality of your data.
Collecting descriptive statistics like minimum, maximum, mean, median, and standard deviation
Collecting data types, along with the minimum and maximum length
Determining the percentages of distinct or missing data
Tagging data with classification, descriptions, or glossary terms
Identifying frequency distributions and significant values
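To make the activities above concrete, here is a minimal sketch of column-level profiling using only the Python standard library. It is not how Atlan computes its metrics; the function names `profile_numeric` and `profile_text`, the sample values, and the convention that `None` marks a missing value are all assumptions made for illustration.

```python
import statistics


def profile_numeric(values):
    """Profile one numeric column; None marks a missing value (assumption)."""
    present = [v for v in values if v is not None]
    total = len(values)
    return {
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),  # sample standard deviation
        "distinct_pct": 100 * len(set(present)) / total,
        "missing_pct": 100 * (total - len(present)) / total,
    }


def profile_text(values):
    """Report the minimum and maximum length of the non-missing strings."""
    lengths = [len(v) for v in values if v is not None]
    return {"min_length": min(lengths), "max_length": max(lengths)}


# Hypothetical sample columns
prices = [10, 20, 20, None, 50]
names = ["acme", "zen", None, "orbit"]

profile = profile_numeric(prices)
text_profile = profile_text(names)
```

In this sketch the percentages are computed against the total row count, so a column with one missing value out of five reports 20% missing.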
Atlan lets users customize their data quality report as needed. You can generate the profile for a selected percentage of the data (a sample) or for a subset of it. It also lets you choose which metrics to check for each column in the table.
For example, if a table holds data from different brands and you want a quality report for one specific brand, you can filter the data using the config options, save the filter, and generate the report for that brand.
The Data Profile on Atlan can be viewed in two formats: Profile and Report. The Profile is a tabular version in a spreadsheet format, whereas the Report is a documented version that is easier to download and share.
The Data Profile has various components:
This shows how many times each distinct value appears in the data.
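As a rough illustration of what a frequency distribution contains, the sketch below counts the occurrences of each distinct value with `collections.Counter`. The `brands` column and its values are hypothetical, and this is not Atlan's implementation.

```python
from collections import Counter

# Hypothetical column values
brands = ["acme", "zen", "acme", "orbit", "acme", "zen"]

# Count how many times each distinct value appears
frequencies = Counter(brands)

# List values from most to least frequent, with their share of all rows
for value, count in frequencies.most_common():
    share = 100 * count / len(brands)
    print(f"{value}: {count} ({share:.1f}%)")
```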
The Data Profile highlights the data type of each variable in the data source. Below are the categories that are showcased for each data type:
You can refer here for more information about data types.
The Data Profile is a collection of statistical metrics to help you understand and determine the accuracy, completeness, structure, and quality of your data. There are options to generate and review the statistical metrics for either the entire dataset, a sample of the source data, or just specific variables.
The Data Profile mentions the classification 🔖 tagged to each variable.
The Data Profile also contains the Glossary terms 🏷️ associated with each variable.
The top toolbar shows the list of versions of the Data Profile generated so far. The different versions of the Data Profile report help you analyze and compare the changes.
This section showcases the run 🏃 and configuration details for the selected run version. This includes the rows and columns analyzed, data sample size, compute configuration, run duration, and status.
Atlan gives you the flexibility to configure the Data Profile run based on your needs.
While configuring the run, you have multiple options for generating a report. Let's go through each option and understand how they can be used.
The first step while configuring a run is to decide the sample size or filter the data to be profiled. You have three options:
Run on the full dataset: You can choose to generate the Data Profile on the entire dataset. Remember that the time and cost of this run depend on the size of your data, so keep that in mind before profiling an entire dataset.
Run on a sample: Another option is to generate the Data Profile for a specific sample size. You can choose the sample size and sampling method. Approximation algorithms like HyperLogLog++ and KLL Sketch are also used for metrics such as distinct count and median to return faster, approximate results.
Run based on filters: Filtering the data is another option. This will let you filter the data as needed and generate a specific report on the filtered dataset.
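The three options above can be sketched as a single row-selection step that runs before any metrics are computed. This is an illustration only: the `select_rows` function, its parameters, and the sample rows are hypothetical, and simple random sampling with a fixed seed is an assumption, since the sampling methods Atlan offers are not listed here.

```python
import random


def select_rows(rows, mode, sample_fraction=None, predicate=None, seed=0):
    """Return the rows to profile for the chosen mode (hypothetical helper)."""
    if mode == "full":
        # Profile every row; cost grows with the size of the data
        return list(rows)
    if mode == "sample":
        # Simple random sample without replacement (assumed sampling method)
        k = max(1, round(len(rows) * sample_fraction))
        return random.Random(seed).sample(rows, k)
    if mode == "filter":
        # Keep only the rows that match the filter condition
        return [row for row in rows if predicate(row)]
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical table with data from different brands
rows = [
    {"brand": "acme", "price": 10},
    {"brand": "zen", "price": 12},
    {"brand": "acme", "price": 11},
    {"brand": "orbit", "price": 30},
    {"brand": "zen", "price": 14},
    {"brand": "acme", "price": 9},
]

full = select_rows(rows, "full")
sample = select_rows(rows, "sample", sample_fraction=0.5)
acme_only = select_rows(rows, "filter", predicate=lambda r: r["brand"] == "acme")
```

Fixing the seed makes the sample reproducible between runs, which helps when comparing two versions of a profile built from the same data.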
Once the sample size and filters have been selected, it's time to select the columns and metrics. You can pick from the list of metrics; if you don't, all metrics are selected by default.
After selecting the sample size and columns, the next step is to specify the resources you want to allocate to the Data Profile run. The more resources you allocate, the quicker the run and the higher the cost.
There are three predefined resource tiers to pick from, or you can define the exact resources you want to allocate.
Driver memory (in MB)
Executor memory (in MB)
Driver: A Driver schedules tasks for the Executors to execute. Assign Driver resources based on the number of metrics to be generated.
Executor: An Executor is where metrics are actually generated. Assign Executor resources based on the data in the table.
While assigning the resources to your Data Profile run, you can also schedule when the run will happen. You can choose the frequency, run time ⏱️ and timezone.
There are three options for frequency:
Scheduling a Data Profile run ensures you always have an updated Data Profile report when you need it.
Once you've completed the configuration, click on "Run Now" and your Data Profile run will begin. You can monitor the progress of the run from the "Run details" section.