As Ralph Kimball puts it, “Data profiling is the systematic analysis of the content of a data source...”
Data profiling refers to the process of examining, analyzing, and reviewing the data available in the source by collecting statistical information about the data set's quality and hygiene. This process is also called data archaeology, data assessment, data discovery, or data quality analysis.
Data profiling helps in determining the accuracy, completeness, structure, and quality of your data.
Collecting descriptive statistics like minimum, maximum, mean, median, and standard deviation.
Collecting data types, along with the minimum and maximum length.
Determining the percentages of distinct or missing data.
Tagging data with classification, descriptions, or glossary terms.
Identifying frequency distributions and significant values.
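To make these metrics concrete, here is a minimal sketch in plain Python of how a few of them could be computed for a single column. It is an illustration only, not Atlan's implementation: the function name `profile_column`, the sample data, and the choice to express distinct values as a percentage of all rows are assumptions.

```python
import statistics

def profile_column(values):
    """Compute a few profile metrics for one column; None marks a missing value."""
    total = len(values)
    if total == 0:
        return {}
    present = [v for v in values if v is not None]
    profile = {
        # Assumption: both percentages are taken over all rows, missing ones included.
        "missing_pct": 100 * (total - len(present)) / total,
        "distinct_pct": 100 * len(set(present)) / total,
    }
    if present and all(isinstance(v, (int, float)) for v in present):
        # Descriptive statistics for numeric columns.
        profile.update(
            min=min(present),
            max=max(present),
            mean=statistics.mean(present),
            median=statistics.median(present),
            stdev=statistics.stdev(present) if len(present) > 1 else 0.0,
        )
    elif present and all(isinstance(v, str) for v in present):
        # Minimum and maximum length for text columns.
        lengths = [len(v) for v in present]
        profile.update(min_length=min(lengths), max_length=max(lengths))
    return profile

# Hypothetical sample columns.
prices = [10, 20, 20, None, 50]
brands = ["acme", "globex", None, "acme", "initech"]
price_profile = profile_column(prices)
brand_profile = profile_column(brands)
```

In this sketch, one of the five `prices` values is missing, so `missing_pct` is 20.0, and the three distinct values give a `distinct_pct` of 60.0.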
Atlan gives you the freedom to customize your data quality reports. You can generate the profile for a selected percentage of the data (a sample) or for a filtered subset. Atlan also lets you choose the kind of metrics to check for each column in the table.
For example, say there is a table with data from different categories, and you want to generate the quality report for a specific category. What will you do? You can simply filter the data using configuration options, save it, and generate your profile for the required category.
On selecting a configuration, you can view the latest generated metrics for all selected columns in Atlan's Data Profile. You can click and expand to deep dive into the trends of every metric for up to the last 10 runs of the selected configuration.
The Data Profile has various components:
This shows how many times each distinct value appears in the data.
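As a rough illustration of what such a distribution contains, the sketch below counts distinct values with Python's standard `collections.Counter`. The function name `frequency_distribution`, the `top_n` parameter, and the sample values are hypothetical, and skipping missing values is an assumption.

```python
from collections import Counter

def frequency_distribution(values, top_n=3):
    """Return the top_n most frequent non-missing values with their counts."""
    counts = Counter(v for v in values if v is not None)
    return counts.most_common(top_n)

# Hypothetical column of category labels; None marks a missing value.
categories = ["shoes", "bags", "shoes", "hats", "shoes", "bags", None]
top_values = frequency_distribution(categories)
```

Here `top_values` lists each distinct value with the number of rows it appears in, most frequent first.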
The Data Profile highlights the data type of each variable in the data source. Below are the data type categories:
The Data Profile is a collection of statistical metrics to help you determine the accuracy, completeness, structure, and quality of your data. You can generate and review the statistical metrics for the entire data set, a sample of the source data, or just specific variables.
The Data Profile mentions the classification 🔖 tagged to each variable.
The Data Profile also contains the Glossary terms 🏷️ associated with each variable.
The top toolbar shows the status of the 5 latest runs for the selected Data Profile configuration. The schedule of each config is also clearly displayed when applicable.
This section showcases the configuration details for the selected configuration. This includes the rows and columns analyzed, data sample size, compute configuration, run duration, and status.
Atlan gives you the flexibility to configure a Data Profile run exactly as you need.
While configuring the run, you have multiple options for generating a report. Let's go through each option and understand how it can be used.
The list of your existing configurations for this asset will appear here, along with small details for additional context. You can select an existing config from the list and proceed to Run, Edit or Delete the configuration. Clicking on New Configuration will take you to the Config creation flow.
The first step while configuring a run is deciding the sample size or filtering the data to be profiled. You have three options:
Run on the full data set: You can choose to generate the Data Profile on the entire data set. The time and cost for this depend on the size of your data, so please be mindful of that if you generate a report for an entire data set.
Run on a sample: Another option is to generate the Data Profile for a specific sample. You can choose the sample size and sampling method. Approximation algorithms like HyperLogLog++ and KLL Sketch are also used on metrics like distinct and median to get faster approximate results.
Run based on filters: Another option is to filter the data as needed and generate a specific report on the filtered data set.
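The three options above can be sketched as a single row-selection step that runs before any metric is computed. This is a simplified illustration, not Atlan's implementation: `select_rows`, its parameters, and the use of seeded simple random sampling are assumptions, and the approximation algorithms mentioned above (HyperLogLog++, KLL Sketch) are not shown.

```python
import random

def select_rows(rows, mode="full", sample_pct=None, predicate=None, seed=0):
    """Pick the rows to profile: the full set, a random sample, or a filtered subset."""
    if mode == "full":
        return list(rows)
    if mode == "sample":
        if not rows:
            return []
        # Assumption: simple random sampling without replacement, at least one row.
        k = min(len(rows), max(1, round(len(rows) * sample_pct / 100)))
        return random.Random(seed).sample(rows, k)
    if mode == "filter":
        return [row for row in rows if predicate(row)]
    raise ValueError(f"unknown mode: {mode!r}")

# Hypothetical table with 1,000 rows in two categories.
rows = [{"id": i, "category": "shoes" if i % 4 == 0 else "bags"} for i in range(1000)]

full_rows = select_rows(rows)
sampled_rows = select_rows(rows, mode="sample", sample_pct=10)
shoe_rows = select_rows(rows, mode="filter", predicate=lambda r: r["category"] == "shoes")
```

Profiling `sampled_rows` touches a tenth of the data, which is where the savings in time and cost come from; profiling `shoe_rows` corresponds to generating a report for one category only.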
Once the sample size and the filter have been selected, it's time for you to select the columns and metrics. You can select from the list of metrics, or Atlan will select all of them by default.
After selecting the sample size and columns, the next step is to specify the resources you want to allocate to the Data Profile run. For integrations that support Great Expectations, Atlan will auto-recommend it to you, as it is faster and less resource-intensive. However, we currently don't support calculating MinLength and MaxLength when using Great Expectations.
For integrations that only support Spark, you can pick and choose the resources you want to allocate. The more resources you allocate, the quicker the run and the higher the cost.
You can pick from 3 predefined resource tiers or define the exact resources you want to allocate.
Driver memory (in MB)
Executor memory (in MB)
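For readers familiar with Spark, memory settings like these typically end up as the standard Spark properties `spark.driver.memory` and `spark.executor.memory`. The sketch below shows one way a predefined tier or custom values could be turned into those properties. The tier names and sizes are hypothetical, and whether Atlan maps the settings exactly this way is an assumption.

```python
# Hypothetical tier names and sizes, for illustration only.
RESOURCE_TIERS = {
    "small": {"driver_memory_mb": 1024, "executor_memory_mb": 2048},
    "medium": {"driver_memory_mb": 2048, "executor_memory_mb": 4096},
    "large": {"driver_memory_mb": 4096, "executor_memory_mb": 8192},
}

def spark_memory_conf(tier=None, driver_memory_mb=None, executor_memory_mb=None):
    """Build Spark memory properties from a tier, custom values, or both."""
    settings = dict(RESOURCE_TIERS[tier]) if tier else {}
    # Explicit values override whatever the tier provides.
    if driver_memory_mb is not None:
        settings["driver_memory_mb"] = driver_memory_mb
    if executor_memory_mb is not None:
        settings["executor_memory_mb"] = executor_memory_mb
    missing = {"driver_memory_mb", "executor_memory_mb"} - settings.keys()
    if missing:
        raise ValueError(f"missing settings: {sorted(missing)}")
    # Spark accepts memory sizes with an "m" suffix for mebibytes.
    return {
        "spark.driver.memory": f"{settings['driver_memory_mb']}m",
        "spark.executor.memory": f"{settings['executor_memory_mb']}m",
    }

tier_conf = spark_memory_conf(tier="medium")
custom_conf = spark_memory_conf(driver_memory_mb=1536, executor_memory_mb=3072)
```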
While assigning the resources to your Data Profile run, you can also schedule when the run will happen. You can choose the frequency, run time ⏱️, and time zone.
There are three options for frequency:
Scheduling a Data Profile run ensures that you will always have an updated Data Profile report when you need it.
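To show how a run time and time zone combine into a concrete schedule, here is a small sketch that computes the next run for a daily schedule. It only illustrates the idea and does not reflect how Atlan schedules runs: `next_daily_run` is a hypothetical helper, only a daily frequency is covered, and a fixed UTC offset stands in for a named time zone.

```python
from datetime import datetime, time, timedelta, timezone

def next_daily_run(now, run_time, tz):
    """Return the next occurrence of run_time in time zone tz, strictly after now."""
    local_now = now.astimezone(tz)
    candidate = datetime.combine(local_now.date(), run_time, tzinfo=tz)
    if candidate <= local_now:
        # Today's slot has already passed, so schedule for tomorrow.
        candidate += timedelta(days=1)
    return candidate

# Assumption: a fixed +05:30 offset stands in for the user's selected time zone.
IST = timezone(timedelta(hours=5, minutes=30))
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # 17:30 local time
scheduled = next_daily_run(now, time(9, 0), IST)
```

Because 09:00 local time has already passed at `now`, `scheduled` falls on the following day at 09:00 in the chosen time zone.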
Once you've completed your configuration, name your configuration and click on "Save and Run Now" or "Save and Close". Your Data Profile run will begin immediately or as scheduled based on the option you selected. You can now see the configuration you created in the configuration dropdown, and monitor the progress of the run.
Often you might find it inefficient to configure data quality runs for individual tables. The Bulk Profile feature in Atlan lets you generate the profile for multiple assets in a database or schema in one go. You can access this option from the Integrations profile, in the Bulk Profile submenu.
The configuration steps are the same for bulk and individual tables. Select the database, choose a sample percentage or the complete data set, choose the metrics to calculate, and set a frequency (e.g. daily, weekly, monthly).
When you click "Run Config" or when the scheduled time arrives, Atlan will start generating data profiles for all the selected tables. You can see the run status from the "Runs" tab.
After the bulk profiling has successfully run, select the "Independent Run" category to see the results at the asset level.