Manage column profiling information
Profiling gives additional context to columns in relational stores. From profiling, you can see various summarized information such as:
- numerical statistics (min, max, mean, median, standard deviation, sum, variance) for numeric columns
- minimum, maximum, and average lengths for string columns
- distinct value counts and percentages
- missing value counts and percentages
Profiling is only available on columns
You will only be able to populate this summary information on columns, not on other assets in Atlan.
Retrieve profiles
Since profiles are only available on columns, you will need to retrieve column assets to see the profiles:
- Java
- Python
- Kotlin
- Raw REST API
Retrieve profiles
Column column = Column.get(client, // (1)
"default/hive/1657025257/OPS/DEFAULT/RUN_STATS/STATUS", true); // (2)
column.getColumnDistinctValuesCount(); // (3)
column.getColumnUniqueValuesCount();
column.getColumnUniquenessPercentage();
column.getColumnDuplicateValuesCount();
column.getColumnMissingValuesCount();
column.getColumnMissingValuesPercentage();
column.getColumnMax(); // (4)
column.getColumnMin();
column.getColumnMean();
column.getColumnMedian();
column.getColumnStandardDeviation();
column.getColumnVariance();
column.getColumnSum();
column.getColumnMinimumStringLength(); // (5)
column.getColumnMaximumStringLength();
column.getColumnAverageLength();
- Use the
get()method to retrieve all details about a specific column. Because this operation will retrieve the asset from Atlan, you must provide it anAtlanClientthrough which to connect to the tenant. - Provide the full, case-sensitive qualifiedName of the column.
- Some profile information is common, regardless of the data type of the column.
- Some profile information is specific to numeric columns.
- Some profile information is specific to string columns.
Retrieve profiles
from pyatlan.client.atlan import AtlanClient
from pyatlan.model.assets import Column
client = AtlanClient()
column = client.asset.get_by_qualified_name( # (1)
qualified_name="default/hive/1657025257/OPS/DEFAULT/RUN_STATS/STATUS", # (2)
asset_type=Column
)
column.column_distinct_values_count # (3)
column.column_unique_values_count
column.column_uniqueness_percentage
column.column_duplicate_values_count
column.column_missing_values_count
column.column_missing_values_percentage
column.column_max # (4)
column.column_min
column.column_mean
column.column_median
column.column_standard_deviation
column.column_variance
column.column_sum
column.column_minimum_string_length # (5)
column.column_maximum_string_length
column.column_average_length
- Use the
get_by_qualified_name()method to retrieve all details about a specific column. - Provide the full, case-sensitive qualifiedName of the column.
- Some profile information is common, regardless of the data type of the column.
- Some profile information is specific to numeric columns.
- Some profile information is specific to string columns.
Retrieve profiles
val column = Column.get(client, // (1)
"default/hive/1657025257/OPS/DEFAULT/RUN_STATS/STATUS") // (2)
column.columnDistinctValuesCount // (3)
column.columnUniqueValuesCount
column.columnUniquenessPercentage
column.columnDuplicateValuesCount
column.columnMissingValuesCount
column.columnMissingValuesPercentage
column.columnMax // (4)
column.columnMin
column.columnMean
column.columnMedian
column.columnStandardDeviation
column.columnVariance
column.columnSum
column.columnMinimumStringLength // (5)
column.columnMaximumStringLength
column.columnAverageLength
- Use the
get()method to retrieve all details about a specific column. Because this operation will retrieve the asset from Atlan, you must provide it anAtlanClientthrough which to connect to the tenant. - Provide the full, case-sensitive qualifiedName of the column.
- Some profile information is common, regardless of the data type of the column.
- Some profile information is specific to numeric columns.
- Some profile information is specific to string columns.
GET /api/meta/entity/uniqueAttribute/type/Column?attr:qualifiedName=default%2Fhive%2F1657025257%2FOPS%2FDEFAULT%2FRUN_STATS%2FSTATUS
{
"entity": { // (1)
"typeName": "Column", // (2)
"attributes": { // (3)
"name": "STATUS",
"qualifiedName": "default/hive/1657025257/OPS/DEFAULT/RUN_STATS/STATUS",
"columnDistinctValuesCount": 123, // (4)
"columnUniqueValuesCount": 123,
"columnUniquenessPercentage": 50.0,
"columnDuplicateValuesCount": 123,
"columnMissingValuesCount": 123,
"columnMissingValuesPercentage": 50.0,
"columnMax": 321.0,
"columnMin": 1.0,
"columnMean": 123.0,
"columnMedian": 123.0,
"columnStandardDeviation": 3.0,
"columnVariance": 1.0,
"columnSum": 654321.0,
"columnMinimumStringLength": 0,
"columnMaximumStringLength": 123,
"columnAverageLength": 123.0
}
}
}
- All column details will come back wrapped in a top-level
entityobject in the payload. - The
typeNamewill always beColumn. - The detailed profiling information will be embedded in the
attributesobject within the outerentityobject. - The column profiling details have names that all start with
column...
Add your own profiles
In cases where Atlan doesn't profile the source, you may want to add your own profiles. You can do this by either adding the profile when creating the column (programmatically) or by updating the attributes on an existing column:
- Java
- Python
- Kotlin
- Raw REST API
Add or update profiles
Column column = Column.updater( // (1)
"default/hive/1657025257/OPS/DEFAULT/RUN_STATS/STATUS", // (2)
"STATUS") // (3)
.columnDistinctValuesCount(123) // (4)
.columnUniqueValuesCount(123)
.columnUniquenessPercentage(50.0)
.columnDuplicateValuesCount(123)
.columnMissingValuesCount(123)
.columnMissingValuesPercentage(50.0)
.columnMax(321.0) // (5)
.columnMin(1.0)
.columnMean(123.0)
.columnMedian(123.0)
.columnStandardDeviation(3.0)
.columnVariance(1.0)
.columnSum(654321.0)
.columnMinimumStringLength(0) // (6)
.columnMaximumStringLength(123)
.columnAverageLength(123.0)
.build(); // (7)
AssetMutationResponse response = column.save(client); // (8)
- Use the
updater()method to update an existing column asset (for more details, see Updating an asset). - Provide the full, case-sensitive qualifiedName of the column.
- Provide the case-sensitive name of the column.
- Some profile information is common, regardless of the data type of the column. All are optional, so fill in only the pieces you want or for which you have the information.
- Some profile information is specific to numeric columns. All are optional, so fill in only the pieces you want or for which you have the information.
- Some profile information is specific to string columns. All are optional, so fill in only the pieces you want or for which you have the information.
- Use the
build()method to construct the column object to be updated in Atlan. - Then call the
save()method against this built-up object to actually apply the update to Atlan. Because this operation will persist the asset in Atlan, you must provide it anAtlanClientthrough which to connect to the tenant.
Retrieve profiles
from pyatlan.client.atlan import AtlanClient
from pyatlan.model.assets import Column
client = AtlanClient()
column = Column.updater( # (1)
qualified_name="default/hive/1657025257/OPS/DEFAULT/RUN_STATS/STATUS", # (2)
name="STATUS" # (3)
)
column.column_distinct_values_count = 123 # (4)
column.column_unique_values_count = 123
column.column_uniqueness_percentage = 50.0
column.column_duplicate_values_count = 123
column.column_missing_values_count = 123
column.column_missing_values_percentage = 50.0
column.column_max = 321.0 # (5)
column.column_min = 1.0
column.column_mean = 123.0
column.column_median = 123.0
column.column_standard_deviation = 3.0
column.column_variance = 1.0
column.column_sum = 654321.0
column.column_minimum_string_length = 0 # (6)
column.column_maximum_string_length = 123
column.column_average_length = 123.0
response = client.asset.save(column) # (7)
- Use the
updater()method to update an existing column asset (for more details, see Updating an asset). - Provide the full, case-sensitive qualified_name of the column.
- Provide the case-sensitive name of the column.
- Some profile information is common, regardless of the data type of the column. All are optional, so fill in only the pieces you want or for which you have the information.
- Some profile information is specific to numeric columns. All are optional, so fill in only the pieces you want or for which you have the information.
- Some profile information is specific to string columns. All are optional, so fill in only the pieces you want or for which you have the information.
- Then call the
save()method with this built-up object to actually apply the update to Atlan.
Add or update profiles
val column = Column.updater( // (1)
"default/hive/1657025257/OPS/DEFAULT/RUN_STATS/STATUS", // (2)
"STATUS") // (3)
.columnDistinctValuesCount(123) // (4)
.columnUniqueValuesCount(123)
.columnUniquenessPercentage(50.0)
.columnDuplicateValuesCount(123)
.columnMissingValuesCount(123)
.columnMissingValuesPercentage(50.0)
.columnMax(321.0) // (5)
.columnMin(1.0)
.columnMean(123.0)
.columnMedian(123.0)
.columnStandardDeviation(3.0)
.columnVariance(1.0)
.columnSum(654321.0)
.columnMinimumStringLength(0) // (6)
.columnMaximumStringLength(123)
.columnAverageLength(123.0)
.build() // (7)
val response = column.save(client) // (8)
- Use the
updater()method to update an existing column asset (for more details, see Updating an asset). - Provide the full, case-sensitive qualifiedName of the column.
- Provide the case-sensitive name of the column.
- Some profile information is common, regardless of the data type of the column. All are optional, so fill in only the pieces you want or for which you have the information.
- Some profile information is specific to numeric columns. All are optional, so fill in only the pieces you want or for which you have the information.
- Some profile information is specific to string columns. All are optional, so fill in only the pieces you want or for which you have the information.
- Use the
build()method to construct the column object to be updated in Atlan. - Then call the
save()method against this built-up object to actually apply the update to Atlan. Because this operation will persist the asset in Atlan, you must provide it anAtlanClientthrough which to connect to the tenant.
POST /api/meta/entity/bulk
{
"entities": [ // (1)
}
]
}
- All columns must be wrapped in an
entitiesarray. - The
typeNamemust always beColumnfor profiling information. - You must provide the exact name of the column (case-sensitive).
- You must provide the exact
qualifiedNameof the column (case-sensitive). - Some profile information is common, regardless of the data type of the column. All are optional, so fill in only the pieces you want or for which you have the information.
- Some profile information is specific to numeric columns. All are optional, so fill in only the pieces you want or for which you have the information.
- Some profile information is specific to string columns. All are optional, so fill in only the pieces you want or for which you have the information.