End-to-end bulk update

Running example

To walk through this using an example, and to compare and contrast the approaches, imagine you want to:

  • Mark all views (including materialized views) in a particular schema as verified, unless they already have some certificate.
  • Change the owner of the same views.

Step-by-step

The usual end-to-end pattern for updating many assets efficiently involves three steps:

  1. Finding the assets you want to update.
  2. Applying your updates to each asset (in-memory).
  3. Sending those changes to Atlan (in batches).

You can do each of these steps in sequence, for example:

1. Find assets

Start by finding the assets you want to update. This is usually best done through a search. (For other common examples, have a look at the search snippets.)

Example: get all views in a schema
String schemaQN = "default/snowflake/1662194632/MYDB/MY_SCH"; // (1)
IndexSearchRequest findViews = client.assets.select() // (2)
        .where(Asset.QUALIFIED_NAME.startsWith(schemaQN)) // (3)
        .where(Asset.TYPE_NAME.in(List.of(View.TYPE_NAME, MaterializedView.TYPE_NAME))) // (4)
        .whereNot(Asset.CERTIFICATE_STATUS.hasAnyValue()) // (5)
        .pageSize(100) // (6)
        .includeOnResults(Asset.DESCRIPTION) // (7)
        .includeOnResults(Asset.CERTIFICATE_STATUS)
        .includeOnResults(Asset.OWNER_USERS)
        .toRequest(); // (8)

IndexSearchResponse response = findViews.search(); // (9)
  1. The qualifiedName of every view starts with the qualifiedName of its parent (schema), so we can limit the results to a particular schema by using the qualifiedName.

  2. To start building up a query with multiple conditions, you can use the select() helper on any client's assets member.

  3. You can chain where() methods to define all the conditions the search results must match. You can use the static constants within any given type to select a particular attribute (like QUALIFIED_NAME in this example), and then limit results to only those assets whose qualifiedName starts with the qualifiedName of the schema (by using the startsWith() predicate). In this example, that means only assets that are within this particular schema will be returned as results.

  4. Since there could be tables, views, materialized views and columns in this schema—but you only want views and materialized views—you can use the Asset.TYPE_NAME.in() method to restrict results to only views and materialized views.

  5. Since you only want to update views that don't already have a certificate, you can further limit the results using the whereNot() method. This will exclude any assets where a certificate already hasAnyValue().

  6. Here you can experiment with different page sizes, retrieving more results per page to reduce the number of API calls.

  7. Add as many attributes as needed. Each attribute you add here ensures that detail is included in each search result. So in this example, every view will include its description, certificate, and individual owners. (Limit these attributes to the minimum you need about each view to do your intended work.)

  8. You can translate the object you've built up into various outputs: for example, immediately calculating a count of how many results match, or streaming them directly for processing. In this case, the toRequest() method will give you back the full set of criteria as a complete index search request. (A count-only sketch follows these notes.)

  9. You can then execute the search based on the request.
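For example, if all you need is the number of matching assets, you could end the same chain with count() instead of toRequest(). A minimal sketch, assuming the count() terminal method on the same fluent search builder (which runs the search, returns only the total, and can throw an AtlanException):

long total = client.assets.select()
        .where(Asset.QUALIFIED_NAME.startsWith(schemaQN))
        .where(Asset.TYPE_NAME.in(List.of(View.TYPE_NAME, MaterializedView.TYPE_NAME)))
        .whereNot(Asset.CERTIFICATE_STATUS.hasAnyValue())
        .count(); // runs the search, returning only the number of matches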

2. Build up your changes

Next, you iterate through those results and make the changes you want to each one. Use the multiple operations pattern to make multiple changes to each asset.

Example: iterate through results and make changes
AssetBatch batch = new AssetBatch(client, 20); // (1)
try {
    for (Asset result : response) { // (2)
        Asset revised = result.trimToRequired() // (3)
                .certificateStatus(CertificateStatus.VERIFIED) // (4)
                .ownerUser("jsmith") // example username
                .build();
        batch.add(revised); // (5)
    }
  1. Create a batch of assets to build up the changes across multiple assets before applying those changes in Atlan itself.

    • The first parameter defines the Atlan tenant on which the batch will be processed
    • The second specifies the maximum number of assets to build up before sending them across to Atlan

By default (using only the options above) no classifications or custom metadata will be added or changed on the assets in each batch. To also include classifications and custom metadata, you need to use these additional parameters (a hedged sketch of a fully-parameterized batch follows these notes):

  • A third parameter of true to replace all classifications on the assets in the batch, which would include removing classifications if none are provided for the assets in the batch itself (or false if you still want to ignore classifications)
  • A fourth parameter to control how custom metadata should be handled for the assets: IGNORE any custom metadata changes in the batch, OVERWRITE to replace all custom metadata with what's provided in the batch (including removing custom metadata that already exists on an asset), or MERGE to only add or update custom metadata based on what's in the batch (leaving other existing custom metadata unchanged)
  • A fifth parameter to control whether failures should be captured across batches (true) or ignored (false)
  • A sixth parameter to control whether the batch should only attempt to update assets that already exist (true) or also create assets if they don't yet exist (false)
  • A seventh parameter to control whether details about each created and updated asset across batches should be tracked (true) or ignored (false)—counts will always be kept
  • An eighth parameter to control whether the matching for determining whether an asset already exists should be done in a case-insensitive way (true) or case-sensitively (false)
  • A ninth parameter to control what kind of assets to create, if not running in updateOnly mode: partial assets (only available in lineage), or full assets
  • A tenth parameter to control whether the matching for determining whether an asset already exists should be done strictly according to the data type specified (false), or if tables, views and materialized views should be treated interchangeably (true)
  2. This is the pattern for iterating through all results (across pages) covered in the Searching for assets portion of the SDK documentation.

  3. Every asset implements the trimToRequired() method, which gives you a builder containing only the bare minimum information needed to update that asset.

    Limit your asset to only what you intend to update

    When you send an update to Atlan, it will only attempt to change the information you send in your request—leaving any information not in your request as-is (unchanged) on the asset in Atlan. By using trimToRequired() you can remove all information you don't want to update, and then chain on only the details you do want to update.

  4. In this running example, you are updating the certificate to verified and setting a new owner—so you simply chain those updates onto the trimmed builder.

  5. You can then add your (in-memory) modified asset to the batch.

    Auto-saves as it goes

    As long as the number of assets built up is below the maximum batch size specified when creating the batch, this will simply continue to build up the batch. As soon as you hit the size limit for the batch, though, this same method will call the save() operation to batch-update all of those assets in a single API call.

    Remember to flush

    Since your loop could finish before you reach another full batch, you must always remember to flush() the batch. This will send any remaining assets that were queued up but hadn't yet reached a full batch.
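To make the parameter list in note 1 above concrete, here is a minimal sketch of a fully-parameterized batch, following the parameter order described in those notes. The specific values (and the CustomMetadataHandling and AssetCreationHandling enum values used for the fourth and ninth positions) are illustrative assumptions; adjust them to your use case:

AssetBatch fullBatch = new AssetBatch(
        client, // Atlan tenant on which to process the batch
        20, // maximum assets to build up before saving
        false, // ignore (do not replace) classifications
        AssetBatch.CustomMetadataHandling.IGNORE, // ignore custom metadata changes
        true, // capture failures across batches
        true, // only update assets that already exist
        false, // keep counts only, without per-asset details
        false, // match existing assets case-sensitively
        AssetCreationHandling.FULL, // create full (not partial) assets, if creating any
        false); // match strictly on the specified type (not table/view-agnostic)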

3. Save them in batches

Finally, send the changes you have queued up in batches. Use the multiple assets pattern to update multiple assets at the same time.

Example: save the changes in batches
    batch.flush(); // (1)
} catch (AtlanException e) { // (2)
    // Handle the error as you see fit, for example by logging it:
    log.error("Unable to bulk-update assets.", e);
}
  1. The AssetBatch's add() method used in the previous step will automatically save as its internal queue of assets reaches a full batch size.

    Remember to flush

    However, since your loop could finish before you reach another full batch, you must always remember to flush() the batch. This will send any remaining assets that were queued up.

  2. Both the .add() and .flush() operations of the AssetBatch could send a request over to Atlan. Either can therefore also run into trouble and raise an error through an AtlanException. It's up to you to handle such potential errors as you see fit.
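If you enable failure capture (the fifth parameter described in step 2), the batch can record failed sub-batches instead of stopping at the first error, so you can review them after flushing. A rough sketch, assuming a constructor overload that stops at the fifth parameter and a getFailures() accessor whose entries carry the failure reason:

AssetBatch capturing = new AssetBatch(client, 20, false,
        AssetBatch.CustomMetadataHandling.IGNORE, true); // capture failures
// ... add() assets and flush() as shown above ...
capturing.getFailures().forEach(failed -> // review anything that failed
        log.warn("Sub-batch failed: {}", failed.getFailureReason()));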

Pipelining

Alternatively, when using an SDK, you can pipeline these operations together. The pipeline will run just as efficiently as the step-by-step approach above:

  • Pushing down the criteria to run as a search on Atlan
  • Lazily fetching each page of results
  • Batching up and bulk-saving changes
Example: pipelining
String schemaQN = "default/snowflake/1662194632/MYDB/MY_SCH"; // (1)
try (ParallelBatch batch = new ParallelBatch(client, 20)) { // (2)
    client.assets.select() // (3)
            .where(Asset.QUALIFIED_NAME.startsWith(schemaQN)) // (4)
            .where(Asset.TYPE_NAME.in(List.of(View.TYPE_NAME, MaterializedView.TYPE_NAME))) // (5)
            .whereNot(Asset.CERTIFICATE_STATUS.hasAnyValue()) // (6)
            .pageSize(100) // (7)
            .includeOnResults(Asset.DESCRIPTION) // (8)
            .includeOnResults(Asset.CERTIFICATE_STATUS)
            .includeOnResults(Asset.OWNER_USERS)
            .stream(true) // (9)
            .forEach(result -> { // (10)
                try {
                    Asset revised = result.trimToRequired() // (11)
                            .certificateStatus(CertificateStatus.VERIFIED) // (12)
                            .ownerUser("jsmith") // example username
                            .build();
                    batch.add(revised); // (13)
                } catch (AtlanException e) { // (14)
                    log.error("Unable to update: {}", result.getQualifiedName());
                }
            });
    batch.flush(); // (15)
    log.info("Created: {}", batch.getCreated().size());
    log.info("Updated: {}", batch.getUpdated().size());
}
  1. The qualifiedName of every view starts with the qualifiedName of its parent (schema), so we can limit the results to a particular schema by using the qualifiedName.

  2. Create a batch of assets to build up the changes across multiple assets before applying those changes in Atlan itself. When parallel-processing (see the further notes on stream(true) below) you need to use a parallel-capable ParallelBatch:

    • The first parameter defines the Atlan tenant on which the batch will be processed
    • The second specifies the maximum number of assets to build up before sending them across to Atlan

By default (using only the options above) no classifications or custom metadata will be added or changed on the assets in each batch. To also include classifications and custom metadata, you need to use these additional parameters:

  • A third parameter of true to replace all classifications on the assets in the batch, which would include removing classifications if none are provided for the assets in the batch itself (or false if you still want to ignore classifications)
  • A fourth parameter to control how custom metadata should be handled for the assets: IGNORE any custom metadata changes in the batch, OVERWRITE to replace all custom metadata with what's provided in the batch (including removing custom metadata that already exists on an asset), or MERGE to only add or update custom metadata based on what's in the batch (leaving other existing custom metadata unchanged)
  • A fifth parameter to control whether failures should be captured across batches (true) or ignored (false)
  • A sixth parameter to control whether the batch should only attempt to update assets that already exist (true) or also create assets if they don't yet exist (false)
  • A seventh parameter to control whether details about each created and updated asset across batches should be tracked (true) or ignored (false)—counts will always be kept
  • An eighth parameter to control whether the matching for determining whether an asset already exists should be done in a case-insensitive way (true) or case-sensitively (false)
  • A ninth parameter to control what kind of assets to create, if not running in updateOnly mode: partial assets (only available in lineage), or full assets
  • A tenth parameter to control whether the matching for determining whether an asset already exists should be done strictly according to the data type specified (false), or if tables, views and materialized views should be treated interchangeably (true)
  3. You can then start defining a pipeline directly against the client's assets by using the select() method.

    Including archived (soft-deleted) assets

    Searches will by default return all assets that are found—whether active or archived (soft-deleted). In most cases you probably only want the active ones, so limiting to active assets is the default behavior of select(). If you do want archived (soft-deleted) assets included in the results, send true to this select() method.

  4. You can chain as many where() methods as you want to define all the conditions the search results must match. You can use the static constants within any given type to select a particular attribute (like QUALIFIED_NAME in this example), and then limit results to only those assets whose qualifiedName starts with the qualifiedName of the schema (by using the startsWith() predicate). In this example, that means only assets that are within this particular schema will be returned as results.

  5. Since there could be tables, views, materialized views and columns in this schema—but you only want views and materialized views—you can use the Asset.TYPE_NAME.in() method to restrict results to only views and materialized views.

  6. Since you only want to update views that don't already have a certificate, you can further limit the results using the whereNot() method. This will exclude any assets where a certificate already hasAnyValue().

  7. (Optional) You can experiment with different page sizes, retrieving more results per page to reduce the number of API calls.

  8. Add as many attributes as needed. Each attribute you add here ensures that detail is included in each search result. So in this example, every view will include its description, certificate, and individual owners. (Limit these attributes to the minimum you need about each view to do your intended work.)

  9. Once you have defined the criteria for your pipeline, call the stream() method to push-down the pipeline to Atlan. This will:

    • Create a search that combines all the criteria you have specified.
    • Run that search against Atlan to produce the first page of results.
    • Page through the results by lazily fetching each subsequent page as you iterate through them. (So if you use a limit() on the stream, for example, you can break out before retrieving all pages.)
    Can also run in parallel threads

    You can also parallel-stream the results by passing true to the stream() method. This will spawn multiple threads that each independently process a page of results and combine the results in parallel. While this can be significantly faster for processing many results, keep in mind that if you are collecting the results into any structure, that structure must be thread-safe. (For example, you'll need to use things like ConcurrentHashMap rather than just HashMap, and ParallelBatch rather than AssetBatch if making changes. A sketch of thread-safe collection follows these notes.)

  10. For each result, you can then carry out your changes and submit them into the batch.

  11. Every asset implements the trimToRequired() method, which gives you a builder containing only the bare minimum information needed to update that asset.

    Limit your asset to only what you intend to update

    When you send an update to Atlan, it will only attempt to change the information you send in your request—leaving any information not in your request as-is (unchanged) on the asset in Atlan. By using trimToRequired() you can remove all information you don't want to update, and then chain on only the details you do want to update.

  12. In this running example, you are updating the certificate to verified and setting a new owner—so you simply chain those updates onto the trimmed builder.

  13. You can then add your (in-memory) modified asset to the batch.

    Auto-saves as it goes

    As long as the number of assets built up is below the maximum batch size specified when creating the batch, this will simply continue to build up the batch. As soon as you hit the size limit for the batch, though, this same method will call the save() operation to batch-update all of those assets in a single API call.

    Remember to flush

    Since your loop could finish before you reach another full batch, you must always remember to flush() the batch. This will send any remaining assets that were queued up but hadn't yet reached a full batch.

  14. Both the .add() and .flush() operations of the batch could send a request over to Atlan. Either can therefore also run into trouble and raise an error through an AtlanException. It's up to you to handle such potential errors as you see fit.

  15. The batch's add() method used in the previous step will automatically save as its internal queue of assets reaches a full batch size.

    Remember to flush

    However, since your loop could finish before you reach another full batch, you must always remember to flush() the batch. This will send any remaining assets that were queued up.
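To illustrate the thread-safety point from the stream(true) notes above, here is a minimal sketch that collects each view's description into a ConcurrentHashMap while parallel-streaming. The map itself, and collecting descriptions at all, is purely illustrative:

Map<String, String> descriptions = new ConcurrentHashMap<>(); // thread-safe, unlike HashMap
client.assets.select()
        .where(Asset.QUALIFIED_NAME.startsWith(schemaQN))
        .where(Asset.TYPE_NAME.in(List.of(View.TYPE_NAME, MaterializedView.TYPE_NAME)))
        .includeOnResults(Asset.DESCRIPTION)
        .stream(true) // parallel: multiple threads each process a page of results
        .forEach(result -> {
            String desc = result.getDescription();
            if (desc != null) { // ConcurrentHashMap does not allow null values
                descriptions.put(result.getQualifiedName(), desc);
            }
        });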
