Incremental extraction
Incremental extraction is an optimization that makes metadata refreshes for Microsoft Power BI workspaces more efficient. Instead of reprocessing everything on every sync, Atlan processes only the workspaces that have changed since the previous run.
This explanation walks through how incremental extraction works, why it operates at the workspace level, and what performance improvements you can expect in different environments. Understanding these concepts helps you optimize your sync strategy and set realistic expectations for your specific Power BI setup.
Why incremental extraction matters
When Atlan connects to Microsoft Power BI workspaces, it needs to regularly sync metadata to keep your data catalog up to date. Without optimization, every sync reprocesses all workspaces, datasets, reports, and dashboards, even those that haven't changed. This creates unnecessary overhead and longer processing times, especially in large environments.
Incremental extraction solves this efficiency problem by processing only what has actually changed since the last sync.
How the extraction process works
The incremental extraction process follows a three-step cycle that establishes a baseline, monitors changes, and selectively processes only what needs updating.
- Establish initial baseline: During the first extraction run, Atlan establishes a baseline by processing every workspace and recording when this complete sync occurred.
  First Run (Full Extraction)
  ├── Atlan processes ALL workspaces
  ├── Extracts all assets (datasets, reports, dashboards)
  ├── Records timestamp: T1
  └── Stores complete metadata snapshot
- Track workspace modifications: Between sync runs, Power BI's Admin APIs continuously track workspace-level changes. Any modification, whether major (like adding a new dataset) or minor (like renaming a report tile), marks the entire workspace as "modified" with a new timestamp.
  Between Runs
  ├── Workspace A: Last modified T2 (after T1) → Changed
  ├── Workspace B: Last modified T0 (before T1) → Unchanged
  ├── Workspace C: Last modified T3 (after T1) → Changed
  └── Workspace D: Last modified T0 (before T1) → Unchanged
- Process only modified workspaces: On the next run, Atlan queries Power BI's API asking "which workspaces have been modified since timestamp T1?" It then processes only those flagged workspaces.
  Second Run (Incremental Extraction)
  ├── Query: "Which workspaces changed since T1?"
  ├── API Response: Workspace A, Workspace C
  ├── Process ONLY Workspace A and C
  │   ├── Re-extract all assets within these workspaces
  │   └── Update lineage information
  ├── Skip Workspace B and D (unchanged)
  └── Record new timestamp: T4
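The selection logic in this cycle can be sketched in a few lines of Python. This is only an illustration of the idea, not Atlan's implementation: the function name `select_modified`, the `last_modified` dictionary, and the concrete timestamps standing in for T0 to T3 are all assumptions made for the example.

```python
from datetime import datetime, timezone


def select_modified(last_modified, last_sync):
    """Return the workspaces to process on this run.

    last_modified: dict mapping workspace name -> last-modified timestamp
    last_sync: timestamp of the previous run, or None on the first run
    """
    if last_sync is None:
        # First run: no baseline exists yet, so every workspace is processed.
        return sorted(last_modified)
    # Later runs: only workspaces modified after the recorded timestamp.
    return sorted(name for name, ts in last_modified.items() if ts > last_sync)


def at_hour(hour):
    # Hypothetical helper that builds example timestamps.
    return datetime(2024, 1, 1, hour, tzinfo=timezone.utc)


# Illustrative stand-ins for the T0..T3 labels used in the diagrams.
T0, T1, T2, T3 = at_hour(0), at_hour(1), at_hour(2), at_hour(3)

last_modified = {
    "Workspace A": T2,  # changed after T1
    "Workspace B": T0,  # unchanged
    "Workspace C": T3,  # changed after T1
    "Workspace D": T0,  # unchanged
}

first_run = select_modified(last_modified, None)
second_run = select_modified(last_modified, T1)
print(first_run)
print(second_run)
```

With this sample data, the first run selects all four workspaces and the second run selects only Workspace A and Workspace C.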
Navigate workspace-level granularity constraints
Power BI's API operates at the workspace level, not the individual asset level. This means:
- Trigger: Any change within a workspace (even minor ones)
- Response: The entire workspace gets marked as modified
- Processing: All assets within that workspace must be reprocessed
Workspace Modification Example
├── Original state: 50 reports, 10 datasets
├── Change made: Rename 1 report title
├── API response: Entire workspace flagged as "modified"
└── Atlan processes: All 50 reports + 10 datasets
This granularity limitation is a constraint of Power BI's architecture, not Atlan's approach.
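To make the cost of workspace-level granularity concrete, the following sketch counts how many assets get reprocessed when a workspace is flagged. The data structure and the second workspace ("Finance") are hypothetical; only the 50 reports and 10 datasets come from the example above.

```python
def assets_to_reprocess(workspaces):
    """Sum the asset counts of every workspace flagged as modified."""
    return sum(
        ws["reports"] + ws["datasets"]
        for ws in workspaces
        if ws["modified"]
    )


workspaces = [
    # One report title was renamed, so the whole workspace is flagged.
    {"name": "Sales", "reports": 50, "datasets": 10, "modified": True},
    # Hypothetical untouched workspace, skipped entirely.
    {"name": "Finance", "reports": 30, "datasets": 5, "modified": False},
]

total = assets_to_reprocess(workspaces)
print(total)  # 60
```

A single rename triggers reprocessing of all 60 assets in the flagged workspace, while the unflagged workspace contributes nothing.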
Understanding performance impact
The effectiveness of incremental extraction depends on your environment's change patterns:
- High-efficiency scenarios (60-70% improvement): Large, stable environments where most workspaces remain unchanged between runs create the ideal conditions. If only 2 out of 20 workspaces change daily, you process 90% fewer workspaces.
- Moderate-efficiency scenarios (20-30% improvement): Moderately active environments where roughly half the workspaces change between runs still provide meaningful savings, though less dramatic.
- Low-efficiency scenarios (under 5% improvement): High-churn environments where most workspaces change frequently between runs see minimal benefits, as most workspaces require reprocessing anyway.
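The "90% fewer workspaces" figure is simple arithmetic, sketched below with a hypothetical helper named `workspace_reduction`. Note that it measures the share of workspaces skipped, which is not the same quantity as the improvement ranges quoted for each scenario; those ranges are taken as given and are not derived here.

```python
def workspace_reduction(changed, total):
    """Percentage of workspaces skipped when only `changed` of `total` are processed."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= changed <= total:
        raise ValueError("changed must be between 0 and total")
    return 100 * (total - changed) / total


print(workspace_reduction(2, 20))   # 90.0 (2 of 20 workspaces changed)
print(workspace_reduction(10, 20))  # 50.0 (half the workspaces changed)
print(workspace_reduction(19, 20))  # 5.0  (nearly everything changed)
```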
Maximize efficiency gains
- Sync frequency: Daily runs minimize the accumulation of changes, keeping more workspaces "unchanged" between runs.
- Environment stability: Workspaces with infrequent modifications create more opportunities for skipping during incremental runs.
- Change distribution: Environments where changes cluster in specific workspaces (rather than spreading across all workspaces) benefit more from incremental extraction.
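The effect of change distribution can be illustrated with a small sketch. Because any change flags its whole workspace, what matters is how many distinct workspaces are touched, not how many changes occur. The workspace IDs and change counts below are made up for the example.

```python
def flagged_workspaces(change_events):
    """Return the distinct workspaces touched by a list of change events.

    Each event is represented by the ID of the workspace it occurred in.
    """
    return set(change_events)


# Ten changes clustered in two workspaces.
clustered = ["ws-01"] * 6 + ["ws-02"] * 4
# Ten changes spread across ten different workspaces.
spread = [f"ws-{i:02d}" for i in range(1, 11)]

print(len(flagged_workspaces(clustered)))  # 2
print(len(flagged_workspaces(spread)))     # 10
```

The same ten changes flag two workspaces in the clustered case and ten in the spread case, so the clustered environment leaves far more workspaces to skip.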
See also
- Crawl Microsoft Power BI assets: Configure and run Power BI metadata extraction workflows
- Troubleshooting connectivity: Resolve common Power BI connector issues and performance problems