
Conversation

@rudolfix (Collaborator) commented Jun 9, 2025

Description

This PR documents and tests two methods of splitting pipeline runs into smaller chunks, which can be useful for backfilling:

  • partition loading: source data is partitioned into several ranges that can be loaded independently
  • split loading: source data is split into several sequential ranges using the limit item

On top of that, a few improvements are included:

  • the filesystem source follows sort_order and can be used for split loading
  • LimitItem can count rows (not only yields/batches)
  • several improvements to root_key propagation: (1) simplified implementation, (2) parent_key can be used when the nesting level is < 2, (3) root_key is no longer enabled on scd2, (4) additional tests and docs

See the commit log for more; a minimal sketch of split loading follows.
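To make split loading concrete, here is a minimal, hedged sketch (the `events` resource and `fetch_rows` helper are hypothetical; partition loading would instead run independent pipelines over disjoint `initial_value`/`end_value` ranges):

```python
import dlt

@dlt.resource(table_name="events")
def events(
    updated_at=dlt.sources.incremental("updated_at", initial_value="2024-01-01T00:00:00Z")
):
    # hypothetical fetch; rows must come back ordered by the cursor column
    # so that each limited chunk covers a contiguous range
    yield from fetch_rows(since=updated_at.last_value)

pipeline = dlt.pipeline("backfill_demo", destination="duckdb", dataset_name="events_data")

# split loading: cap each run with add_limit and repeat; incremental state is
# persisted between runs, so consecutive runs load sequential chunks until
# a run produces no load packages
while not pipeline.run(events().add_limit(2)).is_empty:
    pass
```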

@netlify

netlify bot commented Jun 9, 2025

Deploy Preview for dlt-hub-docs canceled.

🔨 Latest commit: b9bb4dc
🔍 Latest deploy log: https://app.netlify.com/projects/dlt-hub-docs/deploys/68ce797ce868280008c9ca19

@github-actions

⚠️ Possible file(s) that should be tracked in LFS detected ⚠️

    The following file(s) exceed the file size limit of 50000 bytes, as set in the .yml configuration files:

    docs/website/docs/dlt-ecosystem/verified-sources/rest_api/basic.md

    Consider using git-lfs to manage large files.

@github-actions github-actions bot added the lfs-detected! label (large files were committed to the PR) Jul 11, 2025
@github-actions

⚠️ Possible file(s) that should be tracked in LFS detected ⚠️

    The following file(s) exceed the file size limit of 50000 bytes, as set in the .yml configuration files:

    docs/website/docs/dlt-ecosystem/verified-sources/rest_api/basic.md

    Consider using git-lfs to manage large files.

@github-actions github-actions bot removed the lfs-detected! label (large files were committed to the PR) Jul 11, 2025
@rudolfix rudolfix force-pushed the feat/explains-partition-and-split-loading branch from 283e6fc to 72af356 Compare September 12, 2025 22:38
@rudolfix rudolfix marked this pull request as ready for review September 14, 2025 19:28
@rudolfix rudolfix force-pushed the feat/explains-partition-and-split-loading branch from d84391c to cfc2bbd Compare September 15, 2025 10:05
@rudolfix rudolfix self-assigned this Sep 15, 2025
@rudolfix rudolfix added the ci full label (used to trigger CI on a PR for full load tests) Sep 15, 2025
@rudolfix rudolfix force-pushed the feat/explains-partition-and-split-loading branch from cfc2bbd to 731a3b8 Compare September 15, 2025 17:25
@rudolfix rudolfix requested a review from anuunchin September 15, 2025 17:27
@anuunchin (Contributor) left a comment

🧠

file_glob (str, optional): The filter to apply to the files in glob format. By default lists all files in bucket_url non-recursively
kwargs (Optional[Dict[str, Any]], optional): Additional arguments passed to the fsspec constructor, i.e. dict(use_ssl=True) for s3fs
client_kwargs (Optional[Dict[str, Any]], optional): Additional arguments passed to the underlying fsspec native client, i.e. dict(verify="public.crt") for botocore
incremental (Optional[dlt.sources.incremental[Any]]): defines incremental cursor on listed files, with `modification_date`

Suggested change:

- incremental (Optional[dlt.sources.incremental[Any]]): defines incremental cursor on listed files, with `modification_date`
+ incremental (Optional[dlt.sources.incremental[Any]]): Defines incremental cursor on listed files, with `modification_date`

kwargs (Optional[Dict[str, Any]]): Additional arguments passed to the fsspec constructor, i.e. dict(use_ssl=True) for s3fs
client_kwargs (Optional[Dict[str, Any]]): Additional arguments passed to the underlying fsspec native client, i.e. dict(verify="public.crt") for botocore
incremental (Optional[dlt.sources.incremental[Any]]): defines incremental cursor on listed files, with `modification_date`

Suggested change:

- incremental (Optional[dlt.sources.incremental[Any]]): defines incremental cursor on listed files, with `modification_date`
+ incremental (Optional[dlt.sources.incremental[Any]]): Defines incremental cursor on listed files, with `modification_date`
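For reference, a hedged usage sketch of the parameters documented above (the bucket URL and cert path are made up; `kwargs`/`client_kwargs` are forwarded to fsspec as the docstring describes):

```python
import dlt
from dlt.sources.filesystem import filesystem

files = filesystem(
    bucket_url="s3://example-bucket/data",       # hypothetical bucket
    file_glob="**/*.jsonl",                      # recursive glob instead of the non-recursive default
    kwargs={"use_ssl": True},                    # passed to the fsspec constructor (s3fs)
    client_kwargs={"verify": "public.crt"},      # passed to the underlying botocore client
    incremental=dlt.sources.incremental("modification_date"),  # cursor on listed files
)

pipeline = dlt.pipeline("fs_demo", destination="duckdb")
pipeline.run(files)
```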

state_only: bool = False,
sources: Optional[Union[Iterable[Union[str, TSimpleRegex]], Union[str, TSimpleRegex]]] = None,
) -> _DropResult:
# sources: Optional[Union[Iterable[Union[str, TSimpleRegex]], Union[str, TSimpleRegex]]] = None,
Contributor:

Are we removing the source regex because we're abandoning the idea of multiple sources per schema? 👀

Contributor:

As far as I recall, the multiple sources per schema idea was one of the future TODOs in one of the tests.

Collaborator (Author):

This code had no effect. Right now there is a 1:1 relation between a schema and a source, so selecting many resources when a schema is defined does not make sense.

Comment on lines +101 to +102
while not pipeline.run(incremental_table.add_limit(2)).is_empty:
pass
Contributor:

This is so cool!
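For context: each `run()` call loads one limited chunk and returns load info whose `is_empty` flag signals that the run produced no data, so the loop keeps loading sequential chunks until the resource is exhausted.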

3. **Set the comparison direction in the query**. By default, the greater-than-or-equal operator (**>=**) is used to compare the initial/previous value with the row column value. You can change it with the `last_value_func` argument (**max**/**min**).
4. **Set whether the comparison is inclusive or exclusive**. By default the range is closed (equal values are included). [Look here for explanation and examples](advanced.md#inclusive-and-exclusive-filtering). Note that for closed ranges `dlt` will use [internal deduplication](../../../general-usage/incremental/cursor.md#deduplicate-overlapping-ranges), which adds some processing cost.
5. **Configure backfill options (optional)**. You can use `end_value` with `range_end` to read data from a specified range. You can also control the **order of returned rows**
to split long incremental loading into many chunks by time and row count. [Look here for details and examples]
Contributor:

It seems like the part about the order of returned rows should be a separate point, or no? 👀
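A hedged sketch of the backfill options described in the excerpt above (the `events` resource and `fetch_rows` helper are hypothetical; parameter names follow the incremental API referenced in the docs):

```python
import dlt

@dlt.resource(table_name="events")
def events(
    updated_at=dlt.sources.incremental(
        "updated_at",
        initial_value="2024-01-01T00:00:00Z",  # start of the backfill range
        end_value="2024-02-01T00:00:00Z",      # stop once the cursor passes this value
        last_value_func=max,                   # comparison direction: >= (use min for <=)
        range_end="open",                      # exclusive upper bound, skips deduplication
        row_order="asc",                       # declare rows arrive in cursor order so dlt can stop early
    )
):
    # hypothetical fetch over the configured range
    yield from fetch_rows(updated_at.last_value, updated_at.end_value)

pipeline = dlt.pipeline("backfill_range", destination="duckdb")
pipeline.run(events())
```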

Comment on lines +239 to +241
3. You can fix your nested tables in both staging and final datasets. Add `_dlt_root_id` to all nested tables and copy the
`_dlt_id` (`row_key`) from the related [root (top level) tables](../general-usage/schema.md#nested-references-root-and-nested-tables).
In that case `dlt` will update the pipeline schema but will skip the database migration.
Contributor:

But as a general rule, should we be endorsing approaches that involve manual tampering inside the destination, outside of dlt's scope, in cases like these? (I just feel like the first two approaches are more dlt-idiomatic 👀)

Collaborator (Author):

Right! But sometimes you want to keep your data and (2) does not apply. It would be cool to have a tool that propagates the root key post factum, but we do not have it :)
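For completeness, a hedged sketch of the manual fix quoted above (a hypothetical `orders` root table with a nesting-level-1 child `orders__items`; deeper levels would need to walk up through `_dlt_parent_id`, and the ALTER/UPDATE syntax varies by destination):

```python
import dlt

pipeline = dlt.pipeline("orders_pipeline", destination="duckdb", dataset_name="shop")

with pipeline.sql_client() as client:
    # add the root key column to the nested table
    client.execute_sql("ALTER TABLE orders__items ADD COLUMN _dlt_root_id VARCHAR")
    # at nesting level 1 the parent row key is the root row key,
    # so _dlt_root_id can be copied straight from the parent's _dlt_id
    client.execute_sql(
        """
        UPDATE orders__items AS child
        SET _dlt_root_id = parent._dlt_id
        FROM orders AS parent
        WHERE child._dlt_parent_id = parent._dlt_id
        """
    )
```

Per the quoted docs, once the column is populated, `dlt` updates the pipeline schema and skips the database migration.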

@cloudflare-workers-and-pages

cloudflare-workers-and-pages bot commented Sep 19, 2025

Deploying with Cloudflare Workers

The latest updates on your project:

Status: ❌ Deployment failed (view logs)
Name: docs
Latest commit: b9bb4dc
Updated (UTC): Sep 20 2025, 09:57 AM

@rudolfix rudolfix merged commit 6f01555 into devel Sep 20, 2025
11 of 13 checks passed
@rudolfix rudolfix deleted the feat/explains-partition-and-split-loading branch September 20, 2025 09:53