feat/explains partition and split loading #2737
Conversation
…on due to driver problems
… uses normalizer config prop, adds tests and docs
Force-pushed 283e6fc to 72af356, then d84391c to cfc2bbd, then cfc2bbd to 731a3b8.
anuunchin left a comment:
🧠
dlt/sources/filesystem/__init__.py (Outdated)
file_glob (str, optional): The filter to apply to the files in glob format. by default lists all files in bucket_url non-recursively
kwargs: (Optional[Dict[str, Any]], optional): Additional arguments passed to fsspec constructor ie. dict(use_ssl=True) for s3fs
client_kwargs: (Optional[Dict[str, Any]], optional): Additional arguments passed to underlying fsspec native client ie. dict(verify="public.crt) for botocore
incremental (Optional[dlt.sources.incremental[Any]]): defines incremental cursor on listed files, with `modification_date`
Suggested change:
- incremental (Optional[dlt.sources.incremental[Any]]): defines incremental cursor on listed files, with `modification_date`
+ incremental (Optional[dlt.sources.incremental[Any]]): Defines incremental cursor on listed files, with `modification_date`
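For context, a minimal sketch of how the `incremental` argument documented above might be used; the bucket URL and glob are placeholder values, not part of this PR.

```python
import dlt
from dlt.sources.filesystem import filesystem

# list only files modified since the last run; bucket_url and
# file_glob below are hypothetical placeholders
new_files = filesystem(
    bucket_url="s3://example-bucket/data",
    file_glob="**/*.parquet",
    incremental=dlt.sources.incremental("modification_date"),
)

pipeline = dlt.pipeline(pipeline_name="fs_listing", destination="duckdb")
pipeline.run(new_files)
```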
dlt/sources/filesystem/__init__.py (Outdated)
client_kwargs: (Optional[Dict[str, Any]]): Additional arguments passed to underlying fsspec native client ie. dict(verify="public.crt) for botocore
kwargs (Optional[Dict[str, Any]]): Additional arguments passed to fsspec constructor ie. dict(use_ssl=True) for s3fs
client_kwargs (Optional[Dict[str, Any]]): Additional arguments passed to underlying fsspec native client ie. dict(verify="public.crt) for botocore
incremental (Optional[dlt.sources.incremental[Any]]): defines incremental cursor on listed files, with `modification_date`
Suggested change:
- incremental (Optional[dlt.sources.incremental[Any]]): defines incremental cursor on listed files, with `modification_date`
+ incremental (Optional[dlt.sources.incremental[Any]]): Defines incremental cursor on listed files, with `modification_date`
    state_only: bool = False,
    sources: Optional[Union[Iterable[Union[str, TSimpleRegex]], Union[str, TSimpleRegex]]] = None,
) -> _DropResult:
    # sources: Optional[Union[Iterable[Union[str, TSimpleRegex]], Union[str, TSimpleRegex]]] = None,
Are we removing the source regex because we're abandoning the idea of multiple sources per schema? 👀
AFAIR (correctly), the multiple sources per schema idea was one of the future todos in one of the tests.
This code had no effect. Right now there's a 1:1 relationship between a schema and a source, so selecting many resources when a schema is defined does not make sense.
docs/website/docs/dlt-ecosystem/verified-sources/sql_database/advanced.md (Outdated; resolved)
while not pipeline.run(incremental_table.add_limit(2)).is_empty:
    pass
this is so cool!
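For readers following along, here is a self-contained sketch of the loop quoted above, assuming a `sql_table` resource with a hypothetical `updated_at` cursor column; `is_empty` is the load-info property the docs in this PR rely on.

```python
import dlt
from dlt.sources.sql_database import sql_table

# hypothetical table and cursor column
incremental_table = sql_table(
    table="chat_message",
    incremental=dlt.sources.incremental("updated_at"),
)

pipeline = dlt.pipeline(pipeline_name="split_load", destination="duckdb")

# load at most 2 items per run and repeat until a run loads nothing;
# `is_empty` on the returned load info drives the stop condition
while not pipeline.run(incremental_table.add_limit(2)).is_empty:
    pass
```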
3. **Set the comparison direction in the query**. By default the greater than or equal op (**>=**) is used to compare the initial/previous value with the row column value. You can change it with the `last_value_func` argument (**max**/**min**).
4. **Set if the comparison is inclusive or exclusive**. By default the range is closed (equal values are included). [Look here for explanation and examples](advanced.md#inclusive-and-exclusive-filtering). Note that for closed ranges `dlt` will use [internal deduplication](../../../general-usage/incremental/cursor.md#deduplicate-overlapping-ranges) which adds some processing cost.
5. **Configure backfill options (optional)**. You can use `end_value` with `range_end` to read data from a specified range. You can also control the **order of returned rows** to split long incremental loading into many chunks by time and row count. [Look here for details and examples]
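To make the quoted list concrete, a minimal sketch combining points 3-5; the table and column names are hypothetical, and the `end_value`/`range_end` usage follows the docs added in this PR.

```python
import dlt
from dlt.sources.sql_database import sql_table

# backfill a fixed id window; `events` and `id` are placeholders
backfill = sql_table(
    table="events",
    incremental=dlt.sources.incremental(
        "id",
        initial_value=0,        # lower bound of the backfill range
        end_value=100_000,      # stop reading past this value
        last_value_func=max,    # comparison direction: >= (use min for <=)
        range_end="open",       # exclusive upper bound, skips dedup cost
    ),
)

pipeline = dlt.pipeline(pipeline_name="backfill", destination="duckdb")
pipeline.run(backfill)
```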
It seems like the part about the order of returned rows should be a separate point, no? 👀
docs/website/docs/dlt-ecosystem/verified-sources/sql_database/troubleshooting.md (Outdated; resolved)
3. You can fix your nested tables in both staging and final datasets. Add `_dlt_root_id` to all nested tables and copy data
   from the related [root (top level) tables](../general-usage/schema.md#nested-references-root-and-nested-tables) `_dlt_id` (`row_key`).
   In that case `dlt` will update the pipeline schema but will skip the database migration.
But as a general rule, should we be endorsing approaches that involve manual tampering inside the destination, outside of dlt's scope, in cases like these? (I just feel like the first two approaches are more dlt-idiomatic 👀)
Right! But sometimes you want to keep your data and (2) does not apply. It would be cool to have a tool that propagates the root key post factum, but we do not have it :)
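To make that manual fix concrete: a hedged sketch of propagating the root key post factum for a one-level nested table, using `pipeline.sql_client()`. The table names are hypothetical, the SQL must be adapted to the destination dialect, and this is not a built-in dlt tool.

```python
import dlt

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")

# for a one-level nested table the parent *is* the root table, so
# `_dlt_parent_id` already holds the root row's `_dlt_id` (`row_key`)
with pipeline.sql_client() as client:
    client.execute_sql(
        "ALTER TABLE orders__items ADD COLUMN _dlt_root_id VARCHAR"
    )
    client.execute_sql(
        """
        UPDATE orders__items
        SET _dlt_root_id = orders._dlt_id
        FROM orders
        WHERE orders__items._dlt_parent_id = orders._dlt_id
        """
    )
```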
Description
This PR documents and tests two methods of splitting pipeline runs into smaller chunks that may be useful for backfilling:
On top of that, a few improvements are added:
- `sort_order` may be used for split loading
- `LimitItem` allows counting rows (not only yields/batches)
- see the commit log for more