docs/website/docs/dlt-ecosystem/verified-sources/filesystem/basic.md (2 additions & 0 deletions)
@@ -439,6 +439,8 @@ print(load_info)
### 6. Split large incremental loads
If you have many files to process or they are large, you may choose to split pipeline runs into smaller chunks (a single file being the smallest unit). There are two methods to do that:
+* Partitioning, where you divide your source data into several ranges, load them (possibly in parallel), and then continue to load data incrementally.
+* Splitting, where you load data sequentially in small chunks.

Partitioning works as follows:
1. Obtain a list of files, e.g., by just listing your resource: `files = list(filesystem(...))`
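As a rough illustration of the partitioning approach (not taken from the changed file: the bucket URL, file glob, table name, and date ranges are placeholders, and `modification_date` is assumed as the partitioning cursor), a backfill could be partitioned like this:

```py
import dlt
from dlt.sources.filesystem import filesystem, read_jsonl
from pendulum import datetime

pipeline = dlt.pipeline("partitioned_backfill", destination="duckdb")

# 1. obtain a list of files by just listing the resource
files = list(filesystem(bucket_url="s3://my-bucket/data", file_glob="**/*.jsonl"))
print(f"{len(files)} files to backfill")

# 2. load each partition separately; independent ranges could also run in parallel
partitions = [
    (datetime(2023, 1, 1), datetime(2023, 7, 1)),
    (datetime(2023, 7, 1), datetime(2024, 1, 1)),
]
for initial, end in partitions:
    part = filesystem(bucket_url="s3://my-bucket/data", file_glob="**/*.jsonl")
    part.apply_hints(
        incremental=dlt.sources.incremental(
            "modification_date", initial_value=initial, end_value=end
        )
    )
    pipeline.run((part | read_jsonl()).with_name("events"))

# 3. afterwards, keep loading new files incrementally with the same resource and no end_value
```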
docs/website/docs/general-usage/incremental/cursor.md (43 additions & 3 deletions)
@@ -146,7 +146,7 @@ Note that dlt's incremental filtering considers the ranges half-closed. `initial
With the `row_order` argument set, dlt will stop retrieving data from the data source (e.g., GitHub API) if it detects that the values of the cursor field are out of the range of **start** and **end** values.
In particular:
-* dlt stops processing when the resource yields any item with a cursor value _equal to or greater than_ the `end_value` and `row_order` is set to **asc**. (`end_value` is not included)
+* dlt stops processing when the resource yields any item with a cursor value _equal to or greater than_ the `end_value` and `row_order` is set to **asc**. (`end_value` is not included, also see )
* dlt stops processing when the resource yields any item with a cursor value _lower_ than the `last_value` and `row_order` is set to **desc**. (`last_value` is included)
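For illustration of the **desc** case, here is a minimal, self-contained sketch (the resource, data, and destination are made up for the example, not taken from the docs): items arrive newest-first, and dlt closes the generator as soon as it sees a cursor value below `last_value`.

```py
import dlt

@dlt.resource
def events(
    created_at=dlt.sources.incremental(
        "created_at", initial_value="2024-01-02", row_order="desc"
    ),
):
    yield from [
        {"id": 3, "created_at": "2024-01-03"},  # processed
        {"id": 2, "created_at": "2024-01-02"},  # processed: equal to last_value, which is included
        {"id": 1, "created_at": "2024-01-01"},  # below last_value: dlt stops the generator here
    ]

pipeline = dlt.pipeline("row_order_demo", destination="duckdb")
print(pipeline.run(events()))
```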
:::note
@@ -215,7 +215,6 @@ def tickets(
        "updated_at",
        initial_value="2023-01-01T00:00:00Z",
        end_value="2023-02-01T00:00:00Z",
-        row_order="asc"
    ),
):
    for page in zendesk_client.get_pages(
@@ -229,7 +228,48 @@ def tickets(
```
:::
-## Deduplicate overlapping ranges with primary key
+## Partition large loads
+You can execute a backfill on a large amount of data by partitioning it into smaller fragments. This works best if you can divide the source data into non-overlapping ranges, load them (possibly in parallel), and then continue loading incrementally.
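A minimal sketch of such a partitioned backfill, assuming a resource with a `created_at` cursor (the resource, date ranges, and destination below are illustrative and not part of the change). Each range is an independent, repeatable run, and the ranges could also be executed in parallel:

```py
import dlt

@dlt.resource(primary_key="id")
def messages(
    created_at=dlt.sources.incremental("created_at"),
):
    # stand-in for a real source that would page through an API or a table
    # between created_at.last_value and created_at.end_value
    yield from [
        {"id": 1, "created_at": "2023-01-15T00:00:00Z"},
        {"id": 2, "created_at": "2023-02-15T00:00:00Z"},
    ]

pipeline = dlt.pipeline("partitioned_backfill", destination="duckdb")

# non-overlapping ranges, one backfill run per range
for initial, end in [
    ("2023-01-01T00:00:00Z", "2023-02-01T00:00:00Z"),
    ("2023-02-01T00:00:00Z", "2023-03-01T00:00:00Z"),
]:
    pipeline.run(
        messages(
            created_at=dlt.sources.incremental(
                "created_at", initial_value=initial, end_value=end
            )
        )
    )
```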
+
+:::note
+
+## Split large loads into chunks
+You can split large incremental resources into smaller chunks and load them sequentially. This way you'll see the data sooner, and in case of a loading error you can retry a single chunk. **This method works only if your source returns data in a deterministic order**, for example:
+* you can request your REST API endpoint to return data ordered by `updated_at`.
+* you use `row_order` on one of the supported sources, such as `sql_database` or `filesystem`.
+
+Below, we go for the second option and load data from a messages table ordered by the `created_at` column:
+```py
+# ... (the beginning of this snippet, defining the pipeline and the messages resource, is omitted in this excerpt)
+        row_order="asc",  # critical to set row_order when doing split loading
+        range_start="open",  # use open range to disable deduplication
+    ),
+)
+
+# produce a chunk each minute
+while pipeline.run(messages.add_limit(max_time=60)).has_data:
+    pass
+```
+Note how we combine `incremental` and `add_limit` to generate a chunk each minute. If you create an index on `created_at`, the database engine will be able to stream data using the index without scanning the whole table.
+
+:::caution
+If your source returns unordered data, you will most probably miss some data items or load them twice.
+:::
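Since the snippet above is shown only from the incremental configuration onward, here is a self-contained sketch of the same split-loading pattern, assuming the `sql_database` source and a `messages` table with a `created_at` column (connection string, pipeline name, and destination are placeholders):

```py
import dlt
from dlt.sources.sql_database import sql_database

pipeline = dlt.pipeline("split_backfill", destination="duckdb")

# select the messages table and track progress on created_at
messages = sql_database(
    credentials="postgresql://user:password@localhost/db",
    table_names=["messages"],
).messages
messages.apply_hints(
    incremental=dlt.sources.incremental(
        "created_at",
        row_order="asc",      # critical to set row_order when doing split loading
        range_start="open",   # use open range to disable deduplication
    ),
)

# keep producing one-minute chunks until the load reports no more data
while pipeline.run(messages.add_limit(max_time=60)).has_data:
    pass
```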
+
+## Deduplicate overlapping ranges
`Incremental` **does not** deduplicate datasets like the **merge** write disposition does. However, it ensures that when another portion of data is extracted, records that were previously loaded **at the end of range** won't be included again. `dlt` assumes that you load a range of data, where the lower bound is inclusive by default (i.e., greater than or equal). This ensures that you never lose any data but will also re-acquire some rows. For example, if you have a database table with a cursor field on `updated_at` which has a day resolution, then there's a high chance that after you extract data on a given day, more records will still be added. When you extract on the next day, you should reacquire data from the last day to ensure all records are present; however, this will create overlap with data from the previous extract.
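A small, runnable sketch of that overlap (the table data, resource, and destination are made up): the cursor has day resolution, so the second run re-reads the last day, and dlt drops the boundary rows it already loaded, matching them by the declared `primary_key`:

```py
import dlt

# pretend source table
rows = [
    {"id": 1, "updated_at": "2023-01-01"},
    {"id": 2, "updated_at": "2023-01-02"},
]

@dlt.resource(primary_key="id")
def table_rows(
    updated_at=dlt.sources.incremental("updated_at", initial_value="2023-01-01"),
):
    # emulates "SELECT * WHERE updated_at >= :last_value"
    yield from (r for r in rows if r["updated_at"] >= updated_at.last_value)

pipeline = dlt.pipeline("overlap_demo", destination="duckdb")
pipeline.run(table_rows())  # loads ids 1 and 2

rows.append({"id": 3, "updated_at": "2023-01-02"})  # a late row on the same day
pipeline.run(table_rows())  # re-reads the last day, but only the new id 3 is loaded
```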