Commit 20f6b04

adds examples and step by step explanation for refresh modes (#1560)

1 file changed: docs/website/docs/general-usage/pipeline.md (+58 -19)

You can reset parts or all of your sources by using the `refresh` argument to `dlt.pipeline` or `pipeline.run`.
That means when you run the pipeline, the sources/resources being processed will have their state reset and their tables either dropped or truncated,
depending on which refresh mode is used.

The `refresh` option works with all relational/SQL destinations and file buckets (`filesystem`). It does not work with vector databases (we are working on that) or
with custom destinations.
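
Both call sites follow from the opening sentence above. A minimal sketch, assuming a hypothetical `my_source` source (the `refresh` parameter on `dlt.pipeline` itself is taken from that sentence, not shown elsewhere in this section):

```py
import dlt

# set once when the pipeline is created: the mode applies to its runs
pipeline = dlt.pipeline("my_pipeline", destination="duckdb", refresh="drop_sources")

# or set for a single run only
pipeline = dlt.pipeline("my_pipeline", destination="duckdb")
pipeline.run(my_source(), refresh="drop_sources")
```
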
The `refresh` argument should have one of the following string values to decide the refresh mode:
### Drop tables and pipeline state for a source with `drop_sources`
All sources being processed in `pipeline.run` or `pipeline.extract` are refreshed.
That means all tables listed in their schemas are dropped and the state belonging to those sources and all their resources is completely wiped.
The tables are deleted both from the pipeline's schema and from the destination database.

If you only have one source or run with all your sources together, then this is practically like running the pipeline again for the first time.

:::caution
This erases schema history for the selected sources and only the latest version is stored.
:::

```py
import dlt

pipeline = dlt.pipeline("airtable_demo", destination="duckdb")
pipeline.run(airtable_emojis(), refresh="drop_sources")
```
In the example above, we instruct `dlt` to wipe the pipeline state belonging to the `airtable_emojis` source and drop all the database tables in `duckdb` to
which data was loaded. The `airtable_emojis` source has two resources named "📆 Schedule" and "💰 Budget" loading to the tables "_schedule" and "_budget". Here's
what `dlt` does step by step:
1. Collects a list of tables to drop by looking for all the tables in the schema that were created in the destination.
2. Removes the existing pipeline state associated with the `airtable_emojis` source.
3. Resets the schema associated with the `airtable_emojis` source.
4. Executes the `extract` and `normalize` steps. Those will create a fresh pipeline state and schema.
5. Before it executes the `load` step, the collected tables are dropped from the staging and regular datasets.
6. The `airtable_emojis` schema (associated with the source) is removed from the `_dlt_version` table.
7. Executes the `load` step as usual, so tables are re-created and a fresh schema and pipeline state are stored.
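
For reference, the examples in this section assume a source shaped roughly like the sketch below: two resources with emoji names that normalize to the "_schedule" and "_budget" tables. This is a hypothetical stand-in, not the actual `airtable_emojis` implementation (which reads from the Airtable API):

```py
import dlt

@dlt.source
def airtable_emojis():
    @dlt.resource(name="📆 Schedule")
    def schedule():
        # stand-in rows; the real source fetches records from Airtable
        yield [{"task": "make plans", "date": "2024-07-01"}]

    @dlt.resource(name="💰 Budget")
    def budget():
        yield [{"item": "venue", "cost": 500}]

    return schedule, budget
```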

### Selectively drop tables and resource state with `drop_resources`
Limits the refresh to the resources being processed in `pipeline.run` or `pipeline.extract` (e.g. by using `source.with_resources(...)`).
Tables belonging to those resources are dropped and their resource state is wiped (that includes incremental state).
The tables are deleted both from the pipeline's schema and from the destination database.

Source-level state keys are not deleted in this mode (i.e. `dlt.state()['<my_key>'] = '<my_value>'`).

:::caution
This erases schema history for all affected sources and only the latest schema version is stored.
:::

```py
import dlt

pipeline = dlt.pipeline("airtable_demo", destination="duckdb")
pipeline.run(airtable_emojis().with_resources("📆 Schedule"), refresh="drop_resources")
```
Above we request that the state associated with the "📆 Schedule" resource is reset and the table generated by it ("_schedule") is dropped. Other resources,
tables, and state are not affected. Please check `drop_sources` for a step-by-step description of what `dlt` does internally.
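
To illustrate the state distinction mentioned above, here is a hedged sketch using `dlt.current.resource_state()` and `dlt.current.source_state()`, one way to reach both state levels (the keys and values are made up):

```py
import dlt

@dlt.resource(name="📆 Schedule")
def schedule():
    # resource-level state: wiped by refresh="drop_resources"
    dlt.current.resource_state()["last_seen"] = "2024-07-01"
    # source-level state: survives refresh="drop_resources"
    dlt.current.source_state()["my_key"] = "my_value"
    yield [{"task": "make plans"}]
```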

### Selectively truncate tables and reset resource state with `drop_data`
Same as `drop_resources`, but instead of dropping tables from the schema, only the data is deleted from them (i.e. by `TRUNCATE <table_name>` in SQL destinations). Resource state for the selected resources is also wiped. In the case of [incremental resources](incremental-loading.md#incremental-loading-with-a-cursor-field) this will
reset the cursor state and fully reload the data from the `initial_value`.

The schema remains unmodified in this case.
```py
import dlt

pipeline = dlt.pipeline("airtable_demo", destination="duckdb")
pipeline.run(airtable_emojis().with_resources("📆 Schedule"), refresh="drop_data")
```
Above, the incremental state of "📆 Schedule" is reset before the `extract` step so the data is fully re-acquired. Just before the `load` step starts,
the "_schedule" table is truncated and the new (full) table data is inserted/copied.

## Display the loading progress