Commit 20f6b04

adds examples and step by step explanation for refresh modes (#1560)

1 file changed: docs/website/docs/general-usage/pipeline.md (+58 -19)

You can reset parts or all of your sources by using the `refresh` argument to `dlt.pipeline` or `pipeline.run`.
That means when you run the pipeline, the sources/resources being processed will have their state reset and their tables either dropped or truncated,
depending on which refresh mode is used.

The `refresh` option works with all relational/SQL destinations and file buckets (`filesystem`). It does not work with vector databases (we are working on that) or
with custom destinations.
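
Both call sites follow from the opening sentence above. A minimal sketch, assuming a hypothetical `my_source` source (the `refresh` parameter on `dlt.pipeline` itself is taken from that sentence, not shown elsewhere in this section):

```py
import dlt

# set once when the pipeline is created: the mode applies to its runs
pipeline = dlt.pipeline("my_pipeline", destination="duckdb", refresh="drop_sources")

# or set for a single run only
pipeline = dlt.pipeline("my_pipeline", destination="duckdb")
pipeline.run(my_source(), refresh="drop_sources")
```
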
The `refresh` argument should have one of the following string values to decide the refresh mode:
### Drop tables and pipeline state for a source with `drop_sources`
All sources being processed in `pipeline.run` or `pipeline.extract` are refreshed.
That means all tables listed in their schemas are dropped and the state belonging to those sources and all their resources is completely wiped.
The tables are deleted both from the pipeline's schema and from the destination database.

If you only have one source or run with all your sources together, then this is practically like running the pipeline again for the first time.

:::caution
This erases schema history for the selected sources and only the latest version is stored.
:::

```py
import dlt

pipeline = dlt.pipeline("airtable_demo", destination="duckdb")
pipeline.run(airtable_emojis(), refresh="drop_sources")
```
In the example above, we instruct `dlt` to wipe the pipeline state belonging to the `airtable_emojis` source and drop all the database tables in `duckdb` to
which data was loaded. The `airtable_emojis` source has two resources named "📆 Schedule" and "💰 Budget" loading to the tables "_schedule" and "_budget". Here's
what `dlt` does step by step:
1. Collects a list of tables to drop by looking for all the tables in the schema that were created in the destination.
2. Removes the existing pipeline state associated with the `airtable_emojis` source.
3. Resets the schema associated with the `airtable_emojis` source.
4. Executes the `extract` and `normalize` steps. Those will create a fresh pipeline state and schema.
5. Before it executes the `load` step, the collected tables are dropped from the staging and regular datasets.
6. The `airtable_emojis` schema (associated with the source) is removed from the `_dlt_version` table.
7. Executes the `load` step as usual, so tables are re-created and a fresh schema and pipeline state are stored.
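
For reference, the examples in this section assume a source shaped roughly like the sketch below: two resources with emoji names that normalize to the "_schedule" and "_budget" tables. This is a hypothetical stand-in, not the actual `airtable_emojis` implementation (which reads from the Airtable API):

```py
import dlt

@dlt.source
def airtable_emojis():
    @dlt.resource(name="📆 Schedule")
    def schedule():
        # stand-in rows; the real source fetches records from Airtable
        yield [{"task": "make plans", "date": "2024-07-01"}]

    @dlt.resource(name="💰 Budget")
    def budget():
        yield [{"item": "venue", "cost": 500}]

    return schedule, budget
```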

### Selectively drop tables and resource state with `drop_resources`
Limits the refresh to the resources being processed in `pipeline.run` or `pipeline.extract` (e.g. by using `source.with_resources(...)`).
Tables belonging to those resources are dropped and their resource state is wiped (that includes incremental state).
The tables are deleted both from the pipeline's schema and from the destination database.

Source-level state keys are not deleted in this mode (i.e. `dlt.state()['<my_key>'] = '<my_value>'`).

:::caution
This erases schema history for all affected sources and only the latest schema version is stored.
:::

```py
import dlt

pipeline = dlt.pipeline("airtable_demo", destination="duckdb")
pipeline.run(airtable_emojis().with_resources("📆 Schedule"), refresh="drop_resources")
```
Above we request that the state associated with the "📆 Schedule" resource is reset and the table generated by it ("_schedule") is dropped. Other resources,
tables, and state are not affected. Please check `drop_sources` for a step-by-step description of what `dlt` does internally.
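
To illustrate the state distinction mentioned above, here is a hedged sketch using `dlt.current.resource_state()` and `dlt.current.source_state()`, one way to reach both state levels (the keys and values are made up):

```py
import dlt

@dlt.resource(name="📆 Schedule")
def schedule():
    # resource-level state: wiped by refresh="drop_resources"
    dlt.current.resource_state()["last_seen"] = "2024-07-01"
    # source-level state: survives refresh="drop_resources"
    dlt.current.source_state()["my_key"] = "my_value"
    yield [{"task": "make plans"}]
```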

### Selectively truncate tables and reset resource state with `drop_data`
Same as `drop_resources`, but instead of dropping tables from the schema, only the data is deleted from them (i.e. by `TRUNCATE <table_name>` in SQL destinations). Resource state for the selected resources is also wiped. In the case of [incremental resources](incremental-loading.md#incremental-loading-with-a-cursor-field) this will
reset the cursor state and fully reload the data from the `initial_value`.

The schema remains unmodified in this case.
```py
import dlt

pipeline = dlt.pipeline("airtable_demo", destination="duckdb")
pipeline.run(airtable_emojis().with_resources("📆 Schedule"), refresh="drop_data")
```
Above, the incremental state of "📆 Schedule" is reset before the `extract` step so the data is fully re-acquired. Just before the `load` step starts,
the "_schedule" table is truncated and the new (full) table data is inserted/copied.

## Display the loading progress