pyarrow/pandas: add load id and dlt id in the extract phase and unify the behavior #1317

@rudolfix

Description

Background
By default we do not add load_id and dlt_id to arrow tables. This must be enabled explicitly and happens in the normalizer.
As a consequence, we need to decompress and rewrite parquet files, which takes a lot of resources.
In this ticket we move this behavior to the extract phase. This goes against the general architecture, but I do not see any other way to do it without rewriting files.
We also unify the behavior by making the relational normalizer follow ItemsNormalizerConfiguration.

Implementation
We split this ticket into several PRs.
PR 1.

    • add load_id in the extract phase
    • make sure we do not clash with normalize, which also adds load_id (can we remove it from there?)
    • we (probably) do not need the logic that adds the columns when writing a file; we can just add them to the existing table
    • ItemsNormalizerConfiguration must be taken into account. This is probably a breaking change because we need to move it from normalize to extract, so old settings will stop working. Or maybe you'll find a clever solution here :)

PR 2. relational alignment

    • observe the configuration settings: do not add _dlt_load_id and _dlt_id if not configured. If nested tables are generated, fail, but provide a good explanation why
    • test what happens if the configuration changes when the table already exists. All those fields are non-null, so destinations should fail (though some of them still accept it). Leave such changes to the destination
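The "fail with a good explanation" requirement for nested tables can be sketched as a simple precondition check (the function and message are illustrative, not dlt's actual code):

```python
def check_nested_requires_dlt_id(add_dlt_id: bool, produces_nested: bool) -> None:
    # nested (child) tables link to their parents via _dlt_id / _dlt_parent_id,
    # so generating them with _dlt_id disabled must fail with a clear message
    if produces_nested and not add_dlt_id:
        raise ValueError(
            "Nested tables require _dlt_id to link child rows to their parents. "
            "Enable _dlt_id in the normalizer configuration or flatten the data."
        )

check_nested_requires_dlt_id(add_dlt_id=True, produces_nested=True)  # ok
```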

PR 3. arrow alignment
Fully unify the arrow, model and relational normalizers. This will also prepare dlt to generate nested (not json) data types in the future.

    • add dlt_id generation. Mind that we have a few ways to generate dlt_id, which are found in relational.py. The functions that decide on the type of key that is used are static, so you can extract them
    • when adding _dlt_id we must follow table settings and generate _dlt_id according to hints (e.g. for SCD2, look at how relational.py generates different hashes). We also have a fast method to generate content hashes, add_row_hash_to_table; it may be extended to hash only a subset of columns
    • observe "bring your own hash": if there is a column with the unique hint, do not add a (random) _dlt_id. If we have an SCD2-type hash (please see the SCD2 documentation on how to add it), we also skip it
    • when we add new columns from pyarrow we should also infer hints, like for any new columns. Currently schema settings will be ignored (see _infer_column, which must be modified to just infer hints). This, for example, happens in _compute_table (extract)
    • enable _dlt_load_id by default

Ideally we'd add _dlt_id already in the extract phase and also infer columns properly. Un-nesting may still happen in normalize (so we have a rewrite there).

PR 4. model alignment

    • make sure that names are normalized correctly and that collisions on column names are detected
    • like in the arrow normalizer: do not allow adding _dlt_id and _dlt_load_id to tables that have seen data (there is a util function for that). Warn the user and skip it
    • make sure that some model tests run with the common tests on CI
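Collision detection on normalized names means: two distinct source names that map to the same normalized identifier must be reported, not silently merged. A minimal sketch with a stand-in normalizer (dlt's real naming conventions are more involved):

```python
def detect_collisions(column_names, normalize=lambda s: s.strip().lower().replace(" ", "_")):
    # report pairs of distinct source names that normalize to the same identifier
    seen = {}
    collisions = []
    for name in column_names:
        norm = normalize(name)
        if norm in seen and seen[norm] != name:
            collisions.append((seen[norm], name, norm))
        seen.setdefault(norm, name)
    return collisions

print(detect_collisions(["User Id", "user_id"]))  # [('User Id', 'user_id', 'user_id')]
```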

PR 5. unnesting in arrow normalizer

Metadata

Labels: tech-debt (leftovers from previous work; should be fixed over time)
Status: Planned