@steinitzu
Contributor

@steinitzu steinitzu commented Jun 7, 2024

Description

  • Add _dlt_load_id to arrow tables in extract step before writing file to disk.
  • Remove the load_id adding logic from normalize

Should be backwards compatible, aside from the edge case where you upgrade dlt between extract and normalize.

Will add/check tests

Related Issues

Step 1 of #1317

Additional Context

Comment on lines 232 to 236
# Inject the parts of normalize configuration that are used here
@with_config(
    spec=ItemsNormalizerConfiguration, sections=(known_sections.NORMALIZE, "parquet_normalizer")
)
def __init__(self, *args: Any, add_dlt_load_id: bool = False, **kwargs: Any) -> None:
Contributor Author

@steinitzu steinitzu Jun 7, 2024

This was my solution to support the old normalize config.
I think it makes sense to keep it and have it consistent with object normalizer config, rather than move to extract section.

Collaborator

Hmm, the idea is very good, but maybe you could decorate a method in this class rather than __init__, so you call it just to retrieve the config?
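
A minimal sketch of what this suggestion could look like, assuming a toy injection decorator. `inject_config`, `FAKE_CONFIG`, `ArrowExtractorSketch`, and `_retrieve_config` are hypothetical names used for illustration only; they stand in for dlt's `with_config` machinery and are not its actual API.

```python
from functools import wraps
from typing import Any, Callable, Dict

# Hypothetical stand-in for a resolved configuration section (not dlt's API).
FAKE_CONFIG: Dict[str, Any] = {"add_dlt_load_id": True}


def inject_config(func: Callable[..., Any]) -> Callable[..., Any]:
    """Toy decorator: fills missing keyword arguments from FAKE_CONFIG."""

    @wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        for key, value in FAKE_CONFIG.items():
            # Explicitly passed arguments win over injected values.
            kwargs.setdefault(key, value)
        return func(*args, **kwargs)

    return wrapper


class ArrowExtractorSketch:
    def __init__(self, *args: Any, **kwargs: Any) -> None:
        # __init__ keeps its plain signature; config is fetched separately.
        self.add_dlt_load_id = self._retrieve_config()

    @inject_config
    def _retrieve_config(self, add_dlt_load_id: bool = False) -> bool:
        # This method exists only to receive the injected value.
        return add_dlt_load_id
```

With this shape, the constructor signature stays free of configuration arguments, and the injected value is read through one small method.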

Collaborator

@rudolfix rudolfix left a comment

LGTM!

I think we're missing some tests. There is surely a test that checks whether load_id is added. However, we now need to test two things:

  1. adding load_id in extract, plus sensitivity to your config "hack": just pipeline.extract() an arrow table and check that the extracted parquet has load_id
  2. load some JSON data and request parquet in the normalize stage: this creates a parquet file in the normalizer, and we test whether load_id is added there

@steinitzu steinitzu force-pushed the write-load-id-in-extract branch from 536f0f2 to 706dce5 Compare June 13, 2024 17:09
Collaborator

@rudolfix rudolfix left a comment

@steinitzu thx for the tests!
still a few issues with normalizing column names. pls check

@rudolfix rudolfix marked this pull request as ready for review June 13, 2024 22:06
@steinitzu
Contributor Author

Updated with normalized identifiers :)

Collaborator

@rudolfix rudolfix left a comment

@steinitzu looks good! but new test is not passing. IMO some column mismatch?

@steinitzu
Contributor Author

@steinitzu looks good! but new test is not passing. IMO some column mismatch?

@rudolfix this is fixed by b1be9c9
The issue was that schema.update_table was adding the column at the front. I'm not sure if it's right to use that method here since it creates the table in schema and that should be left for after all columns are computed?
Added check for _dlt_load_id being last in common tests now too.
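
As a rough illustration of the ordering check mentioned above, here is a sketch that models a table's columns as an insertion-ordered dict. `with_load_id_last` is a hypothetical helper, and the hint keys (`data_type`, `nullable`) are illustrative; this is not dlt's schema API or the code from the commit.

```python
from typing import Any, Dict

LOAD_ID_COLUMN = "_dlt_load_id"

ColumnHints = Dict[str, Any]


def with_load_id_last(columns: Dict[str, ColumnHints]) -> Dict[str, ColumnHints]:
    """Return a copy of `columns` with the load id column as the final key."""
    # Drop any existing entry first so that re-adding it moves it to the end.
    ordered = {name: hints for name, hints in columns.items() if name != LOAD_ID_COLUMN}
    ordered[LOAD_ID_COLUMN] = {
        "name": LOAD_ID_COLUMN,
        "data_type": "text",  # illustrative hint values
        "nullable": False,
    }
    return ordered


columns = {
    "id": {"name": "id", "data_type": "bigint"},
    "value": {"name": "value", "data_type": "text"},
}
result = with_load_id_last(columns)
# The kind of assertion a test could make about column order.
assert list(result)[-1] == LOAD_ID_COLUMN
```

Python dicts preserve insertion order, so building the mapping with the load id entry added last is enough to keep it at the end.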

@rudolfix
Collaborator

This is really weird. I added one more test and the columns are added at the end. The current version is better, by the way: update_table is done at the end to modify the actual schema.

@rudolfix rudolfix merged commit b267c70 into devel Jun 18, 2024
@rudolfix rudolfix deleted the write-load-id-in-extract branch June 18, 2024 21:30