-
Notifications
You must be signed in to change notification settings - Fork 415
Fix: saving compressed load files with .gz extension #2835
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
✅ Deploy Preview for dlt-hub-docs canceled.
|
8118936 to
41285c5
Compare
879abc7 to
e95ebee
Compare
rudolfix
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the direction is good and the only complication is backward compat for filesystem destination. I think @sh-rp wants to chip-in here.
with this PR we can remove some of weird things we do to discover if files are zipped.
is_compression_disabled() - this function should go away and whenever it is used there's something wrong. we are better if we use extension to guess if file is zipped.
another so so pattern used for example in clikckhouse.py
if ext == "jsonl":
compression = "gz" if FileStorage.is_gzipped(file_path) else "none"
we actually probe the file to see if it is zipped
we remove all those tricks and just use extensiosn. the only place is filesystem destination where we must decide how/if we do backward compatibility
|
I did not read the PR, but just from a user / compatibility perspective, I think dlt should behave the exact same way with regards to adding these extensions after updating if you do not change any settings. This means:
|
756528a to
04eff17
Compare
sh-rp
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates, it is already looking good :) I have a bunch of changes that are related to making it clearer what is going on with better naming and less double code.
About maintaining backwards compatibility: I think the decision was to not have this setting but rather to maintain a version marker in the init file of the filesystem destination, and always add this extension for new filesystem datasets and maintain the old behavior for existing ones. Right now the init marker is just an empty file. Going forward, it should contain something l like "{'version': 1}" or similar. Depending on what it says in this file, the extensions will be added or not. I'll add details later today.
|
After more discussion, I think we will need to maintain a filesystem versioning. So the proposal is this:
|
ba2cf07 to
f2bdc37
Compare
| if ( | ||
| table_name not in dlt_table_names | ||
| and self.remote_client.get_storage_version() == 1 | ||
| and not is_compression_disabled() | ||
| ): | ||
| from_statement = from_statement[:-1] + ", compression = 'gzip')" | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still use this is_compression_disabled() trick because solely knowing the storage version is not enough as the files inside it may or may not be compressed. Without this if condition, the duckdb reader won't work with filesystem datasets of version 1.
PS: I put this as separately as possible, so that we can maybe remove this in the future 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kudos for spotting this
sh-rp
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool! Just a few minor changes and cleanups :)
57ed81b to
771e978
Compare
rudolfix
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this still need some work.
also AFAIK we are missing a test where we import both compressed and uncompressed file and then check if extensions was correctly added (ie. up to filesystem storage)
| if ( | ||
| table_name not in dlt_table_names | ||
| and self.remote_client.get_storage_version() == 1 | ||
| and not is_compression_disabled() | ||
| ): | ||
| from_statement = from_statement[:-1] + ", compression = 'gzip')" | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kudos for spotting this
74b6b7f to
864bd2f
Compare
rudolfix
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
* enable_gz_extension added to client configs --amend * docs added * Unnecessary flag removed, fs storage versioning added * Redundancies removed, storage version cached * Test for imported files improved * Initial version stored separately
Description
This PR adds the
.gzextension to compressed files based on a new configdisable_extensionthat is set toTrueby default.In a nutshell:
is_compression_disabled()is removed and destination load jobs have access toBufferedDataWriterConfiguration.FileStorage.is_gzipped(), solely relying onBufferedDataWriterConfigurationis not sufficient when it comes to imported files, since they must not be compressed in any situation. For this reason, a new context classFileImportContextwas introduced that tracks whether a file is an imported file and, thus, does not require compression. Theis_compressed_fileattribute ofFileImportContextis then accessed by the duckdb and clickhouse load jobs to correctly set the compression type.Related Issues
.gzextension #925