@anuunchin
Contributor

@anuunchin anuunchin commented Jul 3, 2025

Description

This PR adds the .gz extension to compressed files based on a new config disable_extension that is set to True by default.

In a nutshell:

  • is_compression_disabled() is removed and destination load jobs have access to BufferedDataWriterConfiguration.
  • Since the task involves getting rid of tricks like FileStorage.is_gzipped(), solely relying on BufferedDataWriterConfiguration is not sufficient when it comes to imported files, since they must not be compressed in any situation. For this reason, a new context class FileImportContext was introduced that tracks whether a file is an imported file and, thus, does not require compression. The is_compressed_file attribute of FileImportContext is then accessed by the duckdb and clickhouse load jobs to correctly set the compression type.

Related Issues

@anuunchin anuunchin self-assigned this Jul 3, 2025
@netlify

netlify bot commented Jul 3, 2025

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit 633bcd6
🔍 Latest deploy log https://app.netlify.com/projects/dlt-hub-docs/deploys/688a07dfd32516000813add9

@anuunchin anuunchin closed this Jul 3, 2025
@anuunchin anuunchin reopened this Jul 3, 2025
@anuunchin anuunchin force-pushed the fix/gzip branch 7 times, most recently from 8118936 to 41285c5 Compare July 4, 2025 13:44
@anuunchin anuunchin marked this pull request as ready for review July 7, 2025 07:48
@anuunchin anuunchin force-pushed the fix/gzip branch 6 times, most recently from 879abc7 to e95ebee Compare July 7, 2025 12:15
@anuunchin anuunchin requested a review from sh-rp July 8, 2025 08:20
Collaborator

@rudolfix rudolfix left a comment

the direction is good and the only complication is backward compat for the filesystem destination. I think @sh-rp wants to chip in here.

with this PR we can remove some of the weird things we do to discover if files are zipped.

is_compression_disabled() - this function should go away, and wherever it is used there's something wrong. we are better off using the extension to tell whether a file is zipped.

another so-so pattern, used for example in clickhouse.py:

if ext == "jsonl":
    compression = "gz" if FileStorage.is_gzipped(file_path) else "none"

here we actually probe the file to see if it is zipped

we remove all those tricks and just use extensions. the only exception is the filesystem destination, where we must decide how/if we do backward compatibility
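A minimal sketch of what "just use extensions" could look like, as opposed to probing file contents. The helper names `compression_from_extension` and `base_format` are hypothetical and are not part of dlt; the sketch only assumes that compressed load files end in `.gz`.

```python
from pathlib import PurePosixPath


def compression_from_extension(file_path: str) -> str:
    """Hypothetical helper: decide the compression from the file name alone."""
    # No file is opened here; only the final suffix is inspected.
    return "gz" if PurePosixPath(file_path).suffix == ".gz" else "none"


def base_format(file_path: str) -> str:
    """Hypothetical helper: return the file format with a trailing .gz ignored."""
    suffixes = [s.lstrip(".") for s in PurePosixPath(file_path).suffixes]
    if suffixes and suffixes[-1] == "gz":
        suffixes = suffixes[:-1]
    return suffixes[-1] if suffixes else ""


if __name__ == "__main__":
    for name in ("items.1234.jsonl.gz", "items.1234.jsonl"):
        print(name, base_format(name), compression_from_extension(name))
```

With this approach the load job never reads the file to find out how it was written, which is the point of the comment above.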

@sh-rp
Collaborator

sh-rp commented Jul 9, 2025

I did not read the PR, but just from a user / compatibility perspective, I think dlt should behave the exact same way with regards to adding these extensions after updating if you do not change any settings. This means:

  • adding the gzip extension must be configurable
  • This setting must be set to off by default
  • It would be nicer to have it switched on by default, but then we would need to make this change in dlt 2.0 and mark it as breaking

@anuunchin anuunchin force-pushed the fix/gzip branch 4 times, most recently from 756528a to 04eff17 Compare July 11, 2025 10:53
Collaborator

@sh-rp sh-rp left a comment

Thanks for the updates, it is already looking good :) I have a bunch of change requests aimed at making it clearer what is going on, with better naming and less duplicated code.

About maintaining backwards compatibility: I think the decision was to not have this setting but rather to maintain a version marker in the init file of the filesystem destination, always add this extension for new filesystem datasets, and maintain the old behavior for existing ones. Right now the init marker is just an empty file. Going forward, it should contain something like "{'version': 1}" or similar. Depending on what it says in this file, the extensions will be added or not. I'll add details later today.

@sh-rp
Collaborator

sh-rp commented Jul 17, 2025

After more discussion, I think we will need to maintain a filesystem versioning. So the proposal is this:

  • Remove this flag that governs whether to add the extension or not
  • Version the filesystem and keep the version in the base init file. If this file is empty, we assume its content to be {'version': 1}. If it does not exist on initialize_storage, we set it to {'version': CURRENT_VERSION}, where CURRENT_VERSION is a constant equal to 2. If we encounter anything else in this file (non-empty, and parsing it does not yield version 1 or 2), we raise.
  • An initialized filesystem instance should always know its current version. For now we do not do any migrations or anything like that: older datasets will have version 1, newer ones have version 2.
  • datasets with version 1 will continue to save the files without this extension.
  • datasets with version 2 will add this extension
  • We need to test this behavior with an older dlt version
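
The proposal above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code that was merged: the names `resolve_storage_version` and `load_file_name` are hypothetical, and the init file is assumed to hold JSON (the thread writes the content with single quotes, which would not parse as strict JSON).

```python
import json
from typing import Optional

CURRENT_VERSION = 2          # constant named in the proposal
SUPPORTED_VERSIONS = (1, 2)  # anything else makes us raise


def resolve_storage_version(init_file_content: Optional[str]) -> int:
    """Hypothetical helper: map the init file content to a storage version.

    None means the init file does not exist yet (a new dataset).
    """
    if init_file_content is None:
        # New dataset: the caller would write {"version": CURRENT_VERSION}.
        return CURRENT_VERSION
    if init_file_content.strip() == "":
        # Empty marker written by older dlt versions.
        return 1
    try:
        version = json.loads(init_file_content).get("version")
    except (json.JSONDecodeError, AttributeError) as exc:
        raise ValueError("init file content cannot be parsed") from exc
    if type(version) is not int or version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported storage version: {version!r}")
    return version


def load_file_name(base_name: str, version: int, compressed: bool) -> str:
    """Hypothetical helper: version 2 adds .gz to compressed files, version 1 does not."""
    if version >= 2 and compressed:
        return base_name + ".gz"
    return base_name
```

No migration is attempted in this sketch, matching the proposal: a dataset keeps whatever version its init file reports.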

@anuunchin anuunchin force-pushed the fix/gzip branch 4 times, most recently from ba2cf07 to f2bdc37 Compare July 22, 2025 14:53
@anuunchin anuunchin marked this pull request as ready for review July 22, 2025 14:53
Comment on lines 222 to 230
if (
    table_name not in dlt_table_names
    and self.remote_client.get_storage_version() == 1
    and not is_compression_disabled()
):
    from_statement = from_statement[:-1] + ", compression = 'gzip')"

Contributor Author

@anuunchin anuunchin Jul 22, 2025

We still use this is_compression_disabled() trick because solely knowing the storage version is not enough as the files inside it may or may not be compressed. Without this if condition, the duckdb reader won't work with filesystem datasets of version 1.

PS: I kept this as isolated as possible, so that we can maybe remove it in the future 👀
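
A standalone sketch of the string manipulation in the condition discussed here, assuming `from_statement` ends with the closing parenthesis of a reader call such as DuckDB's `read_json(...)`. The function name `with_gzip_hint` and its parameters are hypothetical; in the PR the values come from the client and from `is_compression_disabled()`.

```python
from typing import Set


def with_gzip_hint(
    from_statement: str,
    table_name: str,
    dlt_table_names: Set[str],
    storage_version: int,
    compression_disabled: bool,
) -> str:
    """Hypothetical helper: add an explicit gzip hint for version 1 datasets."""
    if (
        table_name not in dlt_table_names
        and storage_version == 1
        and not compression_disabled
    ):
        # Drop the trailing ")" and re-close the call with the extra argument.
        return from_statement[:-1] + ", compression = 'gzip')"
    return from_statement


if __name__ == "__main__":
    stmt = "read_json('data/items/*.jsonl')"
    print(with_gzip_hint(stmt, "items", {"_dlt_loads"}, 1, False))
```

Version 1 files carry no `.gz` extension, so the reader cannot infer the compression from the name and has to be told explicitly.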

Collaborator

kudos for spotting this

Collaborator

@sh-rp sh-rp left a comment

Very cool! Just a few minor changes and cleanups :)

@anuunchin anuunchin force-pushed the fix/gzip branch 4 times, most recently from 57ed81b to 771e978 Compare July 27, 2025 12:57
Collaborator

@rudolfix rudolfix left a comment

this still needs some work.

also AFAIK we are missing a test where we import both a compressed and an uncompressed file and then check that the extension was correctly added (i.e. all the way up to filesystem storage)

@anuunchin anuunchin force-pushed the fix/gzip branch 2 times, most recently from 74b6b7f to 864bd2f Compare July 30, 2025 09:04
Collaborator

@rudolfix rudolfix left a comment

LGTM!

@rudolfix rudolfix merged commit 007953a into devel Aug 4, 2025
115 of 118 checks passed
@rudolfix rudolfix deleted the fix/gzip branch August 4, 2025 10:10
AyushPatel101 pushed a commit to AyushPatel101/dlt that referenced this pull request Aug 8, 2025
* enable_gz_extension added to client configs

--amend

* docs added

* Unnecessary flag removed, fs storage versioning added

* Redundancies removed, storage version cached

* Test for imported files improved

* Initial version stored separately

Development

Successfully merging this pull request may close these issues.

Save compressed load files with .gz extension

4 participants