Write Deletion Vectors #2822
base: main
Conversation
pyiceberg/table/puffin.py
Outdated
bitmaps: dict[int, BitMap] = {}
cardinality = 0
for pos in positions:
    cardinality += 1
cardinality could be incorrect if the same positions are passed in; we can probably use the pyroaring stats to get this.
I think I made the proper change; let me know if you're thinking differently.
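For illustration, here is a minimal sketch of the point above, assuming the `positions`/`bitmaps` layout from the snippet under review; `build_bitmaps` is a hypothetical helper, not code from this PR. Incrementing a counter per position over-counts duplicates, while the bitmaps deduplicate them, so the cardinality can be taken from the bitmaps themselves (pyroaring's `len()` returns the deduplicated cardinality, as does the stats API mentioned above).

```python
# Sketch only (not the PR's actual code): derive the cardinality from the
# bitmaps so that duplicate positions are not counted twice.
from pyroaring import BitMap


def build_bitmaps(positions):  # hypothetical helper for illustration
    # Assumed layout, mirroring the snippet under review: one 32-bit roaring
    # bitmap per high 32 bits of the 64-bit row position.
    bitmaps: dict[int, BitMap] = {}
    for pos in positions:
        bitmaps.setdefault(pos >> 32, BitMap()).add(pos & 0xFFFFFFFF)
    # len(BitMap) is the deduplicated cardinality, so a repeated position
    # contributes only once.
    cardinality = sum(len(bm) for bm in bitmaps.values())
    return bitmaps, cardinality


bitmaps, cardinality = build_bitmaps([1, 1, 2, (5 << 32) | 7])
assert cardinality == 3  # the duplicate position 1 is counted once
```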
Hey @rambleraptor, I was working on a DV implementation before discovering this PR. Since review is already underway, I'd rather contribute here than duplicate effort.

I've added a Spark interoperability test: glesperance/iceberg-python@c25fe312. It verifies that pyiceberg can read Spark-written DVs. Combined with your existing round-trip tests, this confirms format compatibility: if the same reader handles both, Spark can read ours too.

This may be redundant with your existing .bin fixture tests, though I believe those exercise the raw bitmap format rather than full Puffin DVs with the Java wrapper (length + magic + CRC). Let me know if I'm wrong on that. Happy to PR to your fork if you think it's pertinent, or feel free to cherry-pick the commit as you see fit.
Verify pyiceberg's PuffinFile reader can parse deletion vectors written by Spark. Uses coalesce(1) to force Spark to create DVs instead of COW.
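As a side note on the Java wrapper mentioned above, here is a hedged sketch of stripping that framing, based on my reading of the Iceberg v3 spec (4-byte big-endian length of magic + bitmap, magic bytes D1 D3 39 64, the serialized bitmap, then a 4-byte big-endian CRC-32 over magic + bitmap). `unwrap_dv_blob` is a hypothetical helper for illustration, not part of this PR, and the layout should be double-checked against the spec and the Java implementation.

```python
# Hedged sketch of stripping the deletion-vector framing (length + magic + CRC);
# layout per my reading of the Iceberg v3 spec, verify before relying on it.
import struct
import zlib

DV_MAGIC = bytes.fromhex("D1D33964")  # assumed 4-byte magic sequence


def unwrap_dv_blob(blob: bytes) -> bytes:
    """Return the serialized roaring bitmap inside a deletion-vector blob."""
    (length,) = struct.unpack(">i", blob[:4])      # length of magic + bitmap
    body = blob[4 : 4 + length]                    # magic bytes + serialized bitmap
    if body[:4] != DV_MAGIC:
        raise ValueError("unexpected deletion-vector magic bytes")
    (crc,) = struct.unpack(">I", blob[4 + length : 8 + length])
    if zlib.crc32(body) != crc:                    # CRC assumed to cover magic + bitmap
        raise ValueError("deletion-vector CRC mismatch")
    return body[4:]
```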
@glesperance Thanks so much! I patched in your commit and I'll push it up along with my changes. Your name should appear in the commit log + PR. Let me know if you don't see it.
PR comments have been addressed. @geruh it looks like your work on DeleteFileIndexes will be very useful for determining offsets + lengths on the blobs!
Part of #2261
Rationale for this change
This adds a PuffinWriter for writing deletion vectors.
Right now, it's just the writer class + some round trip tests (where we read + write the same file) to sanity check that the PuffinWriter works as expected. Writing Puffin files is very complex, so I wanted to make sure we all agreed on the writing semantics before using this elsewhere.
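To make the round-trip idea concrete, here is a sketch of such a test under assumed APIs: the PuffinWriter/PuffinFile class names and module path come from this PR, but every method name below is an illustrative placeholder rather than the actual interface.

```python
# Illustrative round-trip sketch only: the method names below are placeholders,
# not this PR's real PuffinWriter/PuffinFile API.
from pyroaring import BitMap

from pyiceberg.table.puffin import PuffinFile, PuffinWriter  # module path from this PR


def test_puffin_round_trip() -> None:
    deleted = BitMap([0, 3, 9, 12])

    # Write a Puffin file with one deletion-vector blob for a data file
    # (hypothetical writer API).
    writer = PuffinWriter()
    writer.add_deletion_vector("s3://bucket/data/file.parquet", deleted)
    puffin_bytes = writer.to_bytes()

    # Read the bytes back and check that the same positions come out
    # (hypothetical reader API).
    dvs = PuffinFile(puffin_bytes).deletion_vectors()
    assert set(dvs["s3://bucket/data/file.parquet"]) == {0, 3, 9, 12}
```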
Let me know your thoughts on this (or if it's too granular)
Are these changes tested?
Unit tests included
Are there any user-facing changes?