Skip to content

Conversation

@rambleraptor
Copy link
Contributor

Part of #2261

Rationale for this change

This adds a PuffinWriter for writing deletion vectors.

Right now, it's just the writer class + some round trip tests (where we read + write the same file) to sanity check that the PuffinWriter works as expected. Writing Puffin files is very complex, so I wanted to make sure we all agreed on the writing semantics before using this elsewhere.

Let me know your thoughts on this (or if it's too granular)

Are these changes tested?

Unit tests included

Are there any user-facing changes?

@geruh geruh mentioned this pull request Dec 10, 2025
14 tasks
bitmaps: dict[int, BitMap] = {}
cardinality = 0
for pos in positions:
cardinality += 1
Copy link
Contributor

@geruh geruh Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cardinality could be incorrect with same positions passed in we can probably use the pyroaring stats to get this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I made the proper change, let me know if you're thinking differently.

@glesperance
Copy link

glesperance commented Dec 11, 2025

Hey @rambleraptor, I was working on a DV implementation before discovering this PR. Since review is already underway, I'd rather contribute here than duplicate effort.

I've added a Spark interoperability test: glesperance/iceberg-python@c25fe312

This verifies pyiceberg can read Spark-written DVs. Combined with your existing round-trip tests, this confirms format compatibility... ie if the same reader handles both, Spark can read ours too.

This may be redundant with your existing .bin fixture tests, though I believe those test the raw bitmap format rather than full Puffin DVs with the Java wrapper (length + magic + CRC). Let me know if I'm wrong on that.

Happy to PR to your fork if you think it's pertinent -or- feel free to cherry pick the commit as you see fit.

Verify pyiceberg's PuffinFile reader can parse deletion vectors written
by Spark. Uses coalesce(1) to force Spark to create DVs instead of COW.
@rambleraptor
Copy link
Contributor Author

@glesperance Thanks so much! I patched in your commit and I'll push it up along with my changes. Your name should appear in the commit log + PR. Let me know if you don't see it.

@rambleraptor
Copy link
Contributor Author

PR comments have been addressed.

@geruh it looks like your work on DeleteFileIndexes will be very useful for determing offsets + lengths on the blobs!

@rambleraptor rambleraptor requested review from ebyhr and geruh December 12, 2025 00:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants