[Data] Add include_row_hash to read_parquet #61408
wingkitlee0 wants to merge 5 commits into ray-project:master
Conversation
Code Review
This pull request introduces a useful include_row_hash option to read_parquet, which is valuable for checkpointing and data versioning. The implementation is generally solid and consistent with existing features like include_paths. However, I've identified a critical bug that can cause a crash when include_row_hash=True is used on a file that already contains a row_hash column, particularly when no specific columns are selected for reading. I've provided details and a suggested fix for this issue. Additionally, I've included a few medium-severity suggestions to improve user experience by adding a warning for column name conflicts, updating the documentation to clarify this behavior, and enhancing test coverage for this edge case.
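The column-name conflict the review flags can be avoided with a simple guard before appending the hash column. A minimal pure-Python sketch of the suggested behavior (warn and overwrite rather than crash); the function name and dict-based batch here are illustrative, not Ray's actual internals:

```python
import warnings


def add_row_hash_column(batch: dict, hashes, column_name: str = "row_hash") -> dict:
    # If the file already contains a column with the reserved name,
    # warn and overwrite it instead of failing on a duplicate column.
    if column_name in batch:
        warnings.warn(
            f"Column {column_name!r} already exists in the file and will be "
            "overwritten because include_row_hash=True."
        )
    # Copy so the caller's batch is left untouched.
    batch = dict(batch)
    batch[column_name] = hashes
    return batch
```

The same check could instead raise an error; warning-and-overwrite matches what the review suggests for user experience.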
Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>
Description
This PR adds an `include_row_hash` option to `read_parquet`. The row hash is unique for each row across the whole dataset, which makes it useful for checkpointing (for Ray Data and/or Ray Train pipelines).
Related issues
Closes #61410
Additional information
How it works:
• Each file path is hashed (one MD5 call per file) into a 64-bit seed, so identical data in different files always produces different hashes.
• The hash input for each row is path_seed + row_index.
• A splitmix64 finalizer scrambles that input across the full int64 range.
• All operations are vectorized numpy — no Python loops.
Properties:
• Reproducible: Same file path + same row position always yields the same hash.
• Unique: Different files get different seeds (via MD5 of path); different rows within a file get different offsets. The splitmix64 finalizer is bijective, so distinct inputs never collide.
• Fast: One MD5 call per file, then pure numpy vectorized arithmetic per batch.
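The scheme described above can be sketched as follows. This is not the PR's actual code — the function names are illustrative, and the standard splitmix64 constants are assumed — but it shows the shape of the computation: one MD5 call per file, then vectorized numpy arithmetic per batch.

```python
import hashlib

import numpy as np


def _path_seed(path: str) -> np.uint64:
    # One MD5 call per file: fold the first 8 digest bytes into a 64-bit seed.
    digest = hashlib.md5(path.encode("utf-8")).digest()
    return np.uint64(int.from_bytes(digest[:8], "little"))


def _splitmix64(x: np.ndarray) -> np.ndarray:
    # splitmix64 finalizer: a bijective scramble over the uint64 space,
    # so distinct inputs within a file never collide.
    x = x + np.uint64(0x9E3779B97F4A7C15)
    x = (x ^ (x >> np.uint64(30))) * np.uint64(0xBF58476D1CE4E5B9)
    x = (x ^ (x >> np.uint64(27))) * np.uint64(0x94D049BB133111EB)
    return x ^ (x >> np.uint64(31))


def row_hashes(path: str, start_row: int, num_rows: int) -> np.ndarray:
    # Hash input is path_seed + row_index; all-array ops, no Python loops.
    seed = _path_seed(path)
    idx = np.arange(start_row, start_row + num_rows, dtype=np.uint64)
    return _splitmix64(seed + idx).astype(np.int64)
```

Reproducibility falls out directly: the same path and the same row position always produce the same inputs, and the finalizer is deterministic.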