
[Data] Add include_row_hash to read_parquet#61408

Draft
wingkitlee0 wants to merge 5 commits into ray-project:master from wingkitlee0:kit/read-row-hash

Conversation

@wingkitlee0 (Contributor) commented Mar 1, 2026

Description

This PR adds an include_row_hash option to read_parquet.

The row hash is unique for each row across the whole dataset, so it can be used for checkpointing (for Ray Data and/or Ray Train pipelines).

Related issues

Closes #61410

Additional information

How it works:

  1. Path seed: for each Parquet file, MD5-hash its file path and take the first 8 bytes as a uint64 seed. This means identical data in different files always produces different hashes.
  2. Row keys: add the row's position within the file (a 0-based offset tracked across batches) to the path seed: key = path_seed + row_index.
  3. Mix: apply the splitmix64 finalizer (a bijective 64-bit integer mixing function) to scatter nearby keys across the full int64 range:

     keys ^= keys >> 30
     keys *= 0xBF58476D1CE4E5B9
     keys ^= keys >> 27
     keys *= 0x94D049BB133111EB
     keys ^= keys >> 31

All operations are vectorized NumPy; no Python loops.

Properties:
• Reproducible: the same file path and row position always yield the same hash.
• Unique: different files get different seeds (via MD5 of the path), and different rows within a file get different offsets. The splitmix64 finalizer is bijective, so distinct inputs never collide.
• Fast: one MD5 call per file, then pure vectorized NumPy arithmetic per batch.
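The scheme described above can be sketched in plain NumPy. This is a minimal illustration of the three steps, not the PR's actual internals; the function name `row_hashes` and its signature are made up here for clarity:

```python
import hashlib

import numpy as np


def row_hashes(file_path: str, start: int, num_rows: int) -> np.ndarray:
    """Illustrative sketch: hash rows [start, start + num_rows) of one file."""
    # 1. Path seed: first 8 bytes of the MD5 digest of the file path, as uint64.
    digest = hashlib.md5(file_path.encode("utf-8")).digest()
    path_seed = np.frombuffer(digest[:8], dtype=np.uint64)[0]

    # 2. Row keys: path seed plus the 0-based row offset within the file.
    #    (In a batched reader, `start` would be the offset carried across batches.)
    keys = path_seed + np.arange(start, start + num_rows, dtype=np.uint64)

    # 3. Mix: splitmix64 finalizer. All arithmetic wraps modulo 2**64,
    #    which is exactly what the finalizer expects.
    keys ^= keys >> np.uint64(30)
    keys *= np.uint64(0xBF58476D1CE4E5B9)
    keys ^= keys >> np.uint64(27)
    keys *= np.uint64(0x94D049BB133111EB)
    keys ^= keys >> np.uint64(31)

    # Reinterpret the bits as int64 for the output column.
    return keys.view(np.int64)
```

Because the finalizer is bijective on 64-bit integers, distinct (seed + offset) keys map to distinct hashes, and computing a later batch with the correct `start` offset reproduces the same values as hashing the file in one pass.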

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a useful include_row_hash option to read_parquet, which is valuable for checkpointing and data versioning. The implementation is generally solid and consistent with existing features like include_paths. However, I've identified a critical bug that can cause a crash when include_row_hash=True is used on a file that already contains a row_hash column, particularly when no specific columns are selected for reading. I've provided details and a suggested fix for this issue. Additionally, I've included a few medium-severity suggestions to improve user experience by adding a warning for column name conflicts, updating the documentation to clarify this behavior, and enhancing test coverage for this edge case.

Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>
