
[Data] Add include_row_hash to read_parquet#61408

Draft
wingkitlee0 wants to merge 5 commits into ray-project:master from wingkitlee0:kit/read-row-hash

Conversation

@wingkitlee0 (Contributor) commented Mar 1, 2026

Description

This PR adds an include_row_hash option to read_parquet.

The row hash is unique for each row across the whole dataset, so it can be used for checkpointing (for Ray Data and/or Ray Train pipelines).

Related issues

Closes #61410

Additional information

How it works:

  1. Path seed: for each Parquet file, MD5-hash its file path and take the first 8 bytes as a uint64 seed. This means identical data in different files always produces different hashes.
  2. Row keys: add the row's position within the file (a 0-based offset tracked across batches) to the path seed: key = path_seed + row_index.
  3. Mix: apply the splitmix64 finalizer (a bijective 64-bit integer mixing function) to scatter nearby keys across the full int64 range:

     keys ^= keys >> 30
     keys *= 0xBF58476D1CE4E5B9
     keys ^= keys >> 27
     keys *= 0x94D049BB133111EB
     keys ^= keys >> 31

All operations are vectorized NumPy; no Python loops.

Properties:
• Reproducible: the same file path and row position always yield the same hash.
• Unique: different files get different seeds (via MD5 of the path), and different rows within a file get different offsets. The splitmix64 finalizer is bijective, so distinct inputs never collide.
• Fast: one MD5 call per file, then pure vectorized NumPy arithmetic per batch.
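The scheme described above can be sketched in plain NumPy. This is a minimal illustration of the three steps, not the PR's actual internals; the function name `row_hashes` and its signature are made up here for clarity:

```python
import hashlib

import numpy as np


def row_hashes(file_path: str, start: int, num_rows: int) -> np.ndarray:
    """Illustrative sketch: hash rows [start, start + num_rows) of one file."""
    # 1. Path seed: first 8 bytes of the MD5 digest of the file path, as uint64.
    digest = hashlib.md5(file_path.encode("utf-8")).digest()
    path_seed = np.frombuffer(digest[:8], dtype=np.uint64)[0]

    # 2. Row keys: path seed plus the 0-based row offset within the file.
    #    (In a batched reader, `start` would be the offset carried across batches.)
    keys = path_seed + np.arange(start, start + num_rows, dtype=np.uint64)

    # 3. Mix: splitmix64 finalizer. All arithmetic wraps modulo 2**64,
    #    which is exactly what the finalizer expects.
    keys ^= keys >> np.uint64(30)
    keys *= np.uint64(0xBF58476D1CE4E5B9)
    keys ^= keys >> np.uint64(27)
    keys *= np.uint64(0x94D049BB133111EB)
    keys ^= keys >> np.uint64(31)

    # Reinterpret the bits as int64 for the output column.
    return keys.view(np.int64)
```

Because the finalizer is bijective on 64-bit integers, distinct (seed + offset) keys map to distinct hashes, and computing a later batch with the correct `start` offset reproduces the same values as hashing the file in one pass.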

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a useful include_row_hash option to read_parquet, which is valuable for checkpointing and data versioning. The implementation is generally solid and consistent with existing features like include_paths. However, I've identified a critical bug that can cause a crash when include_row_hash=True is used on a file that already contains a row_hash column, particularly when no specific columns are selected for reading. I've provided details and a suggested fix for this issue. Additionally, I've included a few medium-severity suggestions to improve user experience by adding a warning for column name conflicts, updating the documentation to clarify this behavior, and enhancing test coverage for this edge case.

Signed-off-by: Kit Lee <7000003+wingkitlee0@users.noreply.github.com>
