[Data] - Only return selected data columns in hive partitioned parquet files #60236

Merged
bveeramani merged 5 commits into ray-project:master from goutamvenkat-anyscale:goutam/fix_pq_partition_bug on Jan 22, 2026

Conversation

@goutamvenkat-anyscale

Description

Returning `None` when no partition columns are selected causes all partition columns to be included, which is not the right behavior. This change returns `[]` instead when no partition columns are selected.
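
The difference between the two return values can be sketched in a few lines. This is a minimal illustration, not Ray's implementation: `resolve_partition_columns` and its parameters are hypothetical names, and the sketch assumes the convention described above, where `None` is read as "no restriction" and a list is read as an explicit selection.

```python
from typing import List, Optional


def resolve_partition_columns(
    available: List[str], selected: Optional[List[str]]
) -> List[str]:
    """Hypothetical helper: decide which partition columns to attach to a block.

    Assumed convention: ``None`` means "no restriction" (attach everything),
    while a list, including an empty one, means "attach exactly these".
    """
    if selected is None:
        return list(available)
    return [name for name in available if name in selected]


available = ["year", "month"]

# None is interpreted as "all partition columns", which is the unwanted
# outcome when the user asked for data columns only.
assert resolve_partition_columns(available, None) == ["year", "month"]

# An empty list is an explicit "no partition columns" selection.
assert resolve_partition_columns(available, []) == []
```

Under this convention, a caller that has nothing to select must pass the empty list; passing `None` silently widens the result.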

Related issues

Closes #60215

@goutamvenkat-anyscale requested a review from a team as a code owner on January 16, 2026, 19:57
@goutamvenkat-anyscale added the data, bug, and go labels on Jan 16, 2026

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly fixes an issue where partition columns were being included in the output even when not explicitly selected by the user. The change in _get_partition_columns to return an empty list [] instead of None when no partition columns are available ensures that no partition columns are added to the data blocks, which is the correct behavior under projection. The new regression test in test_parquet.py effectively validates this fix. The changes are logical and well-tested. I have one minor suggestion to improve a comment's clarity for future maintainability.

Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale
/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request correctly addresses an issue where partition columns were being included even when they were not selected. Changing _get_partition_columns to return [] instead of None for datasets without partition columns (when a projection is active) is the right fix to prevent unwanted partition columns from being added.

The new test test_parquet_read_partitioned_excludes_unrequested_partition_columns is a valuable addition for ensuring that select_columns() correctly excludes partition columns. However, I've noted that this test doesn't cover the specific code path modified in this PR. I've left a comment with a suggestion for an additional test case to ensure the fix is fully covered.

Overall, the change is good, and with the additional test coverage, it will be even better.
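
The shape of such a regression check can be sketched without Ray. The toy reader below is a stand-in under stated assumptions, not the code under review: `parse_hive_partitions` and `read_rows` are hypothetical helpers, and in-memory dictionaries take the place of Parquet files. It shows the property the test is meant to pin down, namely that a projection returns only the requested columns even when the paths carry hive-style `key=value` segments.

```python
from typing import Dict, List, Optional


def parse_hive_partitions(path: str) -> Dict[str, str]:
    """Extract key=value directory segments from a hive-style path."""
    partitions = {}
    for segment in path.split("/")[:-1]:  # skip the file name
        if "=" in segment:
            key, value = segment.split("=", 1)
            partitions[key] = value
    return partitions


def read_rows(
    files: Dict[str, List[dict]], columns: Optional[List[str]] = None
) -> List[dict]:
    """Hypothetical reader: merge partition values into rows, then project."""
    rows = []
    for path, file_rows in files.items():
        partitions = parse_hive_partitions(path)
        for row in file_rows:
            merged = {**row, **partitions}
            if columns is not None:
                # Keep only what was asked for, partition columns included.
                merged = {name: merged[name] for name in columns}
            rows.append(merged)
    return rows


files = {
    "data/year=2024/part-0.parquet": [{"id": 1, "value": 10}],
    "data/year=2025/part-0.parquet": [{"id": 2, "value": 20}],
}

# Selecting a single data column must not bring "year" along.
assert read_rows(files, columns=["value"]) == [{"value": 10}, {"value": 20}]

# Without a projection, partition columns are attached to every row.
assert read_rows(files)[0] == {"id": 1, "value": 10, "year": "2024"}
```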

Signed-off-by: Goutam <goutam@anyscale.com>
@goutamvenkat-anyscale changed the title from "[Data] - Only return selected columns in hive partitioned parquet files" to "[Data] - Only return selected data columns in hive partitioned parquet files" on Jan 16, 2026
@alexeykudinkin enabled auto-merge (squash) on January 20, 2026, 17:57
@github-actions bot disabled auto-merge on January 21, 2026, 22:41
@bveeramani merged commit 661f481 into ray-project:master on Jan 22, 2026
6 checks passed
@goutamvenkat-anyscale deleted the goutam/fix_pq_partition_bug branch on January 22, 2026, 19:38
jinbum-kim pushed a commit to jinbum-kim/ray that referenced this pull request Jan 29, 2026
400Ping pushed a commit to 400Ping/ray that referenced this pull request Feb 1, 2026
ryanaoleary pushed a commit to ryanaoleary/ray that referenced this pull request Feb 3, 2026
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
Development

Successfully merging this pull request may close these issues.

[Data] Incorrect handling of partition columns when using the columns argument to ray.data.read_parquet

3 participants