[Data] - Only return selected data columns in hive partitioned parquet files#60236
Conversation
Signed-off-by: Goutam <goutam@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request correctly fixes an issue where partition columns were being included in the output even when not explicitly selected by the user. The change in _get_partition_columns to return an empty list [] instead of None when no partition columns are available ensures that no partition columns are added to the data blocks, which is the correct behavior under projection. The new regression test in test_parquet.py effectively validates this fix. The changes are logical and well-tested. I have one minor suggestion to improve a comment's clarity for future maintainability.
Signed-off-by: Goutam <goutam@anyscale.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request correctly addresses an issue where partition columns were being incorrectly included when not selected. The change in _get_partition_columns to return [] instead of None for datasets without partition columns (when a projection is active) is the right fix to prevent unwanted partitions from being added.
The new test test_parquet_read_partitioned_excludes_unrequested_partition_columns is a valuable addition for ensuring that select_columns() correctly excludes partition columns. However, I've noted that this test doesn't cover the specific code path modified in this PR. I've left a comment with a suggestion for an additional test case to ensure the fix is fully covered.
Overall, the change is good, and with the additional test coverage, it will be even better.
Signed-off-by: Goutam <goutam@anyscale.com>
…t files (ray-project#60236) ## Description Returning `None` when you don't have partition_columns selects all the partitions which is not the right behavior. Returning `[]` when no partition columns are selected. ## Related issues Closes ray-project#60215 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Goutam <goutam@anyscale.com> Signed-off-by: jinbum-kim <jinbum9958@gmail.com>
…t files (ray-project#60236) ## Description Returning `None` when you don't have partition_columns selects all the partitions which is not the right behavior. Returning `[]` when no partition columns are selected. ## Related issues Closes ray-project#60215 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Goutam <goutam@anyscale.com> Signed-off-by: 400Ping <jiekaichang@apache.org>
…t files (ray-project#60236) ## Description Returning `None` when you don't have partition_columns selects all the partitions which is not the right behavior. Returning `[]` when no partition columns are selected. ## Related issues Closes ray-project#60215 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Goutam <goutam@anyscale.com>
…t files (ray-project#60236) ## Description Returning `None` when you don't have partition_columns selects all the partitions which is not the right behavior. Returning `[]` when no partition columns are selected. ## Related issues Closes ray-project#60215 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Goutam <goutam@anyscale.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
…t files (ray-project#60236) ## Description Returning `None` when you don't have partition_columns selects all the partitions which is not the right behavior. Returning `[]` when no partition columns are selected. ## Related issues Closes ray-project#60215 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Goutam <goutam@anyscale.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
Description
Returning
Nonewhen you don't have partition_columns selects all the partitions which is not the right behavior. Returning[]when no partition columns are selected.Related issues
Closes #60215
Additional information