Labels
bug (something that is supposed to be working, but isn't), community-backlog, data (Ray Data-related issues), regression, stability
Description
What happened + What you expected to happen
When using the `columns` argument of `ray.data.read_parquet` to exclude all of the partition columns, the resulting data incorrectly contains all of the partition columns. The data should contain only the columns specified.
Versions / Dependencies
The bug was introduced in ray==2.53.0
Python 3.10.15
Ubuntu 24.04.3
Python packages:
```
$ pip freeze
attrs==25.4.0
certifi==2026.1.4
charset-normalizer==3.4.4
click==8.3.1
filelock==3.20.3
idna==3.11
jsonschema==4.26.0
jsonschema-specifications==2025.9.1
msgpack==1.1.2
numpy==2.2.6
packaging==25.0
pandas==2.3.3
protobuf==6.33.4
pyarrow==22.0.0
python-dateutil==2.9.0.post0
pytz==2025.2
PyYAML==6.0.3
ray==2.53.0
referencing==0.37.0
requests==2.32.5
rpds-py==0.30.0
six==1.17.0
typing_extensions==4.15.0
tzdata==2025.3
urllib3==2.6.3
```
Reproduction script
```python
import tempfile

import pyarrow
import pyarrow.dataset

import ray

table = pyarrow.table(
    {
        "partition_column0": [1, 1, 3, 2, 2],
        "partition_column1": ["a", "a", "a", "a", "b"],
        "normal_column0": [10.5, 20.3, 15.7, 30.2, 25.8],
        "normal_column1": [130.5, 2670.3, 125.7, 370.2, 235.8],
    }
)
partition_columns = ["partition_column0", "partition_column1"]

with tempfile.TemporaryDirectory() as tmpdir:
    pyarrow.dataset.write_dataset(
        table,
        tmpdir,
        partitioning=partition_columns,
        partitioning_flavor="hive",
        format="parquet",
    )
    ray_dataset = ray.data.read_parquet(
        tmpdir,
        columns=["normal_column0"],
        partitioning=ray.data.datasource.partitioning.Partitioning("hive"),
    )
    print(ray_dataset.schema())
    print(ray_dataset.take_all())
```
which returns:

```
2026-01-16 16:07:07,242 INFO parquet_datasource.py:1048 -- Estimated parquet encoding ratio is 0.016.
2026-01-16 16:07:07,242 INFO parquet_datasource.py:1108 -- Estimated parquet reader batch size at 14913081 rows
Column          Type
------          ----
normal_column0  double
2026-01-16 16:07:07,593 INFO logging.py:397 -- Registered dataset logger for dataset dataset_0_0
2026-01-16 16:07:07,602 INFO streaming_executor.py:178 -- Starting execution of Dataset dataset_0_0. Full logs are in /tmp/ray/session_2026-01-16_16-07-05_312498_632975/logs/ray-data
2026-01-16 16:07:07,602 INFO streaming_executor.py:179 -- Execution plan of Dataset dataset_0_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet]
2026-01-16 16:07:07,608 INFO streaming_executor.py:686 -- [dataset]: A new progress UI is available. To enable, set `ray.data.DataContext.get_current().enable_rich_progress_bars = True` and `ray.data.DataContext.get_current().use_ray_tqdm = False`.
2026-01-16 16:07:07,608 WARNING resource_manager.py:136 -- ⚠️ Ray's object store is configured to use only 42.9% of available memory (28.3GiB out of 66.1GiB total). For optimal Ray Data performance, we recommend setting the object store to at least 50% of available memory. You can do this by setting the 'object_store_memory' parameter when calling ray.init() or by setting the RAY_DEFAULT_OBJECT_STORE_MEMORY_PROPORTION environment variable.
2026-01-16 16:07:07,989 INFO streaming_executor.py:304 -- ✔️ Dataset dataset_0_0 execution finished in 0.39 seconds
[{'normal_column0': 10.5, 'partition_column0': '1', 'partition_column1': 'a'}, {'normal_column0': 20.3, 'partition_column0': '1', 'partition_column1': 'a'}, {'normal_column0': 30.2, 'partition_column0': '2', 'partition_column1': 'a'}, {'normal_column0': 25.8, 'partition_column0': '2', 'partition_column1': 'b'}, {'normal_column0': 15.7, 'partition_column0': '3', 'partition_column1': 'a'}]
```
If you look at the output, you will see that `.schema()` correctly reports only `normal_column0`, but the data itself also contains both partition columns. If you set `columns` to include at least one of the partition columns, e.g. `columns=["partition_column1", "normal_column0"]`, the projection works correctly.
Issue Severity
Medium: It is a significant difficulty but I can work around it.