[Data] Incorrect handling of partition columns when using the columns argument to ray.data.read_parquet #60215

@Tom-Newton

Description

What happened + What you expected to happen

When using the columns argument of ray.data.read_parquet to exclude all of the partition columns, the resulting data incorrectly contains all the partition columns. The data should contain only the columns specified.

Versions / Dependencies

The bug was introduced in ray==2.53.0.

Python 3.10.15
Ubuntu 24.04.3
Python packages:

$ pip freeze
attrs==25.4.0
certifi==2026.1.4
charset-normalizer==3.4.4
click==8.3.1
filelock==3.20.3
idna==3.11
jsonschema==4.26.0
jsonschema-specifications==2025.9.1
msgpack==1.1.2
numpy==2.2.6
packaging==25.0
pandas==2.3.3
protobuf==6.33.4
pyarrow==22.0.0
python-dateutil==2.9.0.post0
pytz==2025.2
PyYAML==6.0.3
ray==2.53.0
referencing==0.37.0
requests==2.32.5
rpds-py==0.30.0
six==1.17.0
typing_extensions==4.15.0
tzdata==2025.3
urllib3==2.6.3

Reproduction script

import tempfile

import pyarrow
import pyarrow.dataset

import ray

table = pyarrow.table(
    {
        "partition_column0": [1, 1, 3, 2, 2],
        "partition_column1": ["a", "a", "a", "a", "b"],
        "normal_column0": [10.5, 20.3, 15.7, 30.2, 25.8],
        "normal_column1": [130.5, 2670.3, 125.7, 370.2, 235.8],
    }
)

partition_columns = ["partition_column0", "partition_column1"]

with tempfile.TemporaryDirectory() as tmpdir:
    pyarrow.dataset.write_dataset(
        table,
        tmpdir,
        partitioning=partition_columns,
        partitioning_flavor="hive",
        format="parquet",
    )
    ray_dataset = ray.data.read_parquet(
        tmpdir,
        columns=["normal_column0"],
        partitioning=ray.data.datasource.partitioning.Partitioning("hive"),
    )
    print(ray_dataset.schema())
    print(ray_dataset.take_all())

which outputs

2026-01-16 16:07:07,242 INFO parquet_datasource.py:1048 -- Estimated parquet encoding ratio is 0.016.
2026-01-16 16:07:07,242 INFO parquet_datasource.py:1108 -- Estimated parquet reader batch size at 14913081 rows
Column          Type
------          ----
normal_column0  double
2026-01-16 16:07:07,593 INFO logging.py:397 -- Registered dataset logger for dataset dataset_0_0
2026-01-16 16:07:07,602 INFO streaming_executor.py:178 -- Starting execution of Dataset dataset_0_0. Full logs are in /tmp/ray/session_2026-01-16_16-07-05_312498_632975/logs/ray-data
2026-01-16 16:07:07,602 INFO streaming_executor.py:179 -- Execution plan of Dataset dataset_0_0: InputDataBuffer[Input] -> TaskPoolMapOperator[ReadParquet]
2026-01-16 16:07:07,608 INFO streaming_executor.py:686 -- [dataset]: A new progress UI is available. To enable, set `ray.data.DataContext.get_current().enable_rich_progress_bars = True` and `ray.data.DataContext.get_current().use_ray_tqdm = False`.
2026-01-16 16:07:07,608 WARNING resource_manager.py:136 -- ⚠️  Ray's object store is configured to use only 42.9% of available memory (28.3GiB out of 66.1GiB total). For optimal Ray Data performance, we recommend setting the object store to at least 50% of available memory. You can do this by setting the 'object_store_memory' parameter when calling ray.init() or by setting the RAY_DEFAULT_OBJECT_STORE_MEMORY_PROPORTION environment variable.
2026-01-16 16:07:07,989 INFO streaming_executor.py:304 -- ✔️  Dataset dataset_0_0 execution finished in 0.39 seconds
[{'normal_column0': 10.5, 'partition_column0': '1', 'partition_column1': 'a'}, {'normal_column0': 20.3, 'partition_column0': '1', 'partition_column1': 'a'}, {'normal_column0': 30.2, 'partition_column0': '2', 'partition_column1': 'a'}, {'normal_column0': 25.8, 'partition_column0': '2', 'partition_column1': 'b'}, {'normal_column0': 15.7, 'partition_column0': '3', 'partition_column1': 'a'}]

If you look at the output, you will see that .schema() correctly returns only normal_column0, but the data also contains both partition columns. If you set columns to include one of the partition columns, e.g. columns=["partition_column1", "normal_column0"], then it works correctly.
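A possible workaround until this is fixed is to drop the leaked partition columns after the read. The sketch below illustrates the pruning step on plain row dicts; the rows are hard-coded to stand in for the take_all() output shown above, so the sketch does not depend on a Ray installation:

```python
# Rows as returned by ray_dataset.take_all() in the buggy case: the read was
# asked for only "normal_column0", but the partition columns leaked through.
rows = [
    {"normal_column0": 10.5, "partition_column0": "1", "partition_column1": "a"},
    {"normal_column0": 20.3, "partition_column0": "1", "partition_column1": "a"},
    {"normal_column0": 30.2, "partition_column0": "2", "partition_column1": "a"},
]

requested_columns = ["normal_column0"]

# Keep only the columns that were actually requested from read_parquet.
pruned = [{name: row[name] for name in requested_columns} for row in rows]

print(pruned)
```

In a real pipeline the same pruning can be applied lazily before materializing, e.g. with ray_dataset.select_columns(requested_columns), which avoids touching the individual row dicts.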

Issue Severity

Medium: It is a significant difficulty but I can work around it.
