[Data] Fixed ParquetDatasource encoding ratio estimation #56268
alexeykudinkin merged 41 commits into master
Conversation
Code Review
This pull request refactors the encoding ratio estimation in ParquetDatasource to be based on actual file sizes rather than Parquet metadata, which improves accuracy. This is achieved by introducing a _ParquetFragment wrapper to carry file size information. The change also includes a significant cleanup by removing the ParquetMetadataProvider and its associated abstractions, leading to simpler and more direct code. My review identifies a couple of critical issues where the new sampling logic doesn't correctly handle empty Parquet files, which could lead to runtime errors. I've also included some suggestions to improve type hints for better code clarity and maintainability.
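As a rough illustration of the wrapper described above, the sketch below pairs a fragment with its listed file size. The exact fields of `_ParquetFragment` aren't shown in this excerpt, so treat the names here as assumptions.

```python
from dataclasses import dataclass

import pyarrow.dataset as pads


@dataclass
class _ParquetFragment:
    # Sketch only: carries the listed on-disk size alongside the PyArrow
    # fragment so the encoding ratio can relate in-memory size to actual
    # file size. Field names are assumed, not taken from the PR.
    fragment: "pads.ParquetFileFragment"
    file_size_bytes: int
```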
    def deserialize(self) -> "ParquetFileFragment":
        # Implicitly trigger S3 subsystem initialization by importing
        # pyarrow.fs.
        import pyarrow.fs  # noqa: F401

        (file_format, path, filesystem, partition_expression) = cloudpickle.loads(
            self._data
        )
        return file_format.make_fragment(path, filesystem, partition_expression)


# Visible for test mocking.
def _deserialize_fragments(
    serialized_fragments: List[_NoIOSerializableFragmentWrapper],
) -> List["pyarrow._dataset.ParquetFileFragment"]:
    return [p.deserialize() for p in serialized_fragments]
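The excerpt above only shows the deserializing half. A minimal serializing counterpart consistent with `deserialize` might look like the following sketch; the real constructor in the PR may differ.

```python
import cloudpickle


class _NoIOSerializableFragmentWrapper:
    def __init__(self, fragment: "pyarrow._dataset.ParquetFileFragment"):
        # Pickle only what's needed to recreate the fragment later, so no
        # I/O (e.g. re-reading Parquet metadata) happens at pickle time.
        self._data = cloudpickle.dumps(
            (
                fragment.format,
                fragment.path,
                fragment.filesystem,
                fragment.partition_expression,
            )
        )
```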
    try:
        prefetch_remote_args = {}
        prefetch_remote_args["num_cpus"] = NUM_CPUS_FOR_META_FETCH_TASK
        if self._local_scheduling:
            prefetch_remote_args["scheduling_strategy"] = self._local_scheduling
        else:
            # Use the scheduling strategy ("SPREAD" by default) provided in
            # `DataContext` to spread out prefetch tasks across the cluster
            # and avoid AWS S3 throttling errors.
            # Note: this is the same scheduling strategy used by read tasks.
            prefetch_remote_args[
                "scheduling_strategy"
            ] = DataContext.get_current().scheduling_strategy

        self._metadata = [
            ParquetFileMetadata(
                num_bytes=num_bytes,
            )
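For context, this options dict is the kind passed to `.options(...)` when launching Ray tasks. Below is a standalone sketch of how it would be applied; the `fetch_metadata` task and the `num_cpus` value are assumptions for illustration.

```python
import ray

NUM_CPUS_FOR_META_FETCH_TASK = 0.5  # assumed value, for illustration


@ray.remote
def fetch_metadata(paths):
    # Hypothetical stand-in for the datasource's metadata prefetch task.
    ...


prefetch_remote_args = {
    "num_cpus": NUM_CPUS_FOR_META_FETCH_TASK,
    # "SPREAD" distributes prefetch tasks across nodes, reducing the chance
    # of S3 throttling from many concurrent requests on one node.
    "scheduling_strategy": "SPREAD",
}
ref = fetch_metadata.options(**prefetch_remote_args).remote(
    ["s3://bucket/part-0.parquet"]
)
```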
srinathk10 left a comment:
Minor comments. LGTM.
Also, we can kick off the release test below:
name: tpch_q1_fixed_size
) -> List["pyarrow._dataset.ParquetFileFragment"]:
    return [p.deserialize() for p in serialized_fragments]

@staticmethod
def make_fragment(format, path, filesystem, partition_expression, file_size):
Can add type annotations here.

Agree in principle. These, however, are opaque deps we get from and wire back into PyArrow. I can obviously wire their types, but I don't think that's going to be very useful.
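For reference, a fully annotated version of the signature could look like the sketch below, with types inferred from how `deserialize` uses the same values; the enclosing class name here is hypothetical.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    import pyarrow.compute as pc
    import pyarrow.dataset as pads
    import pyarrow.fs as pafs


class _SerializedFragment:  # hypothetical container, for illustration
    @staticmethod
    def make_fragment(
        format: "pads.ParquetFileFormat",
        path: str,
        filesystem: "pafs.FileSystem",
        partition_expression: "pc.Expression",
        file_size: int,
    ) -> "pads.ParquetFileFragment":
        ...
```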
sample_infos = sample_fragments(
    # Sample a small number of Parquet files to estimate:
    # - Encoding ratio: ratio of file size on disk to approximate expected

# 'avg_row_in_mem_bytes' is None if the sampled file was empty and 0 if the data
# was all null.
if not sample_info.actual_bytes_per_row:
if not file_info or not file_info.avg_row_in_mem_bytes:
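A minimal sketch of the guard implied by this comment, where a single truthiness check covers both the empty-file (`None`) and all-null (`0`) cases; the function name and fallback ratio are assumptions.

```python
from typing import Optional


def _estimate_encoding_ratio(
    avg_row_in_mem_bytes: Optional[float],
    actual_bytes_per_row: Optional[float],
    default_ratio: float = 1.0,  # assumed fallback
) -> float:
    # None (empty file) and 0 (all-null data) are both falsy, so this
    # single check avoids a ZeroDivisionError in either case.
    if not avg_row_in_mem_bytes or not actual_bytes_per_row:
        return default_ratio
    return avg_row_in_mem_bytes / actual_bytes_per_row
```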
In line 630, should we do this instead:
max_parquet_reader_row_batch_size_bytes = ctx.target_max_block_size

Yeah, I wasn't happy about it from the beginning, but was thinking about leaving it as is and fixing it in a separate PR.
Now that I'm thinking about it, there's no good reason not to fix it right away.
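The suggested fix amounts to deriving the reader's row batch size from the context's target block size; a standalone sketch under assumed names:

```python
def _estimate_batch_size_rows(
    target_max_block_size: int, avg_row_in_mem_bytes: float
) -> int:
    # Cap each read batch so its estimated in-memory size stays within
    # ctx.target_max_block_size.
    return max(1, int(target_max_block_size // avg_row_in_mem_bytes))
```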
Commits:
…n disk file size (instead of uncompressed byte size)
Fixed batch-size estimation
Tidying up
Limit read-ahead buffer to 1 batch
Why are these changes needed?
This change is a follow-up to #56105. Dataset size estimation is now based on listed file sizes; however, the encoding ratio was still derived from file size estimates based on the uncompressed data size obtained from Parquet metadata.
This change addresses that by:
- Rebasing the encoding ratio to relate the estimated in-memory size to the listed file size (see the sketch below)
- Cleaning up unused abstractions (like `ParquetMetadataProvider`)
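Concretely, the rebased estimate reduces to a ratio of sampled in-memory bytes to listed on-disk bytes, which then scales file sizes into memory estimates; a sketch with assumed names:

```python
def encoding_ratio(sampled_in_mem_bytes: int, listed_file_size_bytes: int) -> float:
    # Ratio of decoded in-memory size to actual (compressed) size on disk.
    return sampled_in_mem_bytes / listed_file_size_bytes


# E.g. a 100 MiB Parquet file decoding to ~400 MiB in memory gives a ratio
# of 4.0, so: estimated_in_mem_size = listed_file_size * 4.0
```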
Related issue number

Checks
- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.
- [ ] I've made sure the tests are passing.
Testing Strategy
- [ ] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(