Skip to content

[Data] Add concurrency and deprecate parallelism for read_parquet APIs#42849

Merged
c21 merged 3 commits intoray-project:masterfrom
c21:read-concurrency
Feb 7, 2024
Merged

[Data] Add concurrency and deprecate parallelism for read_parquet APIs#42849
c21 merged 3 commits intoray-project:masterfrom
c21:read-concurrency

Conversation

@c21
Copy link
Contributor

@c21 c21 commented Jan 30, 2024

Why are these changes needed?

This PR is to add a new concurrency parameter for read APIs. The motivation is to allow users to control concurrency for read operator as well, other than map operators.

TODO: Add concurrency parameter for all read APIs besides read_parquet once we agree on the change. Otherwise too many documentation change needs to make.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Comment on lines +185 to +187
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read_parquet could launch more than one concurrent read task and this assertion would still pass. Is there any easy way to test that the concurrency is enforced?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, planning to add some test similar to est_backpressure_policies.py:test_basic. But good callout.

@c21 c21 requested a review from omatthew98 as a code owner February 7, 2024 01:11
c21 added 2 commits February 7, 2024 01:05
Signed-off-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Cheng Su <scnju13@gmail.com>
@c21 c21 force-pushed the read-concurrency branch from 044428e to 5480513 Compare February 7, 2024 09:05
@c21 c21 changed the title [Data] Add concurrency for read APIs [Data] Add concurrency and deprecate parallelism for read_parquet APIs Feb 7, 2024
Signed-off-by: Cheng Su <scnju13@gmail.com>
@raulchen
Copy link
Contributor

raulchen commented Feb 7, 2024

Just realized that we also need to update unit tests that use parallelism. but can we can do this in the next PR.

@c21
Copy link
Contributor Author

c21 commented Feb 7, 2024

Just realized that we also need to update unit tests that use parallelism. but can we can do this in the next PR.

Yes, those are okay and won't be broken. We can change in another PR.

@c21 c21 merged commit d9b296e into ray-project:master Feb 7, 2024
@c21 c21 deleted the read-concurrency branch February 7, 2024 20:21
ratnopam pushed a commit to ratnopam/ray that referenced this pull request Feb 11, 2024
ray-project#42849)

This PR is to add a new `concurrency` parameter for read APIs. The motivation is to allow users to control concurrency for read operator as well, other than map operators.

TODO: Add `concurrency` parameter for all read APIs besides `read_parquet` once we agree on the change. Otherwise too many documentation change needs to make.

Signed-off-by: Cheng Su <scnju13@gmail.com>
Signed-off-by: Ratnopam Chakrabarti <ratnopamc@yahoo.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants