Skip to content

[data]: url decode values in parse_hive_path#57625

Merged
bveeramani merged 17 commits intoray-project:masterfrom
lucaschadwicklam97:fix/urldecode_partition_vals
Nov 6, 2025
Merged

[data]: url decode values in parse_hive_path#57625
bveeramani merged 17 commits intoray-project:masterfrom
lucaschadwicklam97:fix/urldecode_partition_vals

Conversation

@lucaschadwicklam97
Copy link
Contributor

@lucaschadwicklam97 lucaschadwicklam97 commented Oct 10, 2025

url decoding partition values when read into arrow table

Why are these changes needed?

PyArrow URL-encodes partition values when writing to cloud storage. To ensure the values are consistent when you read them back, this PR updates the partitioning logic to URL-decode them. See apache/arrow#34905.

Closes #57564

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@lucaschadwicklam97 lucaschadwicklam97 changed the title [Fix]: url decode values in parse_hive_path [data]: url decode values in parse_hive_path Oct 10, 2025

for partition_col, value in partition_col_values.items():
# decode the value from path
value = urllib.parse.unquote(value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you still need this change? It looks partition_col_values already gets decoded values passed from line 522, where we extract _parse_partition_column_values.
So seems like your changes in PathPartitionParser seems to be enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverted. thanks @gvspraveen

@lucaschadwicklam97 lucaschadwicklam97 force-pushed the fix/urldecode_partition_vals branch from a5994ca to 8545daf Compare October 13, 2025 18:01
@iamjustinhsu
Copy link
Contributor

iamjustinhsu commented Oct 14, 2025

can you run ./ci/lint/lint.sh? This will make the linting tests happy

kv_pairs = [d.split("=") for d in dirs] if dirs else []
kv_pairs = dict([d.split("=") for d in dirs] if dirs else [])
# url decode the partition values
kv_pairs = {k: urllib.parse.unquote(v) for k, v in kv_pairs.items()}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have u also looked into unquote_plus? It will also decode + signs in addition to what unquote supports. Just wanna make sure we covered this too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@iamjustinhsu
Copy link
Contributor

Thanks for the contribution! I like the changes :) One of tests is failing test_path_partition_filter_hive in https://buildkite.com/ray-project/microcheck/builds/28298#0199cee6-f2eb-42e7-9cb8-69978fe0e8c5/619-912. Can you take a look?

Lucas Lam added 3 commits October 14, 2025 16:03
Signed-off-by: Lucas Lam <laml2@github.com>
Signed-off-by: Lucas Lam <laml2@github.com>
Signed-off-by: Lucas Lam <laml2@github.com>
@lucaschadwicklam97 lucaschadwicklam97 force-pushed the fix/urldecode_partition_vals branch from 0ddc613 to ced3448 Compare October 14, 2025 21:03
@lucaschadwicklam97 lucaschadwicklam97 marked this pull request as ready for review October 14, 2025 21:04
@lucaschadwicklam97 lucaschadwicklam97 requested a review from a team as a code owner October 14, 2025 21:04
@ray-gardener ray-gardener bot added data Ray Data-related issues community-contribution Contributed by the community labels Oct 15, 2025
@gvspraveen gvspraveen requested a review from bveeramani October 17, 2025 16:24
Copy link
Contributor

@iamjustinhsu iamjustinhsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Member

@bveeramani bveeramani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. Please add a test

"""
dirs = [d for d in dir_path.split("/") if d and (d.count("=") == 1)]
kv_pairs = [d.split("=") for d in dirs] if dirs else []
# url decode the partition values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test for this case?

@bveeramani
Copy link
Member

I think the more holistic fix is to URL-decode when we list files from S3/GCS/HTTP, but that's a larger scope change, and I don't think we need to do it immediately

cursor[bot]

This comment was marked as outdated.

Signed-off-by: Lucas Lam <laml2@github.com>
@lucaschadwicklam97 lucaschadwicklam97 force-pushed the fix/urldecode_partition_vals branch from c630ead to eeffe37 Compare October 17, 2025 21:56
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: lucaschadwicklam97 <52645624+lucaschadwicklam97@users.noreply.github.com>
cursor[bot]

This comment was marked as outdated.

Lucas Lam and others added 5 commits October 17, 2025 17:19
@github-actions
Copy link

github-actions bot commented Nov 6, 2025

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Nov 6, 2025
@lucaschadwicklam97
Copy link
Contributor Author

lucaschadwicklam97 commented Nov 6, 2025

@gvspraveen @iamjustinhsu any issue merging this PR? Added to existing test case and also double checked Cursor bot's recommendation to use unquote instead of unquote_plus. Its possible to have "+" literals in s3 paths, so unquote_plus("partition_value=foo+bar") == "foo bar" which is not what we want

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Copy link
Member

@bveeramani bveeramani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made some minor refactors. LGTM

@bveeramani bveeramani enabled auto-merge (squash) November 6, 2025 20:44
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Nov 6, 2025
@bveeramani bveeramani merged commit 5972d1d into ray-project:master Nov 6, 2025
7 of 8 checks passed
YoussefEssDS pushed a commit to YoussefEssDS/ray that referenced this pull request Nov 8, 2025
PyArrow URL-encodes partition values when writing to cloud storage. To
ensure the values are consistent when you read them back, this PR
updates the partitioning logic to URL-decode them. See
apache/arrow#34905.

Closes ray-project#57564

---------

Signed-off-by: Lucas Lam <laml2@github.com>
Signed-off-by: lucaschadwicklam97 <52645624+lucaschadwicklam97@users.noreply.github.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Lucas Lam <laml2@github.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
PyArrow URL-encodes partition values when writing to cloud storage. To
ensure the values are consistent when you read them back, this PR
updates the partitioning logic to URL-decode them. See
apache/arrow#34905.

Closes ray-project#57564

---------

Signed-off-by: Lucas Lam <laml2@github.com>
Signed-off-by: lucaschadwicklam97 <52645624+lucaschadwicklam97@users.noreply.github.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Lucas Lam <laml2@github.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
PyArrow URL-encodes partition values when writing to cloud storage. To
ensure the values are consistent when you read them back, this PR
updates the partitioning logic to URL-decode them. See
apache/arrow#34905.

Closes ray-project#57564

---------

Signed-off-by: Lucas Lam <laml2@github.com>
Signed-off-by: lucaschadwicklam97 <52645624+lucaschadwicklam97@users.noreply.github.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Lucas Lam <laml2@github.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
ykdojo pushed a commit to ykdojo/ray that referenced this pull request Nov 27, 2025
PyArrow URL-encodes partition values when writing to cloud storage. To
ensure the values are consistent when you read them back, this PR
updates the partitioning logic to URL-decode them. See
apache/arrow#34905.

Closes ray-project#57564

---------

Signed-off-by: Lucas Lam <laml2@github.com>
Signed-off-by: lucaschadwicklam97 <52645624+lucaschadwicklam97@users.noreply.github.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Lucas Lam <laml2@github.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: YK <1811651+ykdojo@users.noreply.github.com>
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
PyArrow URL-encodes partition values when writing to cloud storage. To
ensure the values are consistent when you read them back, this PR
updates the partitioning logic to URL-decode them. See
apache/arrow#34905.

Closes ray-project#57564

---------

Signed-off-by: Lucas Lam <laml2@github.com>
Signed-off-by: lucaschadwicklam97 <52645624+lucaschadwicklam97@users.noreply.github.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Lucas Lam <laml2@github.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025
PyArrow URL-encodes partition values when writing to cloud storage. To
ensure the values are consistent when you read them back, this PR
updates the partitioning logic to URL-decode them. See
apache/arrow#34905.

Closes ray-project#57564

---------

Signed-off-by: Lucas Lam <laml2@github.com>
Signed-off-by: lucaschadwicklam97 <52645624+lucaschadwicklam97@users.noreply.github.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Lucas Lam <laml2@github.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
PyArrow URL-encodes partition values when writing to cloud storage. To
ensure the values are consistent when you read them back, this PR
updates the partitioning logic to URL-decode them. See
apache/arrow#34905.

Closes ray-project#57564

---------

Signed-off-by: Lucas Lam <laml2@github.com>
Signed-off-by: lucaschadwicklam97 <52645624+lucaschadwicklam97@users.noreply.github.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Lucas Lam <laml2@github.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contributed by the community data Ray Data-related issues go add ONLY when ready to merge, run all tests stale The issue is stale. It will be closed within 7 days unless there are further conversation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Data] partition_cols field is url encoded but not decoded when read back

4 participants