[data]: url decode values in parse_hive_path #57625
bveeramani merged 17 commits into ray-project:master
for partition_col, value in partition_col_values.items():
    # decode the value from path
    value = urllib.parse.unquote(value)
Do you still need this change? It looks like partition_col_values already gets decoded values passed from line 522, where we extract _parse_partition_column_values.
So it seems your changes in PathPartitionParser are enough?
a5994ca to 8545daf
can you run
kv_pairs = [d.split("=") for d in dirs] if dirs else []
kv_pairs = dict([d.split("=") for d in dirs] if dirs else [])
# url decode the partition values
kv_pairs = {k: urllib.parse.unquote(v) for k, v in kv_pairs.items()}
Have you also looked into unquote_plus? It decodes + signs (as spaces) in addition to everything unquote handles. Just want to make sure we've covered this too.
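For reference, a minimal sketch of how the two standard-library decoders differ on a partition value. The sample strings are made up for illustration and are not taken from this PR; which decoder is appropriate depends on how the writer encoded the path.

```python
from urllib.parse import quote, unquote, unquote_plus

# Hypothetical encoded partition value with a space (%20),
# a literal "+", and a slash (%2F).
encoded = "a%20b+c%2Fd"

plain = unquote(encoded)      # leaves the literal "+" untouched
plus = unquote_plus(encoded)  # additionally turns "+" into a space

print(plain)  # a b+c/d
print(plus)   # a b c/d

# quote() percent-encodes "+" as %2B, so both decoders
# round-trip a value that was encoded this way.
round_trip = quote("1+1 GB", safe="")
print(round_trip)  # 1%2B1%20GB
```

So the two only disagree when the path contains a literal +, which unquote_plus would rewrite as a space.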
Thanks for the contribution! I like the changes :) One of the tests is failing
Signed-off-by: Lucas Lam <laml2@github.com>
0ddc613 to ced3448
…aschadwicklam97/ray into fix/urldecode_partition_vals
bveeramani left a comment
Overall LGTM. Please add a test.
"""
dirs = [d for d in dir_path.split("/") if d and (d.count("=") == 1)]
kv_pairs = [d.split("=") for d in dirs] if dirs else []
# url decode the partition values
Could you add a test for this case?
I think the more holistic fix is to URL-decode when we list files from S3/GCS/HTTP, but that's a larger scope change, and I don't think we need to do it immediately.
c630ead to eeffe37
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: lucaschadwicklam97 <52645624+lucaschadwicklam97@users.noreply.github.com>
This pull request has been automatically marked as stale because it has not had … You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
@gvspraveen @iamjustinhsu any issue merging this PR? Added to existing test case and also double checked Cursor bot's recommendation to use
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
bveeramani left a comment
Made some minor refactors. LGTM
PyArrow URL-encodes partition values when writing to cloud storage. To ensure the values are consistent when you read them back, this PR updates the partitioning logic to URL-decode them. See apache/arrow#34905.
Closes ray-project#57564
---------
Signed-off-by: Lucas Lam <laml2@github.com>
Signed-off-by: lucaschadwicklam97 <52645624+lucaschadwicklam97@users.noreply.github.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: Lucas Lam <laml2@github.com>
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
URL-decode partition values when they are read into an Arrow table
Why are these changes needed?
PyArrow URL-encodes partition values when writing to cloud storage. To ensure the values are consistent when you read them back, this PR updates the partitioning logic to URL-decode them. See apache/arrow#34905.
Closes #57564
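A rough sketch of the decoding step described above, assuming a hive-style path made of key=value directories. The function name parse_hive_path_sketch and the example path are hypothetical; this is not the code merged in Ray, only an illustration of the idea built from the lines quoted in the review.

```python
import urllib.parse
from typing import Dict


def parse_hive_path_sketch(dir_path: str) -> Dict[str, str]:
    """Extract key=value partition directories and URL-decode the values."""
    # Keep only path segments that look like a single key=value pair.
    dirs = [d for d in dir_path.split("/") if d and d.count("=") == 1]
    kv_pairs = dict(d.split("=") for d in dirs)
    # URL-decode the partition values so they match what was written.
    return {k: urllib.parse.unquote(v) for k, v in kv_pairs.items()}


# Hypothetical path with URL-encoded partition values.
example = "bucket/data/country=United%20States/ts=2024-01-01%2000%3A00%3A00/part-0.parquet"
parsed = parse_hive_path_sketch(example)
print(parsed)
# {'country': 'United States', 'ts': '2024-01-01 00:00:00'}
```

Without the unquote step, the same path would yield 'United%20States' and '2024-01-01%2000%3A00%3A00', which would not match the values originally written.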