REG: Fix read_parquet from file-like objects #34500
I merged that PR
The original version (before #33632) used
pandas/tests/io/test_parquet.py
Outdated
buffer = BytesIO()
df_compat.to_parquet(buffer)
df_from_buf = pd.read_parquet(buffer)
print(df_from_buf)
Don't check in print statements?
Yep thanks - I’m about to tidy this up.
Agree - previously untested and I missed this in #33632 - apologies for that. I've added a test case for this. @jorisvandenbossche
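For reference, the file-like round trip that such a test exercises can be sketched roughly as follows. This is a minimal illustration, not the test added in this PR: the frame contents and the helper name `roundtrip_through_buffer` are made up for the example, and it assumes pandas with the pyarrow engine is installed (the sketch is skipped otherwise).

```python
from io import BytesIO

try:
    import pandas as pd
    import pyarrow  # noqa: F401  # engine used by to_parquet/read_parquet below
except ImportError:
    pd = None  # optional dependencies missing; the sketch is skipped


def roundtrip_through_buffer(df):
    """Write `df` to an in-memory buffer and read it back (hypothetical helper)."""
    buffer = BytesIO()
    df.to_parquet(buffer, engine="pyarrow")
    # Rewind explicitly so the read does not depend on where the writer left the cursor.
    buffer.seek(0)
    return pd.read_parquet(buffer, engine="pyarrow")


if pd is not None:
    # Illustrative data only; the real test uses the df_compat fixture.
    original = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, 1.5, 2.5]})
    restored = roundtrip_through_buffer(original)
    pd.testing.assert_frame_equal(restored, original)
```

Comparing the frames with `assert_frame_equal`, rather than printing the result, keeps the check silent on success.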
Hello @alimcmaster1! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-06-12 18:15:00 UTC
result = parquet_ds.read_pandas(**kwargs).to_pandas()
fs = get_fs_for_path(path)
should_close = None
# Avoid calling get_filepath_or_buffer for s3/gcs URLs since
We have some similar logic on the fastparquet side. Should consolidate in the future: https://github.com/pandas-dev/pandas/blob/master/pandas/io/parquet.py#L188
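If this logic is consolidated later, the shared part is essentially a decision about what kind of input was passed. The sketch below is only an illustration of that branching, not the pandas implementation: the function name `classify_parquet_source`, the prefix list, and the returned labels are all assumptions made for the example.

```python
from io import BytesIO

# Assumed set of remote-filesystem prefixes; the real list lives in pandas and may differ.
REMOTE_FS_PREFIXES = ("s3://", "s3a://", "s3n://", "gs://", "gcs://")


def classify_parquet_source(path):
    """Return a label for the branch a reader might take (hypothetical helper)."""
    if hasattr(path, "read"):
        # Already an open file-like object: hand it to the engine unchanged.
        return "file-like"
    if isinstance(path, str) and path.startswith(REMOTE_FS_PREFIXES):
        # Remote filesystem URL: let the filesystem layer open it.
        return "remote-filesystem"
    # Local path or plain http(s) URL: resolve it to a path or buffer first.
    return "path-or-url"


print(classify_parquet_source(BytesIO()))
print(classify_parquet_source("s3://bucket/data.parquet"))
print(classify_parquet_source("data.parquet"))
```

Both engines could then share one helper like this and differ only in how they consume each kind of source.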
Looks good to me.
The URL still needs to be changed before merging?
Do you think this should be safe to use for 1.0.5? (as the other option is to revert the original PR for 1.0.5, and keep this for master only)
lgtm. i think ok for 1.0.5
pandas/tests/io/test_parquet.py
Outdated
def test_parquet_read_from_url(self, df_compat):
    # TODO:alimcmaster1 update with master URL
    url = (
        "https://raw.githubusercontent.com/alimcmaster1/pandas/"
This might fail due to rate limits from GitHub?
Yep fair point - we already do this in test_network.py and I think the decorator helps handle any failures. We could use https://pypi.org/project/pytest-localserver/ ? Also, I couldn't find docs that say what the rate limits are for raw.githubusercontent endpoints.
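To illustrate the idea of a decorator absorbing network failures, here is a minimal stdlib-only sketch. It is not the decorator pandas uses in test_network.py: the name `skip_on_network_error` and the chosen exception list are assumptions for the example.

```python
import functools
import unittest
import urllib.error


def skip_on_network_error(test_func):
    """Turn network-related errors into a skipped test (hypothetical decorator)."""

    @functools.wraps(test_func)
    def wrapper(*args, **kwargs):
        try:
            return test_func(*args, **kwargs)
        except (urllib.error.URLError, ConnectionError, TimeoutError) as err:
            # A rate limit or outage should not be reported as a test failure.
            raise unittest.SkipTest(f"network unavailable: {err}") from err

    return wrapper


@skip_on_network_error
def fetch_that_gets_rate_limited():
    # Simulated failure; a real test would download the .parquet file here.
    raise urllib.error.URLError("rate limited")
```

Errors that are not network related still propagate, so genuine regressions in the reader would continue to fail the test.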
I was planning on updating the URL post merge. Other option is I can open a separate PR with just the .parquet file so it exists on master. I think it should be fine for 1.0.5 - any additional test cases you can think of would be helpful.
I would propose to keep this for 1.1 (and we reverted to the original patch in the 1.0.x branch for 1.0.5). @alimcmaster1 you can remove the whatsnew note then? (we still need to add a similar line to the v1.0.5.txt, but that should be done in a separate PR)
Yes makes sense - I’ll do this tomorrow.
this now doesn't close the issue as that's actually marked for 1.0.5? The plan is to patch that separately right?
IIUC, all that needs to be done is move the release note to 1.1.0.rst. I'll do that now.
Actually, I'm sufficiently confused about what the appropriate whatsnew to describe the changes from 1.0.5 to 1.1 is, so I'll leave that to you @alimcmaster1.
In this PR, the whatsnew simply needs to be removed (this is fixing a regression compared to master, so doesn't need a whatsnew). Describing what's changed in 1.0.5, that's for a separate PR that gets backported. Will do that now
I also updated the URL to point to the master branch, so this is going to fail anyway here, thus merging directly
Thanks @alimcmaster1 !
Thanks for fixing up @jorisvandenbossche - apologies I didn’t get to this. I’ll add the 1.0.5 whatsnew note you mentioned
… read_parquet from file-like objects) Co-authored-by: Joris Van den Bossche <[email protected]>
…uet from file-like objects) (#34787) Co-authored-by: Joris Van den Bossche <[email protected]> Co-authored-by: alimcmaster1 <[email protected]>
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Use arrow parquet.read_table opposed to ParquetDataset
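The last item refers to pyarrow's `parquet.read_table`. A minimal sketch of calling it on a file-like object is shown below, assuming pyarrow is installed (the sketch is skipped otherwise). The table contents are made up, and this is not the code from the PR.

```python
from io import BytesIO

try:
    import pyarrow as pa
    import pyarrow.parquet as pq
except ImportError:
    pa = None  # optional dependency missing; the sketch is skipped

if pa is not None:
    # Illustrative data only.
    table = pa.table({"a": [1, 2, 3]})

    buffer = BytesIO()
    pq.write_table(table, buffer)
    buffer.seek(0)  # rewind before handing the buffer to the reader

    # read_table accepts a file-like object as well as a path.
    restored = pq.read_table(buffer)
    frame_ready = restored.num_rows == table.num_rows
```

From here, `restored.to_pandas()` would give the DataFrame, provided pandas is also available.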