Skip to content

REG: Fix read_parquet from file-like objects #34500

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Jun 12, 2020

Conversation

alimcmaster1
Copy link
Member

@alimcmaster1 alimcmaster1 commented May 31, 2020

Use arrow parquet.read_table opposed to ParquetDataset

@alimcmaster1 alimcmaster1 added IO Parquet parquet, feather Bug Regression Functionality that used to work in a prior pandas version and removed Bug labels May 31, 2020
@alimcmaster1 alimcmaster1 changed the title BUG: Fix regression in read_parquet from file-like objects REG: Fix read_parquet from file-like objects May 31, 2020
@alimcmaster1 alimcmaster1 added this to the 1.0.5 milestone May 31, 2020
@jorisvandenbossche
Copy link
Member

whatsnew entry - waiting on #34481

I merged that PR

@jorisvandenbossche
Copy link
Member

The original version (before #33632) used get_filepath_or_buffer, which eg also enables to read from urls. Do you know if that is tested?

buffer = BytesIO()
df_compat.to_parquet(buffer)
df_from_buf = pd.read_parquet(buffer)
print(df_from_buf)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't checkin print statements?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep thanks - I’m about to tidy this up.

@alimcmaster1
Copy link
Member Author

alimcmaster1 commented Jun 1, 2020

The original version (before #33632) used get_filepath_or_buffer, which eg also enables to read from urls. Do you know if that is tested?

Agree - previously untested and I missed this in #33632 - apologies for that. I've added a test case for this. @jorisvandenbossche

@pep8speaks
Copy link

pep8speaks commented Jun 2, 2020

Hello @alimcmaster1! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-06-12 18:15:00 UTC

result = parquet_ds.read_pandas(**kwargs).to_pandas()
fs = get_fs_for_path(path)
should_close = None
# Avoid calling get_filepath_or_buffer for s3/gcs URLs since
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have some similar logic on the fastparquet side. Should consolidate in the future: https://github.com/pandas-dev/pandas/blob/master/pandas/io/parquet.py#L188

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.
The url still needs to be changed before merging?

Do you think this should be safe to use for 1.0.5? (as the other option is to revert the original PR for 1.0.5, and keep this for master only)

cc @simonjayhawkins

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. i think ok for 1.0.5

@jorisvandenbossche jorisvandenbossche mentioned this pull request Jun 4, 2020
5 tasks
def test_parquet_read_from_url(self, df_compat):
# TODO:alimcmaster1 update with master URL
url = (
"https://raw.githubusercontent.com/alimcmaster1/pandas/"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might fail due to rate limits from github?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep fair point - we already do this in test_network.py and I think the decorator helps handle any failures. We could use https://pypi.org/project/pytest-localserver/ ? also I couldnt find docs that suggest what the rate limits are for raw.githubusercontent endpoints?

@alimcmaster1
Copy link
Member Author

alimcmaster1 commented Jun 4, 2020

Looks good to me.
The url still needs to be changed before merging?

Do you think this should be safe to use for 1.0.5? (as the other option is to revert the original PR for 1.0.5, and keep this for master only)

cc @simonjayhawkins

I was planning on updating the URL post merge. Other option is I can open a separate PR with just .parquet file so it exists on master. I think should be fine for 1.0.5 - any additional test cases you can think of would be helpful.

@jorisvandenbossche jorisvandenbossche modified the milestones: 1.0.5, 1.1 Jun 10, 2020
@jorisvandenbossche
Copy link
Member

I would propose to keep this for 1.1 (and we reverted to original patch in the 1.0.x branch for 1.0.5). @alimcmaster1 you can remove the whatsnew note then? (we still need to add a similar line to the v1.0.5.txt, but that should be done in a separate PR)

@alimcmaster1
Copy link
Member Author

I would propose to keep this for 1.1 (and we reverted to original patch in the 1.0.x branch for 1.0.5). @alimcmaster1 you can remove the whatsnew note then? (we still need to add a similar line to the v1.0.5.txt, but that should be done in a separate PR)

Yes makes sense - I’ll do this tomorrow.

@jreback
Copy link
Contributor

jreback commented Jun 12, 2020

this now doesn't close the issue as that's actually marked for 1.0.5?

The plan is to patch that separately right?

@TomAugspurger
Copy link
Contributor

IIUC, all that needs to be done is move the release note to 1.1.0.rst. I'll do that now.

@TomAugspurger
Copy link
Contributor

Actually, I'm sufficintly confused about what the appropriate whatsnew to describe the changes from 1.0.5 to 1.1 is, so I'll leave that to you @alimcmaster1.

@jorisvandenbossche
Copy link
Member

In this PR, the whatsnew only needs to be simply removed (this is fixing a regression compared to master, so doesn't need a whatsnew). Describing what's changed in 1.0.5, that's for a separate PR that gets backported.

Will do that now

@jorisvandenbossche
Copy link
Member

I also updated the URL to point to the master branch, so this is going to fail anyway here, thus merging directly

@jorisvandenbossche jorisvandenbossche merged commit 57d056a into pandas-dev:master Jun 12, 2020
@jorisvandenbossche
Copy link
Member

Thanks @alimcmaster1 !

@alimcmaster1
Copy link
Member Author

Thanks for fixing up @jorisvandenbossche - apologies I didn’t get to this. I’ll add the 1.0.5 whatsnew note you mentioned

@simonjayhawkins simonjayhawkins mentioned this pull request Jun 15, 2020
simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request Jun 15, 2020
… read_parquet from file-like objects)

Co-authored-by: Joris Van den Bossche <[email protected]>
jorisvandenbossche added a commit that referenced this pull request Jun 15, 2020
…uet from file-like objects) (#34787)

Co-authored-by: Joris Van den Bossche <[email protected]>
Co-authored-by: alimcmaster1 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Parquet parquet, feather Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants