REG: Fix read_parquet from file-like objects #34500

alimcmaster1 · 2020-05-31T16:58:23Z

xref BUG: read_parquet no longer supports file-like objects #34467
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry - waiting on DOC: start 1.0.5 #34481

Use arrow parquet.read_table as opposed to ParquetDataset

jorisvandenbossche · 2020-05-31T18:05:04Z

whatsnew entry - waiting on #34481

I merged that PR

jorisvandenbossche · 2020-05-31T18:08:43Z

The original version (before #33632) used get_filepath_or_buffer, which e.g. also makes it possible to read from URLs. Do you know if that is tested?

shubh2u · 2020-06-01T04:57:10Z

pandas/tests/io/test_parquet.py

+        buffer = BytesIO()
+        df_compat.to_parquet(buffer)
+        df_from_buf = pd.read_parquet(buffer)
+        print(df_from_buf)


Don't check in print statements?

Yep thanks - I’m about to tidy this up.
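The round trip exercised by the snippet above can be checked with an assertion instead of a print. A minimal sketch, assuming pandas and the pyarrow engine are installed; the helper name `roundtrip_through_buffer` is hypothetical and is not the test that was merged:

```python
from io import BytesIO


def roundtrip_through_buffer(df):
    """Write ``df`` to an in-memory parquet file and read it back.

    Hypothetical helper for illustration only; it assumes the pyarrow
    engine is available.
    """
    # Imported lazily so the helper can be defined without pandas installed.
    import pandas as pd

    buffer = BytesIO()
    df.to_parquet(buffer, engine="pyarrow")
    # Rewind so that readers which start at the current position also work.
    buffer.seek(0)
    return pd.read_parquet(buffer, engine="pyarrow")
```

In a test, the result would be compared with the input via `pd.testing.assert_frame_equal(roundtrip_through_buffer(df), df)` rather than printed.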

alimcmaster1 · 2020-06-01T17:13:50Z

The original version (before #33632) used get_filepath_or_buffer, which e.g. also makes it possible to read from URLs. Do you know if that is tested?

Agree - previously untested and I missed this in #33632 - apologies for that. I've added a test case for this. @jorisvandenbossche

pep8speaks · 2020-06-02T09:05:43Z

Hello @alimcmaster1! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-06-12 18:15:00 UTC

alimcmaster1 · 2020-06-02T18:57:08Z

pandas/io/parquet.py

-        result = parquet_ds.read_pandas(**kwargs).to_pandas()
+        fs = get_fs_for_path(path)
+        should_close = None
+        # Avoid calling get_filepath_or_buffer for s3/gcs URLs since


We have some similar logic on the fastparquet side. Should consolidate in the future: https://github.com/pandas-dev/pandas/blob/master/pandas/io/parquet.py#L188
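The branching hinted at in the diff above (file-like objects, s3/gcs URLs, and other paths or URLs handled differently) can be sketched with the standard library alone. This is an illustration of the idea, not the code in pandas/io/parquet.py; the function name, the scheme sets, and the returned labels are all assumptions:

```python
from urllib.parse import urlparse

# Assumed scheme lists for illustration; pandas' real checks may differ.
REMOTE_FS_SCHEMES = {"s3", "s3a", "s3n", "gs", "gcs"}
PLAIN_URL_SCHEMES = {"http", "https", "ftp"}


def classify_parquet_source(source):
    """Return a label describing how ``source`` would need to be opened.

    Hypothetical helper: "buffer" for file-like objects, "remote_filesystem"
    for object-store URLs, "url" for plain HTTP/FTP URLs, and "local_path"
    for everything else.
    """
    # Anything with a read() method is treated as an already-open buffer.
    if hasattr(source, "read"):
        return "buffer"
    scheme = urlparse(str(source)).scheme.lower()
    if scheme in REMOTE_FS_SCHEMES:
        return "remote_filesystem"
    if scheme in PLAIN_URL_SCHEMES:
        return "url"
    # No scheme, or an unrelated one such as a Windows drive letter.
    return "local_path"
```

Keeping the decision in one small function is one way the pyarrow and fastparquet code paths could share it, which is the consolidation the comment above suggests for the future.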

jorisvandenbossche

Looks good to me.
The url still needs to be changed before merging?

Do you think this should be safe to use for 1.0.5? (as the other option is to revert the original PR for 1.0.5, and keep this for master only)

cc @simonjayhawkins

jreback

lgtm. i think ok for 1.0.5

martindurant · 2020-06-04T13:26:52Z

pandas/tests/io/test_parquet.py

+    def test_parquet_read_from_url(self, df_compat):
+        # TODO:alimcmaster1 update with master URL
+        url = (
+            "https://raw.githubusercontent.com/alimcmaster1/pandas/"


This might fail due to rate limits from github?

Yep fair point - we already do this in test_network.py and I think the decorator helps handle any failures. We could use https://pypi.org/project/pytest-localserver/? Also, I couldn't find docs that suggest what the rate limits are for raw.githubusercontent endpoints.
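For reference, the local-server idea can also be sketched without any plugin, using only the standard library. This is not what the PR does (it reads from a raw.githubusercontent.com URL); `serve_directory` and `QuietHandler` are hypothetical names:

```python
import functools
import threading
from contextlib import contextmanager
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class QuietHandler(SimpleHTTPRequestHandler):
    """Static file handler that does not log every request to stderr."""

    def log_message(self, format, *args):
        pass


@contextmanager
def serve_directory(directory):
    """Serve ``directory`` over HTTP on a free localhost port.

    Yields the base URL, e.g. "http://127.0.0.1:<port>", and shuts the
    server down when the ``with`` block exits.
    """
    handler = functools.partial(QuietHandler, directory=str(directory))
    # Port 0 asks the operating system for any free port.
    server = ThreadingHTTPServer(("127.0.0.1", 0), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        yield f"http://127.0.0.1:{server.server_address[1]}"
    finally:
        server.shutdown()
        server.server_close()
        thread.join()
```

A test could write a parquet file into a temporary directory, enter `serve_directory(tmp_dir)`, and pass the resulting URL to `pd.read_parquet`, which avoids any dependency on GitHub rate limits.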

alimcmaster1 · 2020-06-04T22:59:46Z

Looks good to me.
The url still needs to be changed before merging?

Do you think this should be safe to use for 1.0.5? (as the other option is to revert the original PR for 1.0.5, and keep this for master only)

cc @simonjayhawkins

I was planning on updating the URL post merge. The other option is that I open a separate PR with just the .parquet file so it exists on master. I think it should be fine for 1.0.5 - any additional test cases you can think of would be helpful.

jorisvandenbossche · 2020-06-10T06:39:40Z

I would propose to keep this for 1.1 (and we reverted to original patch in the 1.0.x branch for 1.0.5). @alimcmaster1 you can remove the whatsnew note then? (we still need to add a similar line to the v1.0.5.txt, but that should be done in a separate PR)

alimcmaster1 · 2020-06-11T22:32:14Z

I would propose to keep this for 1.1 (and we reverted to original patch in the 1.0.x branch for 1.0.5). @alimcmaster1 you can remove the whatsnew note then? (we still need to add a similar line to the v1.0.5.txt, but that should be done in a separate PR)

Yes makes sense - I’ll do this tomorrow.

jreback · 2020-06-12T17:14:50Z

this now doesn't close the issue as that's actually marked for 1.0.5?

The plan is to patch that separately right?

TomAugspurger · 2020-06-12T17:29:36Z

IIUC, all that needs to be done is move the release note to 1.1.0.rst. I'll do that now.

TomAugspurger · 2020-06-12T17:33:35Z

Actually, I'm sufficiently confused about what the appropriate whatsnew to describe the changes from 1.0.5 to 1.1 is, so I'll leave that to you @alimcmaster1.

jorisvandenbossche · 2020-06-12T18:08:58Z

In this PR, the whatsnew simply needs to be removed (this is fixing a regression compared to master, so doesn't need a whatsnew). Describing what's changed in 1.0.5 is for a separate PR that gets backported.

Will do that now

jorisvandenbossche · 2020-06-12T18:16:56Z

I also updated the URL to point to the master branch, so this is going to fail anyway here, thus merging directly

jorisvandenbossche · 2020-06-12T18:17:19Z

Thanks @alimcmaster1 !

alimcmaster1 · 2020-06-12T22:33:50Z

Thanks for fixing up @jorisvandenbossche - apologies I didn’t get to this. I’ll add the 1.0.5 whatsnew note you mentioned


…uet from file-like objects) (#34787) Co-authored-by: Joris Van den Bossche <[email protected]> Co-authored-by: alimcmaster1 <[email protected]>

Use arrow parquet.read_table opposed to ParquetDataset

06c2696

alimcmaster1 added IO Parquet, Bug, Regression and removed Bug labels May 31, 2020

alimcmaster1 changed the title ~~BUG: Fix regression in read_parquet from file-like objects~~ REG: Fix read_parquet from file-like objects May 31, 2020

alimcmaster1 mentioned this pull request May 31, 2020

BUG: read_parquet no longer supports file-like objects #34467

Closed

alimcmaster1 added this to the 1.0.5 milestone May 31, 2020

shubh2u reviewed Jun 1, 2020


alimcmaster1 added 2 commits June 1, 2020 18:02

Merge remote-tracking branch 'upstream/master' into mcmali-parq-fix

03179ea

Importer skip

3f1496b

alimcmaster1 added 2 commits June 1, 2020 23:20

Add simple parquet file for read url tests

8122015

Parquet read from url tests

8cdf763

alimcmaster1 added 2 commits June 2, 2020 19:30

Handle S3 URLs seperately

daeb150

Add whatsnew

ee32b3d

alimcmaster1 commented Jun 2, 2020


alimcmaster1 added 3 commits June 2, 2020 20:05

Read file like fastparquet and pyarrow

9fa3178

Test just pyarrow

6ee9974

Skip if no arrow

92a883d

alimcmaster1 requested a review from jorisvandenbossche June 3, 2020 10:58

jorisvandenbossche reviewed Jun 3, 2020


jreback approved these changes Jun 3, 2020


jorisvandenbossche mentioned this pull request Jun 4, 2020

ENH: add fsspec support #34266

Merged

5 tasks

martindurant reviewed Jun 4, 2020


jorisvandenbossche mentioned this pull request Jun 7, 2020

BUG: s3 reads from public buckets not working #34626

Closed

3 tasks

jorisvandenbossche modified the milestones: 1.0.5, 1.1 Jun 10, 2020

remove whatsnew

5a15f4f

jorisvandenbossche approved these changes Jun 12, 2020


update url

882f5a8

jorisvandenbossche merged commit 57d056a into pandas-dev:master Jun 12, 2020

simonjayhawkins mentioned this pull request Jun 15, 2020

RLS: 1.0.5 #34684

Closed

simonjayhawkins pushed a commit to simonjayhawkins/pandas that referenced this pull request Jun 15, 2020

Backport Test Only from PR pandas-dev#34500 on branch 1.0.x (REG: Fix…

b17884b

… read_parquet from file-like objects) Co-authored-by: Joris Van den Bossche <[email protected]>

simonjayhawkins mentioned this pull request Jun 15, 2020

Backport Test Only from PR #34500 on branch 1.0.x (REG: Fix read_parquet from file-like objects) #34787

Merged

luke396 mentioned this pull request Jun 29, 2024

BUG: read_parquet wrongly returns empty index if asked to read empty column list #59028

Open

3 tasks
