BUG: read_parquet, to_parquet for s3 destinations #19135
Conversation
Thanks. Could you add
- tests (with a reference to the issue)
- a release note in whatsnew/v0.23.0.txt
We have some other tests that use moto. Search for those to see how to structure them, and let me know if you need any guidance.
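For anyone following along, a minimal sketch of what a moto-backed S3 test can look like, assuming moto and boto3 are installed; the fixture name and bucket here are illustrative, not the actual pandas fixtures:

import boto3
import pytest
from moto import mock_s3

@pytest.fixture
def s3_bucket():
    # everything inside the context manager hits moto's in-memory S3,
    # so no real AWS calls or credentials are needed
    with mock_s3():
        conn = boto3.resource('s3', region_name='us-east-1')
        conn.create_bucket(Bucket='pandas-test')
        yield conn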
pandas/io/common.py (outdated)
@@ -169,7 +169,7 @@ def _stringify_path(filepath_or_buffer):
 def get_filepath_or_buffer(filepath_or_buffer, encoding=None,
-                           compression=None):
+                           compression=None, mode='rb'):
Add this to the Parameters section of the docstring, and note that it's only really used for S3 files.
Codecov Report
@@ Coverage Diff @@
## master #19135 +/- ##
==========================================
+ Coverage 91.52% 91.56% +0.04%
==========================================
Files 147 148 +1
Lines 48775 48882 +107
==========================================
+ Hits 44639 44759 +120
+ Misses 4136 4123 -13
Continue to review full report at Codecov.
Hello @maximveksler! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on January 17, 2018 at 10:47 UTC
@TomAugspurger any help on why the unit tests can't find pyarrow / fastparquet?
Moto had some issues yesterday. Looking into it in a bit.
Looks like there are some listing errors too.
Doesn't look like a moto issue, more like a unit test environment configuration problem. But I might be wrong here... I had just a quick glance and couldn't spot the issue.
More specifically:
pandas/tests/io/test_s3.py (outdated)

 class TestS3URL(object):

     def test_is_s3_url(self):
         assert _is_s3_url("s3://pandas/somethingelse.com")
         assert not _is_s3_url("s4://pandas/somethingelse.com")

+class TestIntegration(object):
Move this to pandas/tests/io/test_parquet.py and make a new class TestS3 that's marked with tm.network. Then your test methods should take an argument s3_resource to use that fixture (pandas/tests/io/conftest.py, line 29 in c753e1e: def s3_resource(tips_file, jsonl_file):). That will take care of all the skipping / mocking for you. You just have to write the test at that point.
pandas/tests/io/test_parquet.py (outdated)
@@ -486,3 +486,20 @@ def test_filter_row_groups(self, fp):
                         row_group_offsets=1)
         result = read_parquet(path, fp, filters=[('a', '==', 0)])
         assert len(result) == 1
+
+class TestIntegrationWithS3(Base):
+    def test_s3_roundtrip(self):
This should take an s3_resource. Then remove everything below except for:
+ expected = pd.DataFrame({'A': [1, 2, 3], 'B': 'foo'})
+ expected.to_parquet('s3://pandas-test/test.parquet')
+ actual = pd.read_parquet('s3://pandas-test/test.parquet')
+
+ tm.assert_frame_equal(actual, expected)
+
pandas/tests/io/test_s3.py (outdated)
@@ -6,3 +6,4 @@ class TestS3URL(object):
     def test_is_s3_url(self):
         assert _is_s3_url("s3://pandas/somethingelse.com")
         assert not _is_s3_url("s4://pandas/somethingelse.com")
+
This may cause the linter to fail, not sure.
pandas/io/common.py (outdated)
@@ -179,6 +179,7 @@ def get_filepath_or_buffer(filepath_or_buffer, encoding=None,
 filepath_or_buffer : a url, filepath (str, py.path.local or pathlib.Path),
                      or buffer
 encoding : the encoding to use to decode py3 bytes, default is 'utf-8'
+mode : One of 'rb' or 'wb' or 'ab'. default: 'rb'
@jreback thoughts on this parameter? I think it'd be better to just remove it, and hardcode mode='wb' in the call to s3.get_filepath_or_buffer down below. That's essentially what we do for URLs with the BytesIO.
@jreback hard-coding leads to a failure to read from S3: s3fs raises an exception (https://github.com/dask/s3fs/blob/master/s3fs/core.py#L1005).
I've decided to thread it through the whole call chain precisely for this reason. It might be possible to change the s3fs implementation, because as far as I know S3 objects don't have a read/write notion in them, or to split the pandas code into two functions, get_readable_filepath_or_buffer and get_writable_filepath_or_buffer, but I don't feel I know the code base well enough to judge.
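For illustration, a hedged sketch of how pandas/io/s3.py could thread the new mode parameter down to s3fs (the real implementation's schema stripping and anonymous-access fallback are omitted here):

import s3fs

def get_filepath_or_buffer(filepath_or_buffer, encoding=None,
                           compression=None, mode='rb'):
    fs = s3fs.S3FileSystem(anon=False)
    # s3fs raises when a file opened for reading is asked to write (and
    # vice versa), so the caller has to pick the right mode up front
    filepath_or_buffer = fs.open(filepath_or_buffer.replace('s3://', ''),
                                 mode)
    return filepath_or_buffer, None, compression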
Ah, yes, sorry, I missed that.
In that case let's update the docstring to be numpydoc compliant: http://numpydoc.readthedocs.io/en/latest/format.html#sections
mode : {'rb', 'wb', 'ab'}
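A sketch of what that numpydoc-compliant Parameters entry could look like (the wording here is illustrative):

mode : {'rb', 'wb', 'ab'}, default 'rb'
    Mode used when opening the file. Only honored for S3 destinations,
    where writing requires opening the S3File in 'wb' mode.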
NP, my pleasure.
yeah, this parameter should not be in the common call, but rather in the s3-specific one
Looks like you'll need to skip the test if the engine isn't available. I think that [...]
You should just have to accept an engine argument.
Hey guys, it's my 3rd attempt to fix the unit test. Not too excited to spam the commit history with CI A/B-testing attempts :) Any way to reproduce the failing CI locally?
@TomAugspurger cool, looking into it, and thanks! Q: if tests are failing because the code can't find pyarrow / fastparquet, and the fixture will cause the test to be skipped when they can't be found, then... won't that defeat the whole purpose of the unit test?! Can't we just have the CI server install pyarrow / fastparquet and make it available to pytest instead?
The failure at https://circleci.com/gh/pandas-dev/pandas/8985#tests/containers/2 was because pyarrow / fastparquet weren't installed, and the test wasn't skipped. If you take an engine argument (and pass it through to the writer / reader) it should be OK. To test locally, uninstall both pyarrow and fastparquet. The test should be skipped.
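A hedged sketch of what accepting such an engine argument can look like as a pytest fixture (the real fixtures live in pandas/tests/io/test_parquet.py; this one is illustrative):

import pytest

@pytest.fixture(params=['pyarrow', 'fastparquet'])
def engine(request):
    # skips the test instead of failing when the engine isn't installed
    pytest.importorskip(request.param)
    return request.param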
We do on some of our builds. But we also need to make sure pandas works without pyarrow / fastparquet, so not all of our builds have them installed.
Tom, regarding your comment on pyarrow understanding S3FileSystem: I think you're right, because fastparquet seems to not understand them. I'm getting [...]. Should I just add an exception in the [...]?
Hmm, what version of s3fs and fastparquet do you have locally? I pulled your branch, added the [...], and ran:

pytest /Users/taugspurger/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/tests/io/test_parquet.py -k test_s3_roundtrip -v -rsx
=================================================================== test session starts ====================================================================
platform darwin -- Python 3.6.1, pytest-3.3.1, py-1.5.2, pluggy-0.6.0 -- /Users/taugspurger/Envs/pandas-dev/bin/python3.6
cachedir: .cache
rootdir: /Users/taugspurger/Envs/pandas-dev/lib/python3.6/site-packages/pandas, inifile: setup.cfg
plugins: xdist-1.15.0, rerunfailures-2.2, repeat-0.4.1, cov-2.5.1, annotate-1.0.0, hypothesis-3.44.9
collected 44 items

pandas/tests/io/test_parquet.py::TestIntegrationWithS3::test_s3_roundtrip[fastparquet] PASSED [ 50%]
pandas/tests/io/test_parquet.py::TestIntegrationWithS3::test_s3_roundtrip[pyarrow] PASSED [100%]

=================================================================== 42 tests deselected ====================================================================
========================================================= 2 passed, 42 deselected in 1.70 seconds ==========================================================
FYI, you may want to run git diff upstream/master -u -- "*.py" | flake8 --diff to check for linting issues.
[...] obtained with [...]. I find it odd that the test is passing for you. It should have failed, because I now discovered [...]. Could you please verify that the [...]? Last, thanks for the flake8 tip, will do now.
My apologies, I forgot to pass engine. Here's my current diff:

 class TestIntegrationWithS3(Base):
-    def test_s3_roundtrip(self, df_compat, s3_resource):
+    def test_s3_roundtrip(self, df_compat, s3_resource, engine):
         # GH #19134
-        df_compat.to_parquet('s3://pandas-test/test.parquet')
+        key = 's3://pandas-test/test-{}.parquet'.format(engine)
+        df_compat.to_parquet(key, engine=engine)
         expected = df_compat
-        actual = pd.read_parquet('s3://pandas-test/test.parquet')
+        actual = pd.read_parquet(key, engine=engine)
         tm.assert_frame_equal(expected, actual)
-
Yeah, fastparquet fails even with the fix. I think I'll leave it as a known limitation.
@maximveksler this is doable. In the fastparquet writer, do

-        path, _, _ = get_filepath_or_buffer(path)
+        path, _, _ = get_filepath_or_buffer(path, mode='wb')
         with catch_warnings(record=True):
             self.api.write(path, df,
-                           compression=compression, **kwargs)
+                           compression=compression,
+                           open_with=lambda path, mode: path,
+                           **kwargs)

That's basically telling fastparquet "we already have an open file, just use it".
pandas/tests/io/test_parquet.py (outdated)

    if engine == 'pyarrow':
        df_compat.to_parquet('s3://pandas-test/test.parquet', engine)
don't inherit Base nor use a class; make it just a function. Use the fixtures instead of engine directly.
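A hedged sketch of the function-style test being asked for, leaning on the df_compat, s3_resource, and engine fixtures discussed above:

import pandas as pd
import pandas.util.testing as tm

def test_s3_roundtrip(df_compat, s3_resource, engine):
    # GH #19134: round-trip a frame through the mocked pandas-test bucket
    df_compat.to_parquet('s3://pandas-test/test.parquet', engine=engine)
    actual = pd.read_parquet('s3://pandas-test/test.parquet', engine=engine)
    tm.assert_frame_equal(df_compat, actual)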
@jreback how about now? Am I not using the fixture already? Not sure what you mean by that.
@TomAugspurger please see the latest commit. I've added the fastparquet write and it seems to work (had to disable snappy), but I can't seem to get the read to work without changes in [...]
@martindurant what's the easiest way to open a [...]?
No, there is no way to simply pass a file-like object to fastparquet; you can use open_with=s3.open, where s3 is an S3FileSystem.
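A hedged sketch of the approach martindurant describes, assuming s3fs is installed and AWS credentials are configured:

import pandas as pd
import fastparquet
import s3fs

df = pd.DataFrame({'A': [1, 2, 3]})
s3 = s3fs.S3FileSystem()
# fastparquet calls open_with(path, mode) itself to obtain a file object,
# so it opens the S3 key through s3fs rather than the local filesystem
fastparquet.write('pandas-test/test.parquet', df, open_with=s3.open)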
lgtm. ping when changed & green.
@maximveksler keep in mind, PRs can take a while to get merged. We have quite a lot of them and quite a bit of activity. All PRs need review and feedback time.
@jreback got ya, NP.
pandas/io/parquet.py (outdated)
@@ -190,6 +190,10 @@ def __init__(self):
         self.api = fastparquet

     def write(self, df, path, compression='snappy', **kwargs):
         if is_s3_url(path):
In this if block, go from S3File -> url:

path = 's3://{}'.format(path.path)
kwargs['open_with'] = path.s3.open

See if that works?
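Putting those pieces together, a hedged sketch of what that if block could look like inside FastParquetImpl.write; note that open_with has to be captured before path is reassigned to a plain string:

if is_s3_url(path):
    # open the S3File via s3fs in write mode
    path, _, _ = get_filepath_or_buffer(path, mode='wb')
    # capture the filesystem's open before losing the S3File reference
    kwargs['open_with'] = path.s3.open
    # S3File -> url, so fastparquet re-opens it through open_with
    path = 's3://{}'.format(path.path)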
pandas/io/common.py (outdated)
@@ -179,6 +179,8 @@ def get_filepath_or_buffer(filepath_or_buffer, encoding=None,
 filepath_or_buffer : a url, filepath (str, py.path.local or pathlib.Path),
                      or buffer
 encoding : the encoding to use to decode py3 bytes, default is 'utf-8'
+mode : {'rb', 'wb', 'ab'} applies to S3 where a write mandates opening the
no, this is just distracting, as it only applies to s3 and is simply a pass-through option. pls change.
pandas/tests/io/test_parquet.py (outdated)

    # repeat
    to_parquet(df, path, engine, **write_kwargs)
    result = pd.read_parquet(path, engine, **read_kwargs)
pls do it my way
pandas/io/parquet.py (outdated)

-        parquet_file = self.api.ParquetFile(path)
+        if is_s3_url(path):
+            s3, _, _ = get_filepath_or_buffer(path)
+            parquet_file = self.api.ParquetFile(path, open_with=s3.s3.open)
Turns out, guys, that we had both a write and a read problem when using fastparquet through S3FileSystem. We should now have good test coverage of both use cases and a workable implementation (fingers crossed).
doc/source/whatsnew/v0.23.0.txt (outdated)
@@ -415,6 +415,7 @@ I/O
 - Bug in :func:`read_sas` where a file with 0 variables gave an ``AttributeError`` incorrectly. Now it gives an ``EmptyDataError`` (:issue:`18184`)
 - Bug in :func:`DataFrame.to_latex()` where pairs of braces meant to serve as invisible placeholders were escaped (:issue:`18667`)
 - Bug in :func:`read_json` where large numeric values were causing an ``OverflowError`` (:issue:`18842`)
+- Bug in :func:`DataFrame.to_parquet` exception is thrown if write destination is S3 (:issue:`19134`)
where exception was raised
There are several possible exceptions in the fail chain, from 3 different components:
- S3FileSystem
- pyarrow writer
- fastparquet reader & writer

pyarrow, write attempt: FileNotFoundException or ValueError (depending on whether the file exists in S3 or not).
fastparquet, read attempt: exception in attempting to concat str and S3File.
fastparquet, write attempt: exception in attempting to open the path using default_open.
where an exception was raised if the write destination is S3.
@@ -194,14 +194,25 @@ def write(self, df, path, compression='snappy', **kwargs):
         # thriftpy/protocol/compact.py:339:
         # DeprecationWarning: tostring() is deprecated.
         # Use tobytes() instead.
-        path, _, _ = get_filepath_or_buffer(path)
+        if is_s3_url(path):
add a comment on what is happening here
pandas/io/parquet.py (outdated)

-        path, _, _ = get_filepath_or_buffer(path)
-        parquet_file = self.api.ParquetFile(path)
+        if is_s3_url(path):
+            s3, _, _ = get_filepath_or_buffer(path)
is there a reason not to directly call the s3.get_filepath_or_buffer here?
Not being familiar with the pandas code, I'm frankly not sure about the existing design, where common.py#get_filepath_or_buffer returns an s3file, a str, or a buffer, but I'm holding back from making too many changes in my first PR... so I prefer to continue using what is implemented and working in, for example, PyArrowImpl#read, where reading from s3 works.
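For reference, a hedged sketch of the resulting fastparquet read path, as a fragment of FastParquetImpl.read (s3.s3 is the S3FileSystem underlying the opened S3File; the merged code may differ in detail):

if is_s3_url(path):
    # get_filepath_or_buffer hands back an already-open S3File here;
    # fastparquet then re-opens the path itself through open_with
    s3, _, _ = get_filepath_or_buffer(path)
    parquet_file = self.api.ParquetFile(path, open_with=s3.s3.open)
else:
    path, _, _ = get_filepath_or_buffer(path)
    parquet_file = self.api.ParquetFile(path)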
pandas/tests/io/test_parquet.py (outdated)

    # repeat
    to_parquet(df, path, engine, **write_kwargs)
    result = pd.read_parquet(path, engine, **read_kwargs)
this needs updating
very small doc corrections. ping on green.
doc/source/whatsnew/v0.23.0.txt (outdated)
@@ -415,6 +415,7 @@ I/O
 - Bug in :func:`read_sas` where a file with 0 variables gave an ``AttributeError`` incorrectly. Now it gives an ``EmptyDataError`` (:issue:`18184`)
 - Bug in :func:`DataFrame.to_latex()` where pairs of braces meant to serve as invisible placeholders were escaped (:issue:`18667`)
 - Bug in :func:`read_json` where large numeric values were causing an ``OverflowError`` (:issue:`18842`)
+- Bug in :func:`DataFrame.to_parquet` exception is thrown if write destination is S3 (:issue:`19134`)
where an exception was raised if the write destination is S3.
pandas/io/common.py (outdated)
@@ -179,10 +179,11 @@ def get_filepath_or_buffer(filepath_or_buffer, encoding=None,
 filepath_or_buffer : a url, filepath (str, py.path.local or pathlib.Path),
                      or buffer
 encoding : the encoding to use to decode py3 bytes, default is 'utf-8'
+mode : str, optional applies when opening S3 destinations for writing
mode : str, optional
pandas/io/parquet.py (outdated)

            # path is s3:// so we need to open the s3file in 'wb' mode.
            # TODO: Support 'ab'
            path, _, _ = get_filepath_or_buffer(path, mode='wb')
            # And pass the opened s3file to the fastparquet internal impl.
blank line here
lgtm @maximveksler ping on green (may be a while as travis is currently in backlog on mac builds)
@jreback NP, will do.
@jreback, @TomAugspurger I'm refactoring [...]
@maximveksler this PR is ok.
@jreback looks like travis is back online, could you please rerun the build?
Looks like it's queued: https://travis-ci.org/pandas-dev/pandas/pull_requests
thanks @maximveksler, nice patch!
Appreciate the guidance and feedback loops, @TomAugspurger @jreback @martindurant
git diff upstream/master -u -- "*.py" | flake8 --diff
closes #19134