
BUG: read_parquet, to_parquet for s3 destinations #19135


Merged (25 commits) on Jan 18, 2018

Conversation

@maximveksler (Contributor) commented Jan 8, 2018

closes #19134

@TomAugspurger (Contributor) left a comment

Thanks. Could you add

  1. tests (with a reference to the issue)
  2. a release note in whatsnew/v0.23.0.txt

We have some other tests that use moto. Search for those to see how to structure them and lmk if you need any guidance.
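For reference, a moto-backed round trip of this kind might look roughly like the following (a sketch only; the test module, function name, and engine skip are assumptions rather than the final structure used in this PR):

import pandas as pd
import pandas.util.testing as tm
import pytest


def test_parquet_s3_roundtrip():
    # GH 19134: to_parquet / read_parquet against an S3 destination
    boto3 = pytest.importorskip('boto3')
    moto = pytest.importorskip('moto')
    pytest.importorskip('s3fs')
    pytest.importorskip('pyarrow')  # assumes pyarrow as the parquet engine

    with moto.mock_s3():
        # create a fake bucket for s3fs to write into
        conn = boto3.resource('s3', region_name='us-east-1')
        conn.create_bucket(Bucket='pandas-test')

        expected = pd.DataFrame({'A': [1, 2, 3], 'B': 'foo'})
        expected.to_parquet('s3://pandas-test/test.parquet')
        result = pd.read_parquet('s3://pandas-test/test.parquet')
        tm.assert_frame_equal(result, expected)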

@@ -169,7 +169,7 @@ def _stringify_path(filepath_or_buffer):


 def get_filepath_or_buffer(filepath_or_buffer, encoding=None,
-                           compression=None):
+                           compression=None, mode='rb'):

Contributor:

Add this to the Parameters section of the docstring, and note that it's only really used for S3 files.
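A possible shape for that docstring entry (the wording here is a suggestion, not the text that was ultimately committed):

def get_filepath_or_buffer(filepath_or_buffer, encoding=None,
                           compression=None, mode='rb'):
    """
    ...

    Parameters
    ----------
    filepath_or_buffer : a url, filepath (str, py.path.local or pathlib.Path),
                         or buffer
    encoding : the encoding to use to decode py3 bytes, default is 'utf-8'
    mode : str, default 'rb'
        Open mode passed through to s3fs; only really used for S3 files,
        e.g. 'wb' when writing parquet to an s3:// destination.
    """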

@codecov bot commented Jan 8, 2018

Codecov Report

Merging #19135 into master will increase coverage by 0.04%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19135      +/-   ##
==========================================
+ Coverage   91.52%   91.56%   +0.04%     
==========================================
  Files         147      148       +1     
  Lines       48775    48882     +107     
==========================================
+ Hits        44639    44759     +120     
+ Misses       4136     4123      -13
Flag        Coverage Δ
#multiple   89.94% <100%> (+0.04%) ⬆️
#single     41.68% <33.33%> (+0.06%) ⬆️

Impacted Files                       Coverage Δ
pandas/io/common.py                  69.06% <100%> (ø) ⬆️
pandas/io/s3.py                      86.36% <100%> (+1.36%) ⬆️
pandas/io/parquet.py                 71.55% <100%> (+1.65%) ⬆️
pandas/core/dtypes/cast.py           87.98% <0%> (-0.6%) ⬇️
pandas/core/strings.py               98.17% <0%> (-0.3%) ⬇️
pandas/core/reshape/melt.py          97.19% <0%> (-0.06%) ⬇️
pandas/core/indexes/timedeltas.py    90.6% <0%> (-0.04%) ⬇️
pandas/errors/__init__.py            100% <0%> (ø) ⬆️
pandas/core/categorical.py           95.78% <0%> (ø) ⬆️
pandas/core/base.py                  96.77% <0%> (ø) ⬆️
... and 19 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8acdf80...9dbc77c.

@pep8speaks commented Jan 9, 2018

Hello @maximveksler! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on January 17, 2018 at 10:47 Hours UTC

@maximveksler (Contributor Author):

@TomAugspurger any help on why the unit test can't find pyarrow / fastparquet?

@TomAugspurger (Contributor) commented Jan 9, 2018 via email

@maximveksler (Contributor Author):

Doesn't look like a moto issue, more like a unit test environment configuration problem. But I might be wrong here; I only had a quick glance and couldn't spot the issue.

@maximveksler (Contributor Author):

More specifically

================================== FAILURES ===================================
______________________ TestIntegration.test_s3_roundtrip ______________________
[gw0] win32 -- Python 3.6.3 C:\Miniconda3_64\envs\pandas\python.exe
self = <pandas.tests.io.test_s3.TestIntegration object at 0x000000E2853F0400>
    def test_s3_roundtrip(self):
        expected = pd.DataFrame({'A': [1, 2, 3], 'B': 'foo'})
    
        boto3 = pytest.importorskip('boto3')
        moto = pytest.importorskip('moto')
    
        with moto.mock_s3():
            conn = boto3.resource("s3", region_name="us-east-1")
            conn.create_bucket(Bucket="pandas-test")
    
>           expected.to_parquet('s3://pandas-test/test.parquet')
pandas\tests\io\test_s3.py:24: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas\core\frame.py:1680: in to_parquet
    compression=compression, **kwargs)
pandas\io\parquet.py:227: in to_parquet
    return impl.write(df, path, compression=compression, **kwargs)
pandas\io\parquet.py:110: in write
    path, _, _ = get_filepath_or_buffer(path, mode='wb')
pandas\io\common.py:199: in get_filepath_or_buffer
    from pandas.io import s3
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    """ s3 support for remote file interactivity """
    from pandas import compat
    try:
        import s3fs
        from botocore.exceptions import NoCredentialsError
    except:
>       raise ImportError("The s3fs library is required to handle s3 files")
E       ImportError: The s3fs library is required to handle s3 files
pandas\io\s3.py:7: ImportError


class TestS3URL(object):

    def test_is_s3_url(self):
        assert _is_s3_url("s3://pandas/somethingelse.com")
        assert not _is_s3_url("s4://pandas/somethingelse.com")


class TestIntegration(object):

Contributor:


Move this to pandas/tests/io/test_parquet.py and make a new class TestS3, that's marked with tm.network.

Then your test methods should take an argument s3_resource to use that fixture:

def s3_resource(tips_file, jsonl_file):

That will take care of all the skipping / mocking for you. You just have to write the test at that point.
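Put together, the fixture-based test would look roughly like this (a sketch; the class name follows the suggestion above, and tm refers to pandas.util.testing):

class TestS3(object):

    def test_s3_roundtrip(self, df_compat, s3_resource):
        # GH 19134: s3_resource provides the moto-backed 'pandas-test'
        # bucket and handles skipping when boto3 / moto / s3fs are missing
        df_compat.to_parquet('s3://pandas-test/test.parquet')
        actual = pd.read_parquet('s3://pandas-test/test.parquet')
        tm.assert_frame_equal(df_compat, actual)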

@@ -486,3 +486,20 @@ def test_filter_row_groups(self, fp):
                      row_group_offsets=1)
        result = read_parquet(path, fp, filters=[('a', '==', 0)])
        assert len(result) == 1

class TestIntegrationWithS3(Base):
    def test_s3_roundtrip(self):

Contributor:


This should take an s3_resource.

Then remove everything below except for

+            expected = pd.DataFrame({'A': [1, 2, 3], 'B': 'foo'}) 
+            expected.to_parquet('s3://pandas-test/test.parquet')
+            actual = pd.read_parquet('s3://pandas-test/test.parquet')
+
+            tm.assert_frame_equal(actual, expected)
+ 

@@ -6,3 +6,4 @@ class TestS3URL(object):
     def test_is_s3_url(self):
         assert _is_s3_url("s3://pandas/somethingelse.com")
         assert not _is_s3_url("s4://pandas/somethingelse.com")


Contributor:


This may cause the linter to fail, not sure.

@@ -179,6 +179,7 @@ def get_filepath_or_buffer(filepath_or_buffer, encoding=None,
     filepath_or_buffer : a url, filepath (str, py.path.local or pathlib.Path),
                          or buffer
     encoding : the encoding to use to decode py3 bytes, default is 'utf-8'
+    mode : One of 'rb' or 'wb' or 'ab'. default: 'rb'

Contributor:


@jreback thoughts on this parameter? I think it'd be better to just remove it, and hardcode mode='wb' in the call to s3.get_filepath_or_buffer down below. That's essentially what we do for URLs with the BytesIO.


Contributor Author:


@jreback hard coding leads to a failure to read from S3, with an exception from s3fs https://github.com/dask/s3fs/blob/master/s3fs/core.py#L1005

I've decided to pass it all the way down the call chain precisely for this reason. It might be possible to change the s3fs implementation, because as far as I know S3 objects don't have a read/write notion in them, or to split the pandas code into two functions, get_readable_filepath_or_buffer and get_writable_filepath_or_buffer, but I don't feel I know the code base well enough to judge.
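For what it's worth, the split alluded to here would presumably just be thin wrappers along these lines (purely hypothetical; neither helper exists in pandas):

def get_readable_filepath_or_buffer(filepath_or_buffer, encoding=None,
                                    compression=None):
    # hypothetical read-side wrapper: always open (S3) targets in 'rb'
    return get_filepath_or_buffer(filepath_or_buffer, encoding=encoding,
                                  compression=compression, mode='rb')


def get_writable_filepath_or_buffer(filepath_or_buffer, encoding=None,
                                    compression=None):
    # hypothetical write-side wrapper: always open (S3) targets in 'wb'
    return get_filepath_or_buffer(filepath_or_buffer, encoding=encoding,
                                  compression=compression, mode='wb')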


Contributor:


Ah, yes sorry I missed that.

In that case let's update the docstring to be numpydoc compliant: http://numpydoc.readthedocs.io/en/latest/format.html#sections

mode : {'rb', 'wb', 'ab'}


Contributor Author:


NP, my pleasure.

@jreback (Contributor) commented Jan 9, 2018

yeah this parameter should not be in the common call, rather in the s3-specific one

@TomAugspurger (Contributor):

Looks like you'll need to skip the test if the engine isn't available. I think that pyarrow understands S3FileSystem objects, so you should be able to parametrize on engine=['fastparquet', 'pyarrow'].

@TomAugspurger (Contributor):

You should just have to accept an engine argument, and it'll pick up that fixture.
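The engine fixture being referred to is, in rough shape, a parametrized fixture that skips when the library is missing (a sketch; the actual fixture in test_parquet.py may differ in details):

@pytest.fixture(params=['pyarrow', 'fastparquet'])
def engine(request):
    # skip this parametrization if the engine isn't installed
    pytest.importorskip(request.param)
    return request.param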

@maximveksler (Contributor Author):

Hey guys,

It's my 3rd attempt to fix the unit test. Not too excited to spam commit history with CI a/b testing attempts :)

Any way to reproduce the failing CI locally?

@maximveksler (Contributor Author):

@TomAugspurger cool, looking into and thanks!

Q - If the tests are failing because the code can't find pyarrow / fastparquet, and the fixture will skip the test whenever they can't be found, then... won't that defeat the whole purpose of the test?!

Can't we just have the CI server install pyarrow/fastparquet and make them available to pytest instead?

@TomAugspurger (Contributor):

The failure at https://circleci.com/gh/pandas-dev/pandas/8985#tests/containers/2 was because pyarrow / fastparquet weren't installed, and the test wasn't skipped. If you take an engine argument (and pass it through to the writer / reader) it should be OK.

To test locally, uninstall both pyarrow and fastparquet. The test should be skipped.

@TomAugspurger (Contributor):

Can't we just have the CI server install pyarrow/fastparquet and make them available to pytest instead?

We do on some of our builds. But we also need to make sure pandas works without pyarrow / fastparquet, so not all of our builds have them installed.

@maximveksler (Contributor Author):

Tom, regarding your comment on pyarrow understanding S3FileSystem.

I think you're right, because fastparquet doesn't seem to understand them.

I'm getting

============================= test session starts =============================
platform win32 -- Python 3.6.3, pytest-3.3.2, py-1.5.2, pluggy-0.6.0
rootdir: C:\Users\maxim.veksler\source\pandas, inifile: setup.cfg
collected 2 items
test_parquet.py F
pandas\tests\io\test_parquet.py:490 (TestIntegrationWithS3.test_s3_roundtrip[fastparquet])
self = <pandas.tests.io.test_parquet.TestIntegrationWithS3 object at 0x00000211F2652320>
df_compat =    A    B
0  1  foo
1  2  foo
2  3  foo
s3_resource = s3.ServiceResource(), engine = 'fastparquet'

    def test_s3_roundtrip(self, df_compat, s3_resource, engine):
        # GH #19134
>       df_compat.to_parquet('s3://pandas-test/test.parquet', engine)

test_parquet.py:493: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
..\..\core\frame.py:1680: in to_parquet
    compression=compression, **kwargs)
..\..\io\parquet.py:227: in to_parquet
    return impl.write(df, path, compression=compression, **kwargs)
..\..\io\parquet.py:200: in write
    compression=compression, **kwargs)
..\..\..\venv\lib\site-packages\fastparquet\writer.py:802: in write
    compression, open_with, has_nulls, append)
..\..\..\venv\lib\site-packages\fastparquet\writer.py:687: in write_simple
    with open_with(fn, mode) as f:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

f = <S3File pandas-test/test.parquet>, mode = 'wb'

    def default_open(f, mode='rb'):
>       return open(f, mode)
E       TypeError: expected str, bytes or os.PathLike object, not S3File

..\..\..\venv\lib\site-packages\fastparquet\util.py:44: TypeError
.                                                       [100%]

================================== FAILURES ===================================
____________ TestIntegrationWithS3.test_s3_roundtrip[fastparquet] _____________

self = <pandas.tests.io.test_parquet.TestIntegrationWithS3 object at 0x00000211F2652320>
df_compat =    A    B
0  1  foo
1  2  foo
2  3  foo
s3_resource = s3.ServiceResource(), engine = 'fastparquet'

    def test_s3_roundtrip(self, df_compat, s3_resource, engine):
        # GH #19134
>       df_compat.to_parquet('s3://pandas-test/test.parquet', engine)

test_parquet.py:493: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
..\..\core\frame.py:1680: in to_parquet
    compression=compression, **kwargs)
..\..\io\parquet.py:227: in to_parquet
    return impl.write(df, path, compression=compression, **kwargs)
..\..\io\parquet.py:200: in write
    compression=compression, **kwargs)
..\..\..\venv\lib\site-packages\fastparquet\writer.py:802: in write
    compression, open_with, has_nulls, append)
..\..\..\venv\lib\site-packages\fastparquet\writer.py:687: in write_simple
    with open_with(fn, mode) as f:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

f = <S3File pandas-test/test.parquet>, mode = 'wb'

    def default_open(f, mode='rb'):
>       return open(f, mode)
E       TypeError: expected str, bytes or os.PathLike object, not S3File

..\..\..\venv\lib\site-packages\fastparquet\util.py:44: TypeError
------------------------------ Captured log call ------------------------------
core.py                    203 DEBUG    Open S3 connection.  Anonymous: False
===================== 1 failed, 1 passed in 9.85 seconds ======================
Exception ignored in: <bound method S3File.__del__ of <S3File pandas-test/test.parquet>>
Traceback (most recent call last):
  File "C:\Users\maxim.veksler\source\pandas\venv\lib\site-packages\s3fs\core.py", line 1219, in __del__
  File "C:\Users\maxim.veksler\source\pandas\venv\lib\site-packages\s3fs\core.py", line 1200, in close
  File "C:\Users\maxim.veksler\source\pandas\venv\lib\site-packages\s3fs\core.py", line 936, in _call_s3
  File "C:\Users\maxim.veksler\source\pandas\venv\lib\site-packages\s3fs\core.py", line 170, in _call_s3
  File "C:\Users\maxim.veksler\source\pandas\venv\lib\site-packages\botocore\client.py", line 317, in _api_call
  File "C:\Users\maxim.veksler\source\pandas\venv\lib\site-packages\botocore\client.py", line 602, in _make_api_call
  File "C:\Users\maxim.veksler\source\pandas\venv\lib\site-packages\botocore\endpoint.py", line 143, in make_request
  File "C:\Users\maxim.veksler\source\pandas\venv\lib\site-packages\botocore\endpoint.py", line 168, in _send_request
  File "C:\Users\maxim.veksler\source\pandas\venv\lib\site-packages\botocore\endpoint.py", line 152, in create_request
  File "C:\Users\maxim.veksler\source\pandas\venv\lib\site-packages\botocore\hooks.py", line 227, in emit
  File "C:\Users\maxim.veksler\source\pandas\venv\lib\site-packages\botocore\hooks.py", line 210, in _emit
  File "C:\Users\maxim.veksler\source\pandas\venv\lib\site-packages\botocore\signers.py", line 90, in handler
  File "C:\Users\maxim.veksler\source\pandas\venv\lib\site-packages\botocore\signers.py", line 154, in sign
  File "C:\Users\maxim.veksler\source\pandas\venv\lib\site-packages\botocore\auth.py", line 420, in add_auth
  File "C:\Users\maxim.veksler\source\pandas\venv\lib\site-packages\botocore\auth.py", line 354, in add_auth
ImportError: sys.meta_path is None, Python is likely shutting down

Should I just add an exception in the FastParquetImpl for when attempting a write operation to s3?

@TomAugspurger (Contributor):

Hmm, what version of s3fs and fastparquet do you have locally? I pulled your branch, added the engine argument, and they both pass for me.

pytest /Users/taugspurger/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/tests/io/test_parquet.py -k test_s3_roundtrip -v -rsx
=================================================================== test session starts ====================================================================
platform darwin -- Python 3.6.1, pytest-3.3.1, py-1.5.2, pluggy-0.6.0 -- /Users/taugspurger/Envs/pandas-dev/bin/python3.6
cachedir: .cache
rootdir: /Users/taugspurger/Envs/pandas-dev/lib/python3.6/site-packages/pandas, inifile: setup.cfg
plugins: xdist-1.15.0, rerunfailures-2.2, repeat-0.4.1, cov-2.5.1, annotate-1.0.0, hypothesis-3.44.9
collected 44 items

pandas/tests/io/test_parquet.py::TestIntegrationWithS3::test_s3_roundtrip[fastparquet] PASSED                                                        [ 50%]
pandas/tests/io/test_parquet.py::TestIntegrationWithS3::test_s3_roundtrip[pyarrow] PASSED                                                            [100%]

=================================================================== 42 tests deselected ====================================================================
========================================================= 2 passed, 42 deselected in 1.70 seconds ==========================================================

@TomAugspurger (Contributor):

FYI, you may want to run flake8 pandas/tests/io/test_parquet.py. That'll catch any linting errors. I think you need another newline before your new test class.

@maximveksler (Contributor Author):

fastparquet: 0.1.3
pyarrow: 0.8.0

obtained with pd.show_versions()

I find it odd that the test is passing for you. It should have failed, because I've now discovered that FastParquetImpl.write is also missing the get_filepath_or_buffer(path, mode='wb') fix.

Could you please verify that test_s3_roundtrip(self, df_compat, s3_resource, engine) is invoked twice on your end (once for pyarrow and once for fastparquet)? Another difference might be that I'm on a Windows machine (don't ask... :/ )

Last - thanks for the flake8 tip, will do now.

@maximveksler (Contributor Author):

s3fs: 0.1.2 (sorry, it's getting late)

@TomAugspurger (Contributor) commented Jan 9, 2018

My apologies, I forgot to pass engine through to the readers / writers. Fastparquet does indeed fail (probably since its .write needs the mode keyword to go through as well).

Here's my current diff.

+
 class TestIntegrationWithS3(Base):
-    def test_s3_roundtrip(self, df_compat, s3_resource):
+    def test_s3_roundtrip(self, df_compat, s3_resource, engine):
         # GH #19134
-        df_compat.to_parquet('s3://pandas-test/test.parquet')
+        key = 's3://pandas-test/test-{}.parquet'.format(engine)
+        df_compat.to_parquet(key, engine=engine)

         expected = df_compat
-        actual = pd.read_parquet('s3://pandas-test/test.parquet')
+        actual = pd.read_parquet(key, engine=engine)

         tm.assert_frame_equal(expected, actual)
-

@maximveksler (Contributor Author):

Yeah, fastparquet fails even with the fix.

f = <S3File pandas-test/test.parquet>, mode = 'wb'

    def default_open(f, mode='rb'):
>       return open(f, mode)
E       TypeError: expected str, bytes or os.PathLike object, not S3File

..\..\..\venv\lib\site-packages\fastparquet\util.py:44: TypeError

I think I'll leave it as a known limitation.

@TomAugspurger (Contributor) commented Jan 9, 2018

@maximveksler this is doable. In the fastparquet writer, do

-        path, _, _ = get_filepath_or_buffer(path)
+        path, _, _ = get_filepath_or_buffer(path, mode='wb')
         with catch_warnings(record=True):
             self.api.write(path, df,
-                           compression=compression, **kwargs)
+                           compression=compression,
+                           open_with=lambda path, mode: path,
+                           **kwargs)

That's basically telling fastparquet "we already have an open file, just use it".


if engine == 'pyarrow':
    df_compat.to_parquet('s3://pandas-test/test.parquet', engine)


Contributor:


don't inherit Base nor use a class
make it just a function
use the fixtures instead of engine directly


Contributor Author:


@jreback how about now? Am I not using the fixture already? Not sure what you mean by that.

@maximveksler (Contributor Author):

@TomAugspurger please see the latest commit. I've added the fastparquet write and it seems to work (had to disable snappy), but I can't seem to get read working without changes to api.py within fastparquet itself.

@TomAugspurger (Contributor):

@martindurant what's the easiest way to open a fastparquet.ParquetFile(fn) where file is an S3File? Or does it have to be a str? fastparquet.ParquetFile(file, open_with=lambda path, mode: path) doesn't quite work.

@martindurant (Contributor) commented Jan 9, 2018

No, there is no way to simply pass a file-like object to fastparquet.ParquetFile, since in general we are looking for multiple files and/or metadata.

fastparquet.ParquetFile(fn, open_with=s3.open)

where s3 is an S3FileSystem.
You could have any function of the form func(fn, mode).
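For example, a read along those lines might look like this (a minimal sketch, assuming configured AWS credentials and an existing 'pandas-test/test.parquet' key):

import fastparquet
import s3fs

fs = s3fs.S3FileSystem(anon=False)
# open_with can be any callable of the form func(path, mode) returning a file object
pf = fastparquet.ParquetFile('pandas-test/test.parquet', open_with=fs.open)
df = pf.to_pandas()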

@jreback added the IO Data and IO Parquet labels Jan 10, 2018
@jreback (Contributor) commented Jan 14, 2018

lgtm. ping when changed & green.

@jreback (Contributor) commented Jan 14, 2018

@maximveksler keep in mind, PRs can take a while to get merged. We have quite a lot of them and quite a bit of activity. All PRs need review and feedback time.

@maximveksler (Contributor Author):

@jreback got ya, NP..

@@ -190,6 +190,10 @@ def __init__(self):
         self.api = fastparquet

     def write(self, df, path, compression='snappy', **kwargs):
+        if is_s3_url(path):

Contributor:


In this if block, go from S3File -> url

path = 's3://{}'.format(path.path)
kwargs['open_with'] = path.s3.open

See if that works?
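In context, that suggestion would read roughly as follows (a sketch of the reviewed block, not the code that was finally merged; note open_with is set before path is rebound to the plain url):

    def write(self, df, path, compression='snappy', **kwargs):
        if is_s3_url(path):
            # open the S3File up front, then let fastparquet reopen the key
            # itself via s3fs, handing it a plain s3:// url instead of the file
            s3file, _, _ = get_filepath_or_buffer(path, mode='wb')
            kwargs['open_with'] = s3file.s3.open  # S3FileSystem.open(fn, mode)
            path = 's3://{}'.format(s3file.path)
        else:
            path, _, _ = get_filepath_or_buffer(path)
        with catch_warnings(record=True):
            self.api.write(path, df, compression=compression, **kwargs)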

@@ -179,6 +179,8 @@ def get_filepath_or_buffer(filepath_or_buffer, encoding=None,
     filepath_or_buffer : a url, filepath (str, py.path.local or pathlib.Path),
                          or buffer
     encoding : the encoding to use to decode py3 bytes, default is 'utf-8'
+    mode : {'rb', 'wb', 'ab'} applies to S3 where a write mandates opening the

Contributor:


no this is just distracting as it only applies to s3 and is simply a pass thru option. pls change.


# repeat
to_parquet(df, path, engine, **write_kwargs)
result = pd.read_parquet(path, engine, **read_kwargs)

Contributor:


pls do it my way

        parquet_file = self.api.ParquetFile(path)
        if is_s3_url(path):
            s3, _, _ = get_filepath_or_buffer(path)
            parquet_file = self.api.ParquetFile(path, open_with=s3.s3.open)

Contributor Author:


Turns out, guys, that we had both a write and a read problem when using fastparquet through S3Filesystem. We should now have good test coverage of both use cases, and a workable implementation (fingers crossed).

@maximveksler changed the title from "Fixes writing to_parquet for s3 destinations" to "Fix read_parquet, to_parquet for s3 destinations" Jan 16, 2018
@@ -415,6 +415,7 @@ I/O
- Bug in :func:`read_sas` where a file with 0 variables gave an ``AttributeError`` incorrectly. Now it gives an ``EmptyDataError`` (:issue:`18184`)
- Bug in :func:`DataFrame.to_latex()` where pairs of braces meant to serve as invisible placeholders were escaped (:issue:`18667`)
- Bug in :func:`read_json` where large numeric values were causing an ``OverflowError`` (:issue:`18842`)
- Bug in :func:`DataFrame.to_parquet` exception is thrown if write destination is S3 (:issue:`19134`)

Contributor:


where exception was raised


Contributor Author:


There are several possible exceptions in the fail chain, across 3 different components:

  • S3Filesystem,
  • pyarrow writer,
  • fastparquet reader & writer.

pyarrow - write attempt

FileNotFoundException or ValueError (depending on whether the file already exists in S3 or not).

fastparquet - read attempt

Exception when attempting to concat a str and an S3File.

fastparquet - write attempt

Exception when attempting to open the path using default_open.


Contributor:


where an exception was raised if the write destination is S3.

@@ -194,14 +194,25 @@ def write(self, df, path, compression='snappy', **kwargs):
         # thriftpy/protocol/compact.py:339:
         # DeprecationWarning: tostring() is deprecated.
         # Use tobytes() instead.
         path, _, _ = get_filepath_or_buffer(path)

         if is_s3_url(path):

Contributor:


add a comment on what is happening here

        path, _, _ = get_filepath_or_buffer(path)
        parquet_file = self.api.ParquetFile(path)
        if is_s3_url(path):
            s3, _, _ = get_filepath_or_buffer(path)

Contributor:


is there a reason not to directly call the s3.get_filepath_or_buffer here?


Contributor Author:


Not being familiar with the pandas code, I'm frankly not sure about the existing design where common.py#get_filepath_or_buffer returns an s3file, or a str, or a buffer, but I'm holding back from making too many changes in my first PR... so I prefer to continue using what is already implemented and working in, for example, PyArrowImpl#read, where reading from s3 works.


# repeat
to_parquet(df, path, engine, **write_kwargs)
result = pd.read_parquet(path, engine, **read_kwargs)

Contributor:


this needs updating

@jreback (Contributor) left a comment

very small doc corrections. ping on green.

@@ -415,6 +415,7 @@ I/O
- Bug in :func:`read_sas` where a file with 0 variables gave an ``AttributeError`` incorrectly. Now it gives an ``EmptyDataError`` (:issue:`18184`)
- Bug in :func:`DataFrame.to_latex()` where pairs of braces meant to serve as invisible placeholders were escaped (:issue:`18667`)
- Bug in :func:`read_json` where large numeric values were causing an ``OverflowError`` (:issue:`18842`)
- Bug in :func:`DataFrame.to_parquet` exception is thrown if write destination is S3 (:issue:`19134`)

Contributor:


where an exception was raised if the write destination is S3.

@@ -179,10 +179,11 @@ def get_filepath_or_buffer(filepath_or_buffer, encoding=None,
     filepath_or_buffer : a url, filepath (str, py.path.local or pathlib.Path),
                          or buffer
     encoding : the encoding to use to decode py3 bytes, default is 'utf-8'
+    mode : str, optional applies when opening S3 destinations for writing

Contributor:


mode : str, optional

            # path is s3:// so we need to open the s3file in 'wb' mode.
            # TODO: Support 'ab'
            path, _, _ = get_filepath_or_buffer(path, mode='wb')
            # And pass the opened s3file to the fastparquet internal impl.

Contributor:


blank line here

@jreback added this to the 0.23.0 milestone Jan 17, 2018
@jreback (Contributor) commented Jan 17, 2018

lgtm @maximveksler ping on green (may be a while as travis is currently in backlog on mac builds)

@maximveksler (Contributor Author):

@jreback NP, will do.

@maximveksler (Contributor Author):

@jreback, @TomAugspurger I'm refactoring test_parquet.py. Please don't merge before a ping from me.

@jreback (Contributor) commented Jan 17, 2018

@maximveksler this PR is ok
you can do one on top of this as a refactor

@maximveksler changed the title from "Fix read_parquet, to_parquet for s3 destinations" to "BUG: read_parquet, to_parquet for s3 destinations" Jan 17, 2018
@maximveksler (Contributor Author):

@jreback looks like travis is back online, could you please rerun the build ?

@TomAugspurger (Contributor):

Looks like it's queued: https://travis-ci.org/pandas-dev/pandas/pull_requests

@jreback merged commit 6e0927e into pandas-dev:master Jan 18, 2018
@jreback (Contributor) commented Jan 18, 2018

thanks @maximveksler

nice patch!

@maximveksler (Contributor Author) commented Jan 18, 2018

Appreciate the guidance and feedback loops @TomAugspurger @jreback @martindurant

Labels
IO Data, IO Parquet
Development

Successfully merging this pull request may close these issues.

to_parquet fails when S3 is the destination
6 participants