
PyArrow gives ArrowTypeError serializing Pandas nullable Int64 #4168


Closed
jkleint opened this issue Apr 17, 2019 · 7 comments


jkleint commented Apr 17, 2019

With the new Pandas 0.24 nullable integer types, Pandas' .to_parquet() raises an ArrowTypeError. I'm not sure whether this is a Pandas or a PyArrow issue. This is with Python 3.7.2 and pyarrow 0.13.

pd.DataFrame({'i': [1, 2, 3, np.nan]}, dtype='Int64').to_parquet('nullint.parquet')
ArrowTypeError                            Traceback (most recent call last)
<ipython-input-27-bf6213e53d7b> in <module>
----> 1 pd.DataFrame({'i': [1, 2, 3, np.nan]}, dtype='Int64').to_parquet('nullint.parquet')

/python/lib/python3.7/site-packages/pandas/core/frame.py in to_parquet(self, fname, engine, compression, index, partition_cols, **kwargs)
   2201         to_parquet(self, fname, engine,
   2202                    compression=compression, index=index,
-> 2203                    partition_cols=partition_cols, **kwargs)
   2204 
   2205     @Substitution(header='Whether to print column labels, default True')

/python/lib/python3.7/site-packages/pandas/io/parquet.py in to_parquet(df, path, engine, compression, index, partition_cols, **kwargs)
    250     impl = get_engine(engine)
    251     return impl.write(df, path, compression=compression, index=index,
--> 252                       partition_cols=partition_cols, **kwargs)
    253 
    254 

/python/lib/python3.7/site-packages/pandas/io/parquet.py in write(self, df, path, compression, coerce_timestamps, index, partition_cols, **kwargs)
    111         else:
    112             from_pandas_kwargs = {'preserve_index': index}
--> 113         table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
    114         if partition_cols is not None:
    115             self.api.parquet.write_to_dataset(

/python/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()

/python/lib/python3.7/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    466         arrays = [convert_column(c, t)
    467                   for c, t in zip(columns_to_convert,
--> 468                                   convert_types)]
    469     else:
    470         from concurrent import futures

/python/lib/python3.7/site-packages/pyarrow/pandas_compat.py in <listcomp>(.0)
    465     if nthreads == 1:
    466         arrays = [convert_column(c, t)
--> 467                   for c, t in zip(columns_to_convert,
    468                                   convert_types)]
    469     else:

/python/lib/python3.7/site-packages/pyarrow/pandas_compat.py in convert_column(col, ty)
    461             e.args += ("Conversion failed for column {0!s} with type {1!s}"
    462                        .format(col.name, col.dtype),)
--> 463             raise e
    464 
    465     if nthreads == 1:

/python/lib/python3.7/site-packages/pyarrow/pandas_compat.py in convert_column(col, ty)
    455     def convert_column(col, ty):
    456         try:
--> 457             return pa.array(col, type=ty, from_pandas=True, safe=safe)
    458         except (pa.ArrowInvalid,
    459                 pa.ArrowNotImplementedError,

/python/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

/python/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

/python/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_type()

/python/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column i with type Int64')

xhochy commented Apr 23, 2019

@jkleint This is a missing feature in pyarrow. The new types were implemented recently in pandas, and support for them has not yet been added to pyarrow. Can you open an issue about them over at https://issues.apache.org/jira/projects/ARROW/issues ?


wesm commented Apr 23, 2019

@jorisvandenbossche what is pandas's expected memory layout for the new integer array types? The 0.23 -> 0.24 shift will present a bit of a compatibility headache (we'll need a flag for whether to produce the new memory layout if the user has a new enough pandas)

@jorisvandenbossche

The new integers are stored as a plain NumPy array for the values plus a boolean mask array:

In [122]: a = pd.array([1, 2, None], dtype='Int64')

In [123]: a
Out[123]: 
<IntegerArray>
[1, 2, NaN]
Length: 3, dtype: Int64

In [124]: a._data 
Out[124]: array([1, 2, 1])

In [125]: a._mask
Out[125]: array([False, False,  True])

But I am not sure it is up to pyarrow to add functionality to convert those (although you could argue for making an exception for the extension arrays that ship with pandas itself).
We need a general discussion about serialization and Arrow conversion of ExtensionArrays, since other extension array authors (like fletcher, geopandas, cyberpandas, ...) will also want to plug into Arrow (e.g. for Parquet writing), and we can't add all of this to pyarrow itself.


wesm commented May 16, 2019

Can we open a JIRA issue about this and close this issue?

@jorisvandenbossche

I think this is covered by the existing issues https://issues.apache.org/jira/browse/ARROW-5271 and https://issues.apache.org/jira/browse/ARROW-2428, which cover the general ExtensionArray topic. Or would you prefer to have a specific issue for nullable integers (that would be blocked by those issues)?


wesm commented May 20, 2019

Yeah, it would be nice to have an issue specifically about nullable integers to make sure it gets done (it's easy for such a thing to fall through the cracks)
