
PyArrow gives ArrowTypeError serializing Pandas nullable Int64 #4168


Closed
jkleint opened this issue Apr 17, 2019 · 7 comments


jkleint commented Apr 17, 2019

With the new Pandas 0.24 nullable integer types, Pandas' .to_parquet() raises an ArrowTypeError. I'm not sure whether this is a Pandas or a PyArrow issue. This is with Python 3.7.2 and pyarrow 0.13.

pd.DataFrame({'i': [1, 2, 3, np.nan]}, dtype='Int64').to_parquet('nullint.parquet')
ArrowTypeError                            Traceback (most recent call last)
<ipython-input-27-bf6213e53d7b> in <module>
----> 1 pd.DataFrame({'i': [1, 2, 3, np.nan]}, dtype='Int64').to_parquet('nullint.parquet')

/python/lib/python3.7/site-packages/pandas/core/frame.py in to_parquet(self, fname, engine, compression, index, partition_cols, **kwargs)
   2201         to_parquet(self, fname, engine,
   2202                    compression=compression, index=index,
-> 2203                    partition_cols=partition_cols, **kwargs)
   2204 
   2205     @Substitution(header='Whether to print column labels, default True')

/python/lib/python3.7/site-packages/pandas/io/parquet.py in to_parquet(df, path, engine, compression, index, partition_cols, **kwargs)
    250     impl = get_engine(engine)
    251     return impl.write(df, path, compression=compression, index=index,
--> 252                       partition_cols=partition_cols, **kwargs)
    253 
    254 

/python/lib/python3.7/site-packages/pandas/io/parquet.py in write(self, df, path, compression, coerce_timestamps, index, partition_cols, **kwargs)
    111         else:
    112             from_pandas_kwargs = {'preserve_index': index}
--> 113         table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
    114         if partition_cols is not None:
    115             self.api.parquet.write_to_dataset(

/python/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()

/python/lib/python3.7/site-packages/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
    466         arrays = [convert_column(c, t)
    467                   for c, t in zip(columns_to_convert,
--> 468                                   convert_types)]
    469     else:
    470         from concurrent import futures

/python/lib/python3.7/site-packages/pyarrow/pandas_compat.py in <listcomp>(.0)
    465     if nthreads == 1:
    466         arrays = [convert_column(c, t)
--> 467                   for c, t in zip(columns_to_convert,
    468                                   convert_types)]
    469     else:

/python/lib/python3.7/site-packages/pyarrow/pandas_compat.py in convert_column(col, ty)
    461             e.args += ("Conversion failed for column {0!s} with type {1!s}"
    462                        .format(col.name, col.dtype),)
--> 463             raise e
    464 
    465     if nthreads == 1:

/python/lib/python3.7/site-packages/pyarrow/pandas_compat.py in convert_column(col, ty)
    455     def convert_column(col, ty):
    456         try:
--> 457             return pa.array(col, type=ty, from_pandas=True, safe=safe)
    458         except (pa.ArrowInvalid,
    459                 pa.ArrowNotImplementedError,

/python/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib.array()

/python/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_array()

/python/lib/python3.7/site-packages/pyarrow/array.pxi in pyarrow.lib._ndarray_to_type()

/python/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowTypeError: ('Did not pass numpy.dtype object', 'Conversion failed for column i with type Int64')

xhochy commented Apr 23, 2019

@jkleint This is a missing feature in pyarrow. The new types were implemented recently in pandas, and support for them has not yet been added to pyarrow. Can you open an issue about them over at https://issues.apache.org/jira/projects/ARROW/issues ?


wesm commented Apr 23, 2019

@jorisvandenbossche what is pandas's expected memory layout for the new integer array types? The 0.23 -> 0.24 shift will present a bit of a compatibility headache (we'll need a flag for whether to produce the new memory layout if the user has a new enough pandas)

@jorisvandenbossche

The new integers are stored as a plain NumPy array for the values plus a boolean mask array:

In [122]: a = pd.array([1, 2, None], dtype='Int64')

In [123]: a
Out[123]: 
<IntegerArray>
[1, 2, NaN]
Length: 3, dtype: Int64

In [124]: a._data 
Out[124]: array([1, 2, 1])

In [125]: a._mask
Out[125]: array([False, False,  True])

But I am not sure it is up to pyarrow to add functionality to convert those (although you could argue for making an exception for the extension arrays that ship with pandas itself).
We need a general discussion about serialization and Arrow conversion of ExtensionArrays, since other extension array authors (like fletcher, geopandas, cyberpandas, ...) will also want to plug into Arrow (e.g. for Parquet writing), and we can't add all of this to pyarrow itself.


wesm commented May 16, 2019

Can we open a JIRA issue about this and close this issue?

@jorisvandenbossche

I think this is covered by the existing issues https://issues.apache.org/jira/browse/ARROW-5271 and https://issues.apache.org/jira/browse/ARROW-2428, which cover the general ExtensionArray topic. Or would you prefer to have a specific issue for nullable integers (that would be blocked by those issues)?


wesm commented May 20, 2019

Yeah, it would be nice to have an issue specifically about nullable integers to make sure it gets done (it's easy for such a thing to fall through the cracks)
