GitHub Issues for Apache Arrow
I opened a Pandas issue but they closed it and referred me here. I was hoping to use Parquet files as a way to share `pandas.SparseDataFrame` objects, but the current `to_parquet` method fails with columns of different lengths:
```python
import pandas as pd   # v0.22.0
import scipy.sparse   # v1.0.1

rpd = pd.SparseDataFrame(scipy.sparse.random(1000, 1000),
                         columns=list(map(str, range(1000))),
                         default_fill_value=0.0)
rpd.to_parquet('rpd.pq')
```

```
---------------------------------------------------------------------------
ArrowIOError                              Traceback (most recent call last)
<ipython-input-65-1aeaae9e36a0> in <module>()
      4                          columns=list(map(str, range(1000))),
      5                          default_fill_value=0.0)
----> 6 rpd.to_parquet('rpd.pq')
...
ArrowIOError: Column 8 had 4 while previous column had 8
```
Poking around, Pandas is just passing things straight into pyarrow, so I guess there's no support for sparse matrices at the moment? This seems like a nice use case because the columns can be heavily compressed, but the current implementation needs a dense version first, which forces a round trip through enormous memory consumption.
Are there plans to improve this support, or is this not a good use for the format?
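For anyone hitting this today: a minimal sketch of the densify-first round trip described above, under the assumption of a modern pandas where `SparseDataFrame` has been removed and replaced by the `SparseDtype` / `.sparse` accessor. The column names, sizes, and the `rpd.pq` filename follow the snippet above; the densification step is exactly where the memory cost lands.

```python
import numpy as np
import pandas as pd

# Build a frame whose columns are sparse arrays (the modern replacement
# for SparseDataFrame), with 0.0 as the fill value as in the report above.
rng = np.random.default_rng(0)
data = {
    str(i): pd.arrays.SparseArray(
        np.where(rng.random(100) < 0.05, 1.0, 0.0),  # ~5% nonzero
        fill_value=0.0,
    )
    for i in range(5)
}
df = pd.DataFrame(data)

# Parquet has no sparse encoding, so the frame must be densified before
# writing -- this materializes the full dense matrix in memory.
dense = df.sparse.to_dense()

# dense.to_parquet('rpd.pq')  # requires a pyarrow (or fastparquet) install
```

This sidesteps the `ArrowIOError` by never handing pyarrow ragged sparse columns, at the cost of the dense copy the report complains about.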