GitHub Issues for Apache Arrow
I opened a Pandas issue but they closed it and referred me here. I was hoping to use Parquet files as a way to share `pandas.SparseDataFrame` objects, but the current `to_parquet` method fails with columns of different lengths:
```python
import pandas as pd   # v0.22.0
import scipy.sparse   # v1.0.1

rpd = pd.SparseDataFrame(scipy.sparse.random(1000, 1000),
                         columns=list(map(str, range(1000))),
                         default_fill_value=0.0)
rpd.to_parquet('rpd.pq')
```

```
---------------------------------------------------------------------------
ArrowIOError                              Traceback (most recent call last)
<ipython-input-65-1aeaae9e36a0> in <module>()
      4                          columns=list(map(str, range(1000))),
      5                          default_fill_value=0.0)
----> 6 rpd.to_parquet('rpd.pq')
...
ArrowIOError: Column 8 had 4 while previous column had 8
```
Poking around, Pandas is just passing things straight into pyarrow, so I guess there's no support for sparse matrices at the moment? This seems like a nice use case because the columns can be heavily compressed, but the current implementation needs a dense version first, which forces a round trip through enormous memory consumption.
Are there plans to improve this support, or is this not a good use for the format?
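For anyone hitting this today: a minimal sketch of the densify-first round trip described above, under the assumption of a modern pandas where `SparseDataFrame` has been removed and replaced by the `SparseDtype` / `.sparse` accessor. The column names, sizes, and the `rpd.pq` filename follow the snippet above; the densification step is exactly where the memory cost lands.

```python
import numpy as np
import pandas as pd

# Build a frame whose columns are sparse arrays (the modern replacement
# for SparseDataFrame), with 0.0 as the fill value as in the report above.
rng = np.random.default_rng(0)
data = {
    str(i): pd.arrays.SparseArray(
        np.where(rng.random(100) < 0.05, 1.0, 0.0),  # ~5% nonzero
        fill_value=0.0,
    )
    for i in range(5)
}
df = pd.DataFrame(data)

# Parquet has no sparse encoding, so the frame must be densified before
# writing -- this materializes the full dense matrix in memory.
dense = df.sparse.to_dense()

# dense.to_parquet('rpd.pq')  # requires a pyarrow (or fastparquet) install
```

This sidesteps the `ArrowIOError` by never handing pyarrow ragged sparse columns, at the cost of the dense copy the report complains about.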