SparseDataFrame from_dict doesn't adhere to sparse input semantics #21909

matanox · 2018-07-14T08:04:10Z

Code Sample

from pandas import DataFrame, SparseDataFrame

list_of_sparse_semantics = [{0: 1, 4: 1, 22: 1, 1001: 2}, {0: 1, 3: 1}, {0: 1, 55: 1}]

df = DataFrame.from_dict(list_of_sparse_semantics)
sdf = SparseDataFrame.from_dict(list_of_sparse_semantics)

Problem description

SparseDataFrame.from_dict does not appear to follow the sparse semantics of DataFrame.from_dict for this input. Given a list of dicts, DataFrame treats each dict as one sparse row, using its key-value pairs as the column labels and values of that row. SparseDataFrame instead stuffs each dict into a single column/element.
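To make the row semantics concrete, here is a minimal sketch of the behaviour described above. It uses the plain `DataFrame` constructor plus `pd.SparseDtype`, since SparseDataFrame was later removed from pandas (see #28425 in this thread). Treating absent keys as 0 is an assumption made for illustration, and the names `rows`, `dense` and `sparse` are hypothetical.

```python
import pandas as pd

rows = [{0: 1, 4: 1, 22: 1, 1001: 2}, {0: 1, 3: 1}, {0: 1, 55: 1}]

# Each dict becomes one row; keys absent from a dict show up as NaN.
dense = pd.DataFrame(rows)

# Assumption: a missing entry means 0. Fill the gaps, then store every
# column as a sparse array whose fill value is 0.
sparse = dense.fillna(0).astype(pd.SparseDtype("float64", fill_value=0.0))

print(sparse.shape)           # one row per dict, one column per distinct key
print(sparse.sparse.density)  # fraction of entries that are stored explicitly
```

This goes through a dense intermediate frame, so it only illustrates the expected result and is not a memory-efficient construction path.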

Expected Output

sdf.equals(df)

True

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-130-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: 3.2.1
pip: 10.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.14.3
scipy: 0.19.1
xarray: None
IPython: 6.4.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.8.1
s3fs: None
pandas_gbq: None
pandas_datareader: None


matanox · 2018-07-14T09:21:14Z

Depending on my availability whenever this gets a response, I'm happy to contribute a fix/feature for this, given fair guidance on where to touch and what to look out for 😃

matanox · 2018-07-14T12:23:59Z

Also, the following workaround seems slower than one could hope for, given that the input is already sparse when it reaches the SparseDataFrame. It might, however, boil down to my specific scipy usage strategy below.

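from scipy.sparse import coo_matrix
from pandas import SparseDataFrame

# `a`, `num_of_docs` and `lexicon_size` are defined elsewhere (not shown);
# each row of `a` is iterated as (column, value) pairs.

# Flatten the rows into COO triplets: row index, column index, value.
%time i, j, data = zip(*((i, t[0], t[1]) for i, row in enumerate(a) for t in row))

# Build the scipy sparse matrix.
%time m = coo_matrix((data, (i, j)), shape=(num_of_docs, lexicon_size))

# Convert to a SparseDataFrame and fill missing entries with 0.
%time sdf = SparseDataFrame(m).fillna(0)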

CPU times: user 80 ms, sys: 4 ms, total: 84 ms
Wall time: 84.5 ms
CPU times: user 28 ms, sys: 4 ms, total: 32 ms
Wall time: 28.5 ms
CPU times: user 41.9 s, sys: 132 ms, total: 42 s
Wall time: 41.8 s
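For readers on a pandas version without SparseDataFrame, the same triplet-based workaround can be sketched with `pd.DataFrame.sparse.from_spmatrix`, which builds a frame of sparse columns from a scipy matrix. This is a sketch under assumptions: the input is the list of dicts from the code sample at the top, column ids are non-negative integers, and the names `rows`, `row_idx`, `col_idx`, `values` and `n_cols` are hypothetical. No timing claim is made for it.

```python
import pandas as pd
from scipy.sparse import coo_matrix

rows = [{0: 1, 4: 1, 22: 1, 1001: 2}, {0: 1, 3: 1}, {0: 1, 55: 1}]

# Flatten the dicts into COO triplets: row index, column index, value.
row_idx, col_idx, values = zip(
    *((i, col, val) for i, row in enumerate(rows) for col, val in row.items())
)

# Assumption: column ids are non-negative ints, so the width is max id + 1.
n_cols = max(col_idx) + 1
m = coo_matrix((values, (row_idx, col_idx)), shape=(len(rows), n_cols))

# Each column of the result is a sparse array; unset entries read as 0.
sdf = pd.DataFrame.sparse.from_spmatrix(m)

print(sdf.shape)
print(sdf.sparse.density)
```

Unlike the dense route, this never materialises the full matrix, so the width of the frame (1002 columns here) does not drive memory use.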

jschendel added the Sparse (Sparse Data Type) label on Jul 15, 2018

TomAugspurger mentioned this issue on Sep 16, 2019 in "Remove SparseSeries and SparseDataFrame" #28425 (merged)

TomAugspurger closed this as completed in #28425 on Sep 18, 2019
