SparseDataFrame from_dict doesn't adhere to sparse input semantics #21909

Closed
matanox opened this issue Jul 14, 2018 · 2 comments · Fixed by #28425
Labels
Sparse Sparse Data Type

Comments


matanox commented Jul 14, 2018

Code Sample

from pandas import DataFrame, SparseDataFrame

list_of_sparse_semantics = [{0: 1, 4: 1, 22: 1, 1001: 2}, {0: 1, 3: 1}, {0: 1, 55: 1}]

df = DataFrame.from_dict(list_of_sparse_semantics)
sdf = SparseDataFrame.from_dict(list_of_sparse_semantics)

Problem description

SparseDataFrame.from_dict does not seem to follow the sparse-input semantics that DataFrame.from_dict follows. Given a list of dicts, it stuffs each dict into a single column/element, rather than treating the dict's key-value pairs as the sparse content of one row, as DataFrame does.

Expected Output

sdf.equals(df)

True
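
For reference, the row semantics described above can be sketched without SparseDataFrame. This is only an illustration, assuming a pandas version that provides pd.SparseDtype and the DataFrame.sparse accessor (not the 0.20.3 reported below); it builds the dense frame first and then converts it, so it shows the expected result rather than a memory-efficient construction path.

```python
import numpy as np
import pandas as pd

rows = [{0: 1, 4: 1, 22: 1, 1001: 2}, {0: 1, 3: 1}, {0: 1, 55: 1}]

# Each dict becomes one row; its keys become column labels, and
# columns missing from a dict are filled with NaN.
dense = pd.DataFrame(rows).astype("float64")

# Convert every column to a sparse dtype that treats NaN as the fill value.
sparse = dense.astype(pd.SparseDtype("float64", fill_value=np.nan))

# Round-tripping back to dense should reproduce the original frame.
print(sparse.sparse.to_dense().equals(dense))
# Fraction of entries that are actually stored (not the fill value).
print(sparse.sparse.density)
```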

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-130-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: 3.2.1
pip: 10.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.14.3
scipy: 0.19.1
xarray: None
IPython: 6.4.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.8.1
s3fs: None
pandas_gbq: None
pandas_datareader: None

matanox commented Jul 14, 2018

Depending on my availability when this gets a response, I'm happy to contribute a fix/feature for this, given fair guidance on where to touch and what to look out for 😃


matanox commented Jul 14, 2018

Also, the following workaround seems to run slower than one would hope, given that the input is already sparse when it arrives at the SparseDataFrame constructor. It might boil down to my specific scipy usage strategy below, though.

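from scipy.sparse import coo_matrix
from pandas import SparseDataFrame

# a, num_of_docs and lexicon_size are defined elsewhere;
# each row of a is a sequence of (column, value) pairs.
# Flatten the rows into parallel (row index, column index, value) sequences.
%time i, j, data = zip(*((i, t[0], t[1]) for i, row in enumerate(a) for t in row))

# Build a scipy COO matrix from the coordinate triples.
%time m = coo_matrix((data, (i, j)), shape=(num_of_docs, lexicon_size))

# Wrap the matrix in a SparseDataFrame and replace missing entries with 0.
%time sdf = SparseDataFrame(m).fillna(0)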

CPU times: user 80 ms, sys: 4 ms, total: 84 ms
Wall time: 84.5 ms
CPU times: user 28 ms, sys: 4 ms, total: 32 ms
Wall time: 28.5 ms
CPU times: user 41.9 s, sys: 132 ms, total: 42 s
Wall time: 41.8 s
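
As a point of comparison, the same coordinate-based construction can be sketched with the DataFrame.sparse accessor. This assumes a pandas version that provides DataFrame.sparse.from_spmatrix (it is not available in 0.20.3), and the small rows list and num_cols value are made-up stand-ins for a and lexicon_size above. No timing claim is made here.

```python
import pandas as pd
from scipy.sparse import coo_matrix

# Hypothetical stand-in for `a`: each row is a list of (column, value) pairs.
rows = [[(0, 1), (4, 1)], [(0, 1), (3, 1)], [(2, 5)]]
num_rows = len(rows)
num_cols = 5  # stand-in for lexicon_size

# Flatten into parallel row-index, column-index and value sequences.
row_idx, col_idx, values = zip(
    *((r, col, val) for r, row in enumerate(rows) for col, val in row)
)

# Build the COO matrix, then hand it to pandas without densifying it.
m = coo_matrix((values, (row_idx, col_idx)), shape=(num_rows, num_cols))
sdf = pd.DataFrame.sparse.from_spmatrix(m)

print(sdf.dtypes)
print(sdf.sparse.to_dense())
```

Unstored entries come back as 0 rather than NaN here, so no fillna(0) step is needed in this sketch.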
