Groupby Get Groups fails with "OverflowError: value too large to convert to int" #22729

Closed
MaxBenChrist opened this issue Sep 16, 2018 · 10 comments · Fixed by #22805
Labels: Bug, Groupby, Internals

Comments

@MaxBenChrist

MaxBenChrist commented Sep 16, 2018

Hi, I am the maintainer of tsfresh; we calculate features from time series and rely on pandas internally.

Since we open-sourced tsfresh, we have had numerous reports of it crashing on big datasets (100 GB+) but were never able to pin the problem down. I have also tried to reproduce it, but I do not have access to a machine with enough memory at the moment.

Recently, we found the place; it seems to crash at

[x + (y,) for x, y in df.groupby([column_id, column_kind])[column_value]]

df is a dataframe that looks like

        ====  ======  =========
          id  kind          val
        ====  ======  =========
           1  a       -0.21761
           1  a       -0.613667
           1  a       -2.07339
           2  b       -0.576254
           2  b       -1.21924
        ====  ======  =========

and it should get converted into

        [(1, 'a', pd.Series([-0.217610, -0.613667, -2.073386])),
         (2, 'b', pd.Series([-0.576254, -1.219238]))]
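For context, a minimal sketch of that transformation on a toy frame (this runs fine at small scale; the failure only shows up on very large inputs):

import pandas as pd

df = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2],
    "kind": ["a", "a", "a", "b", "b"],
    "val":  [-0.217610, -0.613667, -2.073386, -0.576254, -1.219238],
})

# each group key tuple (id, kind) is extended by the Series of its values
chunks = [x + (y,) for x, y in df.groupby(["id", "kind"])["val"]]
print(chunks[0][:2])  # (1, 'a')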

An example stack trace where this crashes:

data_in_chunks = [x + (y,) for x, y in df.groupby([column_id, column_kind])[column_value]]
File "/home/yuval/pai/projects/pai-tsfresh/pai_tsfresh/feature_extraction/extraction.py", line 217, in 
data_in_chunks = [x + (y,) for x, y in df.groupby([column_id, column_kind])[column_value]]
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 1922, in get_iterator
splitter = self._get_splitter(data, axis=axis)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 1928, in _get_splitter
comp_ids, _, ngroups = self.group_info
File "pandas/_libs/properties.pyx", line 38, in pandas._libs.properties.cache_readonly.get
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2040, in group_info
comp_ids, obs_group_ids = self._get_compressed_labels()
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2056, in _get_compressed_labels
all_labels = [ping.labels for ping in self.groupings]
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2056, in 
all_labels = [ping.labels for ping in self.groupings]
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2750, in labels
self._make_labels()
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2767, in _make_labels
self.grouper, sort=self.sort)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py", line 468, in factorize
table = hash_klass(size_hint or len(values))
File "pandas/_libs/hashtable_class_helper.pxi", line 1005, in pandas._libs.hashtable.StringHashTable.init
OverflowError: value too large to convert to int

See also the discussion in blue-yonder/tsfresh#418

So, I assume that we hit some kind of threshold. Any idea how to make the groupby more robust for bigger datasets?

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-1018-gcp
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 9.0.1
setuptools: 20.7.0
Cython: None
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@mroeschke
Member

Looked at the related thread briefly; how many unique groups are created when df.groupby([column_id, column_kind]) is called?

My guess from the returned stack trace is that pd.factorize is trying to create integer labels for each group, and the total number of groups is more than what int64 can represent?

@mroeschke added the Groupby and Needs Info labels on Sep 17, 2018
@yuval-nardi

yuval-nardi commented Sep 17, 2018

@MaxBenChrist @mroeschke
It fails on table = hash_klass(size_hint or len(values)).
When I do hash_klass(363951638*3) it succeeds:

> hash_klass(363951638*3)
<pandas._libs.hashtable.StringHashTable at 0x1227f0cf0>

When I do hash_klass(363951638*7) it fails:

> hash_klass(363951638*7)
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-28-10a57ee47b42>", line 1, in <module>
    hash_klass(363951638*7)
  File "pandas/_libs/hashtable_class_helper.pxi", line 1229, in pandas._libs.hashtable.StringHashTable.__init__
OverflowError: value too large to convert to int

The 3 and 7 here relate to the 3 and 7 columns mentioned in the above tsfresh thread (blue-yonder/tsfresh#418). 363951638 is the number of rows in my df_rolled dataframe (also mentioned in the thread).

Anyhow, it seems that len(values) is not the number of unique groups but the number of rows of df_rolled times the number of columns to extract. This is also why the extraction succeeded when I did it sequentially, taking 3 columns at a time.
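A quick sanity check of the boundary (2^31 - 1 for a signed 32-bit int):

INT32_MAX = 2**31 - 1       # 2147483647
print(363951638 * 3)        # 1091854914 -> below INT32_MAX, hash_klass succeeds
print(363951638 * 7)        # 2547661466 -> above INT32_MAX, OverflowError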

@MaxBenChrist
Author

MaxBenChrist commented Sep 18, 2018

@yuval-nardi

We internally stack the columns. E.g.

id x y
a 3 10
a 4 11

becomes

id kind value
a x 3
a x 4
a y 10
a y 11

in our internal representation.
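For illustration, the stacking is roughly what pd.melt does (a sketch; our internal code differs in the details):

import pandas as pd

wide = pd.DataFrame({"id": ["a", "a"], "x": [3, 4], "y": [10, 11]})
stacked = wide.melt(id_vars="id", var_name="kind", value_name="value")
# stacked has n_rows * n_value_columns rows -- this product is what
# ends up as len(values) in pd.factorize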

Hence, we hit hash_klass(363951638*7) with the 7 columns. So the maximal input size of pd.factorize is the issue in this case.

@mroeschke

Is there any way we can circumvent that threshold or replace the df.groupby([column_id, column_kind]) call with an equivalent expression?

@mroeschke
Member

Why does the data need to be pivoted to a long format?

The size of the data already seems to be pushing the limits of pandas, unfortunately; I would imagine processing this data needs something like dask: http://dask.pydata.org/en/latest/
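Something along these lines (an untested sketch; npartitions and the per-group function are placeholders):

import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=100)
result = ddf.groupby([column_id, column_kind])[column_value].apply(
    extract_features_for_group,  # hypothetical per-group function
    meta=("value", "f8"),
)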

cc @jreback

@chris-b1
Contributor

khash, our hash table implementation, uses a 32-bit integer throughout for its sizes. Maybe we could change that, but it would be quite invasive.

But as long as your number of distinct values fits into 32-bit space, the current implementation should work. So as a fix, do you want to try capping the size hint (to something like INT32_MAX - 1) and see if that fixes it?

table = hash_klass(size_hint or len(values))

@MaxBenChrist
Author

MaxBenChrist commented Sep 19, 2018

But as long as your number of distinct values fits into 32-bit space, the current implementation should work. So as a fix, do you want to try capping the size hint (to something like INT32_MAX - 1) and see if that fixes it?

I am pretty sure that this is the reason for the bug. However, we can deal with it: we can split the stacked dataframe into pieces and then calculate the chunk list piece by piece, along the lines of the sketch below.
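A rough sketch of that workaround (the number of splits is a placeholder; splitting by id keeps groups intact):

import numpy as np

data_in_chunks = []
for id_block in np.array_split(df[column_id].unique(), 10):
    part = df[df[column_id].isin(id_block)]
    data_in_chunks.extend(
        x + (y,) for x, y in part.groupby([column_id, column_kind])[column_value]
    )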

Thanks for the help 👍

khash, our hash table implementation, uses a 32-bit integer throughout for its sizes. Maybe we could change that, but it would be quite invasive.

Do you see this happening in the near future?

I will check the performance of dask for this task; we already use it in tsfresh anyway.

Cross-referencing blue-yonder/tsfresh#418.

@chris-b1
Contributor

Cool. In case it's not clear, we'd probably take a PR into pandas that caps the size_hint. As the name implies, it's only a hint to khash to pre-allocate a bunch of memory; ultimately khash will reallocate as necessary based on the actual data.
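In essence, the fix is just a cap at the call site in pandas/core/algorithms.py (constant name and value here are illustrative; see #22805 for the actual change):

_SIZE_HINT_LIMIT = (1 << 20) + 7  # illustrative cap; khash grows as needed anyway

table = hash_klass(min(size_hint or len(values), _SIZE_HINT_LIMIT))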

@MaxBenChrist
Author

So, to sum up, the current maximal number of groups in a pandas groupby object is

2^31-1 = 2,147,483,647.

If one creates a groupby object with more groups, it will fail with "OverflowError: value too large to convert to int".

Regarding a PR: I am not really versed in C, so I have no idea where I would start. I am also not sure I understand what would have to be changed inside pandas; you probably don't want to change that constant for a corner case (extremely large groupby objects).

Cool. In case it's not clear, we'd probably take a PR into pandas that caps the size_hint. As the name implies, it's only a hint to khash to pre-allocate a bunch of memory; ultimately khash will reallocate as necessary based on the actual data.

Can you clarify this a little bit: what is size_hint? A C method?

@chris-b1
Contributor

@MaxBenChrist, could you or one of your users try PR #22805? I think it should fix this, but I don't have a machine with enough memory to run a full reproducing case.

To expand a bit, the underlying limitation is that the number of buckets (~distinct values plus some empty space) in the hash table is limited to UINT32_MAX (2^32 - 1). But your code wasn't making it that far; it was overflowing on the size_hint parameter, which is only used for the initial allocation. So capping it won't impact the normal use case.

@gfyoung added the Bug and Internals labels and removed the Needs Info label on Sep 23, 2018
@MaxBenChrist
Author

@chris-b1 Thanks for the PR! I am also lacking a big-memory machine; maybe @yuval-nardi can try it?

Otherwise I will have to set up a VM this weekend.
