Groupby Get Groups fails with "OverflowError: value too large to convert to int" #22729

Closed
MaxBenChrist opened this issue Sep 16, 2018 · 10 comments · Fixed by #22805
Labels: Bug, Groupby, Internals

Comments

@MaxBenChrist

MaxBenChrist commented Sep 16, 2018

Hi, I am the maintainer of tsfresh; we calculate features from time series and rely on pandas internally.

Since we open-sourced tsfresh, we have had numerous reports of it crashing on big datasets (100 GB+) but were never able to pin the problem down. I have also tried to reproduce it, but I do not have access to a machine with enough memory at the moment.

Recently, we found the place; it seems to crash at

[x + (y,) for x, y in df.groupby([column_id, column_kind])[column_value]]

df is a dataframe that looks like

        ====  ======  =========
          id  kind          val
        ====  ======  =========
           1  a       -0.21761
           1  a       -0.613667
           1  a       -2.07339
           2  b       -0.576254
           2  b       -1.21924
        ====  ======  =========

and it should get converted into

        [(1, 'a', pd.Series([-0.217610, -0.613667, -2.073386])),
         (2, 'b', pd.Series([-0.576254, -1.219238]))]
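For context, a minimal sketch of that transformation on a toy frame (this runs fine at small scale; the failure only shows up on very large inputs):

import pandas as pd

df = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2],
    "kind": ["a", "a", "a", "b", "b"],
    "val":  [-0.217610, -0.613667, -2.073386, -0.576254, -1.219238],
})

# each group key tuple (id, kind) is extended by the Series of its values
chunks = [x + (y,) for x, y in df.groupby(["id", "kind"])["val"]]
print(chunks[0][:2])  # (1, 'a')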

An example stack trace where this crashes:

data_in_chunks = [x + (y,) for x, y in df.groupby([column_id, column_kind])[column_value]]
File "/home/yuval/pai/projects/pai-tsfresh/pai_tsfresh/feature_extraction/extraction.py", line 217, in 
data_in_chunks = [x + (y,) for x, y in df.groupby([column_id, column_kind])[column_value]]
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 1922, in get_iterator
splitter = self._get_splitter(data, axis=axis)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 1928, in _get_splitter
comp_ids, _, ngroups = self.group_info
File "pandas/_libs/properties.pyx", line 38, in pandas._libs.properties.cache_readonly.get
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2040, in group_info
comp_ids, obs_group_ids = self._get_compressed_labels()
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2056, in _get_compressed_labels
all_labels = [ping.labels for ping in self.groupings]
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2056, in 
all_labels = [ping.labels for ping in self.groupings]
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2750, in labels
self._make_labels()
File "/usr/local/lib/python3.6/dist-packages/pandas/core/groupby.py", line 2767, in _make_labels
self.grouper, sort=self.sort)
File "/usr/local/lib/python3.6/dist-packages/pandas/core/algorithms.py", line 468, in factorize
table = hash_klass(size_hint or len(values))
File "pandas/_libs/hashtable_class_helper.pxi", line 1005, in pandas._libs.hashtable.StringHashTable.init
OverflowError: value too large to convert to int

See also the discussion in blue-yonder/tsfresh#418

So, I assume that we hit some kind of threshold. Any idea how to make the groupby more robust for bigger datasets?

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-1018-gcp
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 9.0.1
setuptools: 20.7.0
Cython: None
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@mroeschke
Member

Looked at the related thread briefly; how many unique groups are created when df.groupby([column_id, column_kind]) is called?

My guess from the returned stack trace is that pd.factorize is trying to create integer labels for each group, and the total number of groups is more than what int64 can represent?

@mroeschke added the Groupby and Needs Info labels on Sep 17, 2018
@yuval-nardi

yuval-nardi commented Sep 17, 2018

@MaxBenChrist @mroeschke
It fails on table = hash_klass(size_hint or len(values)).
When I do hash_klass(363951638*3) it succeeds:

> hash_klass(363951638*3)
<pandas._libs.hashtable.StringHashTable at 0x1227f0cf0>

When I do hash_klass(363951638*7) it fails:

> hash_klass(363951638*7)
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-28-10a57ee47b42>", line 1, in <module>
    hash_klass(363951638*7)
  File "pandas/_libs/hashtable_class_helper.pxi", line 1229, in pandas._libs.hashtable.StringHashTable.__init__
OverflowError: value too large to convert to int

The 3 and 7 here relate to the 3 and 7 columns mentioned in the above tsfresh thread (blue-yonder/tsfresh#418). 363951638 is the number of rows in my df_rolled dataframe (also mentioned in the thread).

Anyhow, it seems that len(values) is not the number of unique groups but the number of rows of df_rolled times the number of columns to extract. This is also why the extraction succeeded when I did it sequentially, taking 3 columns at a time.
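A quick sanity check of the boundary (2^31 - 1 for a signed 32-bit int):

INT32_MAX = 2**31 - 1       # 2147483647
print(363951638 * 3)        # 1091854914 -> below INT32_MAX, hash_klass succeeds
print(363951638 * 7)        # 2547661466 -> above INT32_MAX, OverflowError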

@MaxBenChrist
Author

MaxBenChrist commented Sep 18, 2018

@yuval-nardi

We internally stack the columns. E.g.

id x y
a 3 10
a 4 11

becomes

id kind value
a x 3
a x 4
a y 10
a y 11

in our internal representation.
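For illustration, the stacking is roughly what pd.melt does (a sketch; our internal code differs in the details):

import pandas as pd

wide = pd.DataFrame({"id": ["a", "a"], "x": [3, 4], "y": [10, 11]})
stacked = wide.melt(id_vars="id", var_name="kind", value_name="value")
# stacked has n_rows * n_value_columns rows -- this product is what
# ends up as len(values) in pd.factorize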

Hence, we hit hash_klass(363951638*7) with the 7 columns. So the maximal input size of pd.factorize is the issue in this case.

@mroeschke

Is there any way we can circumvent that threshold or replace the df.groupby([column_id, column_kind]) call with an equivalent expression?

@mroeschke
Member

Why does the data need to be pivoted to a long format?

The size of the data already seems to be pushing the limits of pandas, unfortunately; I would imagine processing this data needs something like dask: http://dask.pydata.org/en/latest/
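Something along these lines (an untested sketch; npartitions and the per-group function are placeholders):

import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=100)
result = ddf.groupby([column_id, column_kind])[column_value].apply(
    extract_features_for_group,  # hypothetical per-group function
    meta=("value", "f8"),
)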

cc @jreback

@chris-b1
Contributor

khash, our hash table implementation, uses a 32-bit integer throughout for its sizes. Maybe we could change that, but it would be quite invasive.

But as long as your number of distinct values fits into 32-bit space, the current implementation should work. So as a fix, do you want to try capping the size hint (to something like INT32_MAX - 1) and see if that fixes it?

table = hash_klass(size_hint or len(values))

@MaxBenChrist
Author

MaxBenChrist commented Sep 19, 2018

But as long as your number of distinct values fits into 32-bit space, the current implementation should work. So as a fix, do you want to try capping the size hint (to something like INT32_MAX - 1) and see if that fixes it?

I am pretty sure that this is the reason for the bug. However, we can deal with it: we can split the stacked dataframe into pieces and then calculate the chunk list piece by piece, along the lines of the sketch below.
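A rough sketch of that workaround (the number of splits is a placeholder; splitting by id keeps groups intact):

import numpy as np

data_in_chunks = []
for id_block in np.array_split(df[column_id].unique(), 10):
    part = df[df[column_id].isin(id_block)]
    data_in_chunks.extend(
        x + (y,) for x, y in part.groupby([column_id, column_kind])[column_value]
    )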

Thanks for the help 👍

khash, our hash table implementation, uses a 32-bit integer throughout for its sizes. Maybe we could change that, but it would be quite invasive.

Do you see this happening in the near future?

I will check the performance of dask for this task; we already use it in tsfresh anyway.

Cross-referencing blue-yonder/tsfresh#418.

@chris-b1
Contributor

Cool. In case it's not clear, we'd probably take a PR into pandas that caps the size_hint. As the name implies, it's only a hint to khash to pre-allocate a bunch of memory; ultimately khash will reallocate as necessary based on the actual data.
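In essence, the fix is just a cap at the call site in pandas/core/algorithms.py (constant name and value here are illustrative; see #22805 for the actual change):

_SIZE_HINT_LIMIT = (1 << 20) + 7  # illustrative cap; khash grows as needed anyway

table = hash_klass(min(size_hint or len(values), _SIZE_HINT_LIMIT))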

@MaxBenChrist
Author

So, to sum up, the current maximal number of groups in a pandas groupby object is

2^31-1 = 2,147,483,647.

If one creates a groupby object with more groups, it will fail with "OverflowError: value too large to convert to int".

Regarding a PR: I am not really versed in C, so I have no idea where I would start. I am also not sure I understand what would have to be changed inside pandas; you probably don't want to change that constant for a corner case (extremely large groupby objects).

Cool. In case it's not clear, we'd probably take a PR into pandas that caps the size_hint. As the name implies, it's only a hint to khash to pre-allocate a bunch of memory; ultimately khash will reallocate as necessary based on the actual data.

Can you clarify this a little bit: what is size_hint? A C method?

@chris-b1
Contributor

@MaxBenChrist, could you or one of your users try PR #22805? I think it should fix this, but I don't have a machine with enough memory to run a full reproducing case.

To expand a bit, the underlying limitation is that the number of buckets (~distinct values plus some empty space) in the hash table is limited to UINT32_MAX (2^32 - 1). But your code wasn't making it that far; it was overflowing on the size_hint parameter, which is only used for the initial allocation. So capping it won't impact the normal use case.

@gfyoung added the Bug and Internals labels and removed the Needs Info label on Sep 23, 2018
@MaxBenChrist
Author

@chris-b1 Thanks for the PR! I am also lacking a big-memory machine; maybe @yuval-nardi can try it?

Otherwise I will have to set up a VM this weekend.
