Groupby Get Groups fails with "OverflowError: value too large to convert to int" #22729
Comments
Looked at the related thread briefly; how many unique groups are created when you run this? My guess from the returned stack trace is that the number of distinct groups is overflowing a 32-bit integer somewhere.
@MaxBenChrist @mroeschke
When I do
The 3 and 7 here relate to the 3 and 7 columns mentioned in the above tsfresh thread (blue-yonder/tsfresh#418). Anyhow, it seems that the number of columns matters here.
We internally stack the columns, i.e. every value column of the wide input becomes extra rows in a long frame in our internal representation, so the row count is multiplied by the number of columns. Hence, we hit that threshold. Is there any way we can circumvent it, or replace the underlying hash table?
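A minimal sketch of what this stacking looks like (the column names here are hypothetical, not tsfresh's actual ones):

```python
import pandas as pd

# Hypothetical wide input: one row per (id, time), one value column per kind.
wide = pd.DataFrame({
    "id":   [1, 1, 2, 2],
    "time": [0, 1, 0, 1],
    "F_x":  [0.1, 0.2, 0.3, 0.4],
    "F_y":  [1.0, 1.1, 1.2, 1.3],
})

# Stacking turns every value column into extra rows, so the long frame has
# (number of rows) x (number of value columns) entries to group over.
long = wide.melt(id_vars=["id", "time"], var_name="kind", value_name="value")
print(long)
```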
Why does the data need to be pivoted to a long format? The size of the data seems to already be pushing the limits of pandas, unfortunately; I would imagine processing this data needs something like dask: http://dask.pydata.org/en/latest/ cc @jreback
khash, our hash table implementation, uses a 32-bit integer throughout for its sizes. Maybe we could change that, but it would be quite invasive. As long as your number of distinct values fits into 32-bit space, the current implementation should work. So as a fix, we could try capping the size hint passed in pandas/core/algorithms.py (line 473 at 1a12c41).
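A minimal sketch of the kind of cap being suggested (the constant name and value are assumptions for illustration, not pandas' actual code):

```python
# Assumed cap for illustration; khash grows its table as needed, so the hint
# only has to be reasonable, not exact.
_SIZE_HINT_LIMIT = (1 << 20) + 7

def capped_size_hint(n_values: int) -> int:
    # Keep the hint small enough that converting it to a C int can never
    # raise "OverflowError: value too large to convert to int".
    return min(n_values, _SIZE_HINT_LIMIT)
```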
I am pretty sure that this is the reason for the bug. However, we can deal with that. We can split the stacked dataframe into pieces and then calculate the chunk list. Thanks for the help 👍
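A rough sketch of that workaround (column names and the split count are hypothetical): split the stacked frame by id and group each piece separately, so no single groupby sees more distinct keys than the 32-bit limit allows.

```python
import numpy as np
import pandas as pd

def iter_groups_in_chunks(stacked: pd.DataFrame, n_splits: int = 100):
    """Yield (key, sub_frame) pairs without building one huge groupby."""
    ids = stacked["id"].unique()
    # Split the ids into blocks and group each block on its own, so the
    # number of distinct (id, kind) groups per call stays manageable.
    for id_block in np.array_split(ids, n_splits):
        piece = stacked[stacked["id"].isin(id_block)]
        for key, group in piece.groupby(["id", "kind"]):
            yield key, group
```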
Do you see this happening in the near future? I will check the performance of dask for this task; we already use it in tsfresh anyway. Marking this in blue-yonder/tsfresh#418.
Cool. In case it's not clear, we'd probably take a PR into pandas that caps the size hint.
So, to sum up, the current maximal number of groups in a pandas groupby object is bounded by the 32-bit integer sizes used in khash.
If one creates a groupby object with more groups, it will fail with "OverflowError: value too large to convert to int". Regarding a PR: I am not really well-versed in C, so I have no idea where I would start. I am not even sure I understand what would have to be changed inside pandas; you probably don't want to change that constant for a corner case (extremely large groupby objects).
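A rough illustration (not pandas code) of where that ceiling comes from:

```python
# The size hint handed to the hash table ends up converted to a C int,
# which tops out at 2**31 - 1 on the platforms pandas supports.
C_INT_MAX = 2**31 - 1          # 2147483647

n_groups = 5_000_000_000       # hypothetical group count from a 100 GB+ dataset
if n_groups > C_INT_MAX:
    print("would raise: OverflowError: value too large to convert to int")
```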
Can you clarify this a little bit? What exactly are you referring to here?
@MaxBenChrist, could you or a user try the PR #22805? I think it should fix this, but I don't have a machine with enough memory for a full reproducing case. To expand a bit, the underlying limitation is that the number of buckets (~distinct values plus some empty space) in the hash table is limited to UINT32_MAX (2^32). But your code wasn't making it that far; it was overflowing earlier, on the size hint passed to the hash table.
@chris-b1 Thanks for the PR! I am also lacking a big-memory machine; maybe @yuval-nardi can try it? Otherwise I will have to set up a VM this weekend.
Hi, I am the maintainer of tsfresh, we calculate features from time series and rely on pandas internally.
Since we open-sourced tsfresh, we have had numerous reports of tsfresh crashing on big datasets but were never able to pin it down. The errors seem to occur for big datasets (100 GB+). I also tried to reproduce it but do not have access to a machine with enough memory at the moment.
Recently, we found the place; it seems to crash at
df is a dataframe that looks like
and it should get converted into
An example stack trace where this crashes is
See also the discussion in blue-yonder/tsfresh#418
So, I assume that we hit some kind of threshold. Any idea how to make the groupby more robust for bigger datasets?
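A hypothetical sketch of a call with this shape (the column names are assumed from tsfresh's long format, and it needs a machine with a lot of memory):

```python
import numpy as np
import pandas as pd

# Hypothetical reproduction: a long-format frame with more distinct
# (id, kind) combinations than fit in a signed 32-bit integer.
n = 2_200_000_000  # > 2**31 - 1 distinct ids; requires tens of GB of memory
df = pd.DataFrame({
    "id":    np.arange(n, dtype=np.int64),
    "kind":  "x",
    "value": 0.0,
})

# Building the group mapping is where the reported
# "OverflowError: value too large to convert to int" is expected to appear.
groups = df.groupby(["id", "kind"]).groups
```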
Output of
pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-1018-gcp
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: None
pip: 9.0.1
setuptools: 20.7.0
Cython: None
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None