BUG: pandas.DataFrame.interpolate fails with high value of limit argument #34936


Open
2 of 3 tasks
monstrorivas opened this issue Jun 22, 2020 · 8 comments
Labels
32bit 32-bit systems Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

Comments

@monstrorivas

monstrorivas commented Jun 22, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
import numpy as np
df = pd.DataFrame([1]*500000)
df.iloc[1000:50000] = np.nan
df.interpolate(method='linear', limit_direction='both', limit=None)  # This runs fine even though it effectively fills runs longer than 5000 data points
df.interpolate(method='linear', limit_direction='both', limit=5000)  # This produces an error

Problem description

An error is produced when specifying a large limit in pandas.DataFrame.interpolate.
The error is NOT present in pandas 1.0.1, but it is present at least in 1.0.4 and 1.0.5.
If limit is set to None there is no error, even when the runs of consecutive NaNs being interpolated are longer than the limit value that fails.

Error:

ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.

The error occurs with Python 3.7 but not with Python 3.6.

Expected Output

The expected output is what pandas v1.0.1 produces.

In Python 3.6, specifying a large value of limit doesn't result in a ValueError.

This runs in v1.0.1:

import pandas as pd
import numpy as np
df = pd.DataFrame([1]*500000)
df.iloc[1000:50000] = np.nan
dff = df.interpolate(method='linear', limit_direction='both', limit=5000)
assert dff.isna().sum().values == 39000
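(Rows 1000–49999 are 49,000 NaNs; with limit=5000 applied from each direction, 49,000 − 2 × 5,000 = 39,000 NaNs remain, which is what the assert checks.)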

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 32
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : None.None

pandas : 1.0.5
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.1.1
Cython : 0.29.20
pytest : 5.4.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.4.3
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@monstrorivas monstrorivas added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 22, 2020
@monstrorivas
Author

monstrorivas commented Jun 22, 2020

Upon further testing, it seems that the issue doesn't come from the pandas version but from the Python version.

It looks like it works just fine with Python 3.6 but produces the error with 3.7. I updated the original description above.

@simonjayhawkins
Member

Thanks @monstrorivas for the report. Not able to reproduce on Python 3.7.7 on Linux or Python 3.8.3 on Windows.

Can you post the full traceback?

@simonjayhawkins simonjayhawkins added Needs Info Clarification about behavior needed to assess issue Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 23, 2020
@monstrorivas
Author

Here's the traceback of the error.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-ef5cfa224e91> in <module>
      4 df.iloc[1000:50000] =np.nan
      5 df.interpolate(method='linear', limit_direction='both', limit=None)  # This runs fine eventhough the limit is effectively > 5000 datapoints
----> 6 df.interpolate(method='linear', limit_direction='both', limit=5000)  # This produces an error

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\pandas\core\generic.py in interpolate(self, method, axis, limit, inplace, limit_direction, limit_area, downcast, **kwargs)
   7015             inplace=inplace,
   7016             downcast=downcast,
-> 7017             **kwargs,
   7018         )
   7019 

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\pandas\core\internals\managers.py in interpolate(self, **kwargs)
    568 
    569     def interpolate(self, **kwargs):
--> 570         return self.apply("interpolate", **kwargs)
    571 
    572     def shift(self, **kwargs):

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, filter, **kwargs)
    440                 applied = b.apply(f, **kwargs)
    441             else:
--> 442                 applied = getattr(b, f)(**kwargs)
    443             result_blocks = _extend_blocks(applied, result_blocks)
    444 

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\pandas\core\internals\blocks.py in interpolate(self, method, axis, index, values, inplace, limit, limit_direction, limit_area, fill_value, coerce, downcast, **kwargs)
   1174             inplace=inplace,
   1175             downcast=downcast,
-> 1176             **kwargs,
   1177         )
   1178 

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\pandas\core\internals\blocks.py in _interpolate(self, method, index, values, fill_value, axis, limit, limit_direction, limit_area, inplace, downcast, **kwargs)
   1270 
   1271         # interp each column independently
-> 1272         interp_values = np.apply_along_axis(func, axis, data)
   1273 
   1274         blocks = [self.make_block_same_class(interp_values)]

<__array_function__ internals> in apply_along_axis(*args, **kwargs)

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\numpy\lib\shape_base.py in apply_along_axis(func1d, axis, arr, *args, **kwargs)
    377     except StopIteration:
    378         raise ValueError('Cannot apply_along_axis when any iteration dimensions are 0')
--> 379     res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
    380 
    381     # build a buffer for storing evaluations of func1d.

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\pandas\core\internals\blocks.py in func(x)
   1266                 fill_value=fill_value,
   1267                 bounds_error=False,
-> 1268                 **kwargs,
   1269             )
   1270 

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\pandas\core\missing.py in interpolate_1d(xvalues, yvalues, method, limit, limit_direction, limit_area, fill_value, bounds_error, order, **kwargs)
    244     else:
    245         # both directions... just use _interp_limit
--> 246         preserve_nans = set(_interp_limit(invalid, limit, limit))
    247 
    248     # if limit_area is set, add either mid or outside indices

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\pandas\core\missing.py in _interp_limit(invalid, fw_limit, bw_limit)
    651             f_idx = set(np.where(invalid)[0])
    652         else:
--> 653             f_idx = inner(invalid, fw_limit)
    654 
    655     if bw_limit is not None:

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\pandas\core\missing.py in inner(invalid, limit)
    640     def inner(invalid, limit):
    641         limit = min(limit, N)
--> 642         windowed = _rolling_window(invalid, limit + 1).all(1)
    643         idx = set(np.where(windowed)[0] + limit) | set(
    644             np.where((~invalid[: limit + 1]).cumsum() == 0)[0]

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\pandas\core\missing.py in _rolling_window(a, window)
    682     shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    683     strides = a.strides + (a.strides[-1],)
--> 684     return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\numpy\lib\stride_tricks.py in as_strided(x, shape, strides, subok, writeable)
    101         interface['strides'] = tuple(strides)
    102 
--> 103     array = np.asarray(DummyArray(interface, base=x))
    104     # The route via `__interface__` does not preserve structured
    105     # dtypes. Since dtype should remain unchanged, we set it explicitly.

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\numpy\core\_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.

@simonjayhawkins
Member

Thanks @monstrorivas for the detail. There's an open PR, #34727, that no longer uses the function where the error originates, _rolling_window(a, window) in pandas\core\missing.py.
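
For reference, the failing construction can be reproduced in isolation. The sketch below reuses the _rolling_window body shown in the traceback; on a 32-bit build the nominal size of the strided view exceeds NumPy's addressable limit even though no data is actually copied:

import numpy as np

def _rolling_window(a, window):
    # Same construction as in pandas/core/missing.py (see traceback):
    # a view with one row per sliding-window position; nothing is copied.
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

invalid = np.zeros(500_000, dtype=bool)        # the NaN mask from the example
windowed = _rolling_window(invalid, 5000 + 1)  # window = limit + 1

# Nominal view size: 495,000 * 5,001 * 1 byte ~= 2.48 GB > 2**31 - 1 bytes,
# so 32-bit NumPy raises "array is too big" inside as_strided.
print(windowed.shape)  # (495000, 5001) on 64-bit; ValueError on 32-bit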

@simonjayhawkins simonjayhawkins removed the Needs Info Clarification about behavior needed to assess issue label Jun 23, 2020
@monstrorivas
Author

monstrorivas commented Jun 23, 2020

Thanks @simonjayhawkins for the information on that PR.

After your comment, I did more testing on different systems. I can confirm that I can't reproduce it either on Linux with Python 3.7.3.

Also, I just tried on another Windows system with Python 3.7.4 and could NOT reproduce it. I'm a bit confused about what may be triggering this issue.

@simonjayhawkins
Member

I'm a bit confused about what may be triggering this issue

The error is originating from NumPy, so it could be down to the NumPy version being used.

@monstrorivas
Author

OK... I think I figured out what's causing it.
It looks like the error is produced only on the 32-bit version of Python.

I was able to reproduce it on different Windows machines with Python 3.7.4 (32-bit). When I switch to the 64-bit version there is no issue. Now that I think about it, the ValueError starts to make sense.

What doesn't make sense from a user perspective is that with limit=None it goes through even though it has to interpolate even more values in my example.
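
Reading the _interp_limit frames in the traceback, the rolling-window helper only seems to be called when a limit is actually given, which would explain why limit=None never builds the oversized view. A condensed, runnable sketch of that guard (the failing helper is stubbed out; this is not the complete pandas function):

import numpy as np

def _interp_limit_sketch(invalid, fw_limit, bw_limit):
    # Condensed from the pandas/core/missing.py frames shown above;
    # inner() stands in for the helper that builds the huge strided view.
    def inner(invalid, limit):
        raise ValueError("array is too big")  # what a 32-bit build hits

    f_idx = set()
    if fw_limit is not None:
        if fw_limit == 0:
            f_idx = set(np.where(invalid)[0])
        else:
            f_idx = inner(invalid, fw_limit)  # only reached with a finite limit
    # (the bw_limit branch follows the same pattern)
    return f_idx

mask = np.zeros(10, dtype=bool)
print(_interp_limit_sketch(mask, None, None))  # limit=None skips inner() entirely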

@simonjayhawkins
Member

It looks like the error is produced only on the 32-bit version of Python

Thanks for investigating further

@simonjayhawkins simonjayhawkins added the 32bit 32-bit systems label Jun 24, 2020