BUG: pandas.DataFrame.interpolate fails with high value of limit argument #34936


Open
2 of 3 tasks
monstrorivas opened this issue Jun 22, 2020 · 8 comments
Labels
32bit 32-bit systems Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

Comments

@monstrorivas

monstrorivas commented Jun 22, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
import numpy as np
df = pd.DataFrame([1]*500000)
df.iloc[1000:50000] = np.nan
df.interpolate(method='linear', limit_direction='both', limit=None)  # This runs fine even though it effectively fills runs longer than 5000 data points
df.interpolate(method='linear', limit_direction='both', limit=5000)  # This produces an error

Problem description

An error is produced when specifying a large limit in pandas.DataFrame.interpolate.
The error is NOT present in pandas 1.0.1, but it is present at least in 1.0.4 and 1.0.5.
If limit is set to None there is no error, even when the runs of consecutive NaNs being interpolated are longer than the limit value that fails.

Error:

ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.

The error occurs with Python 3.7 but not with Python 3.6.

Expected Output

The expected output is what pandas v1.0.1 produces.

In Python 3.6, specifying a large value of limit doesn't result in a ValueError.

This runs in v1.0.1:

import pandas as pd
import numpy as np
df = pd.DataFrame([1]*500000)
df.iloc[1000:50000] = np.nan
dff = df.interpolate(method='linear', limit_direction='both', limit=5000)
assert dff.isna().sum().values == 39000
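(Rows 1000–49999 are 49,000 NaNs; with limit=5000 applied from each direction, 49,000 − 2 × 5,000 = 39,000 NaNs remain, which is what the assert checks.)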

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 32
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : None.None

pandas : 1.0.5
numpy : 1.18.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 47.1.1
Cython : 0.29.20
pytest : 5.4.3
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.15.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.4.3
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@monstrorivas monstrorivas added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 22, 2020
@monstrorivas
Author

monstrorivas commented Jun 22, 2020

Upon further testing, it seems that the issue doesn't come from the pandas version but from the Python version.

It looks like it works just fine with Python 3.6 but produces the error with 3.7. I updated the original description above.

@simonjayhawkins
Member

Thanks @monstrorivas for the report. Not able to reproduce on Python 3.7.7 on Linux or Python 3.8.3 on Windows.

Can you post the full traceback?

@simonjayhawkins simonjayhawkins added Needs Info Clarification about behavior needed to assess issue Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 23, 2020
@monstrorivas
Author

Here's the traceback of the error.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-ef5cfa224e91> in <module>
      4 df.iloc[1000:50000] =np.nan
      5 df.interpolate(method='linear', limit_direction='both', limit=None)  # This runs fine eventhough the limit is effectively > 5000 datapoints
----> 6 df.interpolate(method='linear', limit_direction='both', limit=5000)  # This produces an error

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\pandas\core\generic.py in interpolate(self, method, axis, limit, inplace, limit_direction, limit_area, downcast, **kwargs)
   7015             inplace=inplace,
   7016             downcast=downcast,
-> 7017             **kwargs,
   7018         )
   7019 

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\pandas\core\internals\managers.py in interpolate(self, **kwargs)
    568 
    569     def interpolate(self, **kwargs):
--> 570         return self.apply("interpolate", **kwargs)
    571 
    572     def shift(self, **kwargs):

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, filter, **kwargs)
    440                 applied = b.apply(f, **kwargs)
    441             else:
--> 442                 applied = getattr(b, f)(**kwargs)
    443             result_blocks = _extend_blocks(applied, result_blocks)
    444 

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\pandas\core\internals\blocks.py in interpolate(self, method, axis, index, values, inplace, limit, limit_direction, limit_area, fill_value, coerce, downcast, **kwargs)
   1174             inplace=inplace,
   1175             downcast=downcast,
-> 1176             **kwargs,
   1177         )
   1178 

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\pandas\core\internals\blocks.py in _interpolate(self, method, index, values, fill_value, axis, limit, limit_direction, limit_area, inplace, downcast, **kwargs)
   1270 
   1271         # interp each column independently
-> 1272         interp_values = np.apply_along_axis(func, axis, data)
   1273 
   1274         blocks = [self.make_block_same_class(interp_values)]

<__array_function__ internals> in apply_along_axis(*args, **kwargs)

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\numpy\lib\shape_base.py in apply_along_axis(func1d, axis, arr, *args, **kwargs)
    377     except StopIteration:
    378         raise ValueError('Cannot apply_along_axis when any iteration dimensions are 0')
--> 379     res = asanyarray(func1d(inarr_view[ind0], *args, **kwargs))
    380 
    381     # build a buffer for storing evaluations of func1d.

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\pandas\core\internals\blocks.py in func(x)
   1266                 fill_value=fill_value,
   1267                 bounds_error=False,
-> 1268                 **kwargs,
   1269             )
   1270 

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\pandas\core\missing.py in interpolate_1d(xvalues, yvalues, method, limit, limit_direction, limit_area, fill_value, bounds_error, order, **kwargs)
    244     else:
    245         # both directions... just use _interp_limit
--> 246         preserve_nans = set(_interp_limit(invalid, limit, limit))
    247 
    248     # if limit_area is set, add either mid or outside indices

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\pandas\core\missing.py in _interp_limit(invalid, fw_limit, bw_limit)
    651             f_idx = set(np.where(invalid)[0])
    652         else:
--> 653             f_idx = inner(invalid, fw_limit)
    654 
    655     if bw_limit is not None:

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\pandas\core\missing.py in inner(invalid, limit)
    640     def inner(invalid, limit):
    641         limit = min(limit, N)
--> 642         windowed = _rolling_window(invalid, limit + 1).all(1)
    643         idx = set(np.where(windowed)[0] + limit) | set(
    644             np.where((~invalid[: limit + 1]).cumsum() == 0)[0]

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\pandas\core\missing.py in _rolling_window(a, window)
    682     shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    683     strides = a.strides + (a.strides[-1],)
--> 684     return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\numpy\lib\stride_tricks.py in as_strided(x, shape, strides, subok, writeable)
    101         interface['strides'] = tuple(strides)
    102 
--> 103     array = np.asarray(DummyArray(interface, base=x))
    104     # The route via `__interface__` does not preserve structured
    105     # dtypes. Since dtype should remain unchanged, we set it explicitly.

c:\users\alberto\projects\virtualenvs\causality_dev\lib\site-packages\numpy\core\_asarray.py in asarray(a, dtype, order)
     83 
     84     """
---> 85     return array(a, dtype, copy=False, order=order)
     86 
     87 

ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.

@simonjayhawkins
Member

Thanks @monstrorivas for the detail. There's an open PR, #34727, that no longer uses the function where the error originates, _rolling_window(a, window) in pandas\core\missing.py.
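
For reference, the failing construction can be reproduced in isolation. The sketch below reuses the _rolling_window body shown in the traceback; on a 32-bit build the nominal size of the strided view exceeds NumPy's addressable limit even though no data is actually copied:

import numpy as np

def _rolling_window(a, window):
    # Same construction as in pandas/core/missing.py (see traceback):
    # a view with one row per sliding-window position; nothing is copied.
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

invalid = np.zeros(500_000, dtype=bool)        # the NaN mask from the example
windowed = _rolling_window(invalid, 5000 + 1)  # window = limit + 1

# Nominal view size: 495,000 * 5,001 * 1 byte ~= 2.48 GB > 2**31 - 1 bytes,
# so 32-bit NumPy raises "array is too big" inside as_strided.
print(windowed.shape)  # (495000, 5001) on 64-bit; ValueError on 32-bit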

@simonjayhawkins simonjayhawkins removed the Needs Info Clarification about behavior needed to assess issue label Jun 23, 2020
@monstrorivas
Author

monstrorivas commented Jun 23, 2020

Thanks @simonjayhawkins for the information on that PR.

After your comment, I did more testing on different systems. I can confirm that I can't reproduce it either on Linux with Python 3.7.3.

Also, I just tried on another Windows system with Python 3.7.4 and could NOT reproduce it. I'm a bit confused about what may be triggering this issue.

@simonjayhawkins
Member

I'm a bit confused about what may be triggering this issue

The error is originating from NumPy, so it could be down to the NumPy version being used.

@monstrorivas
Author

OK... I think I figured out what's causing it.
It looks like the error is produced only on the 32-bit version of Python.

I was able to reproduce it on different Windows machines with Python 3.7.4 (32-bit). When I switch to the 64-bit version there is no issue. Now that I think about it, the ValueError starts to make sense.

What doesn't make sense from a user perspective is that with limit=None it goes through even though it has to interpolate even more values in my example.
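
Reading the _interp_limit frames in the traceback, the rolling-window helper only seems to be called when a limit is actually given, which would explain why limit=None never builds the oversized view. A condensed, runnable sketch of that guard (the failing helper is stubbed out; this is not the complete pandas function):

import numpy as np

def _interp_limit_sketch(invalid, fw_limit, bw_limit):
    # Condensed from the pandas/core/missing.py frames shown above;
    # inner() stands in for the helper that builds the huge strided view.
    def inner(invalid, limit):
        raise ValueError("array is too big")  # what a 32-bit build hits

    f_idx = set()
    if fw_limit is not None:
        if fw_limit == 0:
            f_idx = set(np.where(invalid)[0])
        else:
            f_idx = inner(invalid, fw_limit)  # only reached with a finite limit
    # (the bw_limit branch follows the same pattern)
    return f_idx

mask = np.zeros(10, dtype=bool)
print(_interp_limit_sketch(mask, None, None))  # limit=None skips inner() entirely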

@simonjayhawkins
Member

It looks like the error is produced only on the 32-bit version of Python

Thanks for investigating further

@simonjayhawkins simonjayhawkins added the 32bit 32-bit systems label Jun 24, 2020