Description
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): MacOS 10.15.7
- Modin version (modin.__version__): 0.8.2
- Python version: 3.8.5
- Code we can use to reproduce:
import pandas as pd
print("===PANDAS===")
s = pd.Series(['green'])
print(s)
print(type(s))
su = s.unique()
# Leads to same error as modin's unique()
# su = s.unique().squeeze()
print(su)
print(type(su))
print(len(su))
import modin.pandas as md
print("\n===MODIN===")
s = md.Series(['green'])
print(s)
print(type(s))
su = s.unique()
print(su)
print(type(su))
print(len(su))
Describe the problem
Whenever unique() is called on a Series that has only one unique value, Modin returns a zero-dimensional NumPy array (effectively a scalar), whereas pandas returns a NumPy array of length 1. As a result, calling len() on the result of Modin's unique() raises a TypeError, because a zero-dimensional array has no length; the same call on the pandas result works. This is likely because Modin's implementation calls squeeze(), and squeezing an array of length 1 collapses it to zero dimensions.
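The squeeze behavior can be shown with NumPy alone, without Modin. This is a minimal sketch that assumes the array returned by unique() looks like the one-element object array pandas produces in the log below; the variable names are only illustrative.

```python
import numpy as np

# Stand-in for what pandas' unique() returns for a single unique value.
arr = np.array(["green"], dtype=object)

# squeeze() drops every axis of length 1, leaving a zero-dimensional array.
squeezed = arr.squeeze()

print(arr.shape, arr.ndim)            # one axis of length 1
print(squeezed.shape, squeezed.ndim)  # no axes left
print(type(squeezed))                 # still a numpy.ndarray, not a Python str

# len() needs at least one axis, so it fails on the squeezed array.
try:
    len(squeezed)
except TypeError as exc:
    print("TypeError:", exc)
```

With two or more elements there is no length-1 axis to drop, so squeeze() leaves the array as it is, which matches the observation that the error only shows up for a single unique value.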
This error does not occur when there are two or more unique values. One solution could be to remove the squeeze() call from Modin's unique() implementation. I will do more testing and try to follow up with a PR.
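Until a fix lands, a caller-side workaround might be to normalize the result back to one dimension. The sketch below is an assumption, not Modin's API: as_1d is a hypothetical helper, and the squeezed array is a stand-in that simulates the result described above rather than an actual call into Modin.

```python
import numpy as np

def as_1d(unique_result):
    """Hypothetical helper: make sure a unique() result has at least one axis."""
    # np.atleast_1d turns a zero-dimensional array into shape (1,)
    # and returns arrays that already have one or more axes unchanged.
    return np.atleast_1d(unique_result)

# Stand-in for the zero-dimensional result reported for Modin 0.8.2.
squeezed = np.array(["green"], dtype=object).squeeze()

fixed = as_1d(squeezed)
print(fixed, len(fixed))

# A result with several unique values passes through with the same shape.
several = as_1d(np.array(["green", "blue"], dtype=object))
print(several, len(several))
```

Wrapping the call this way keeps len() and iteration working for both the single-value and multi-value cases, so it should be safe to leave in place after the underlying behavior is fixed.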
Source code / logs
Log from above code to reproduce:
===PANDAS===
0 green
dtype: object
<class 'pandas.core.series.Series'>
['green']
<class 'numpy.ndarray'>
1
===MODIN===
0 green
dtype: object
<class 'modin.pandas.series.Series'>
green
<class 'numpy.ndarray'>
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-1-c4f0aa247643> in <module>
19 print(su)
20 print(type(su))
---> 21 print(len(su))
TypeError: len() of unsized object
Source code for Modin's unique() (calls squeeze() after to_numpy()):
Lines 1347 to 1348 in c86422a