DataFrame.iat indexing with duplicate columns #11754

Closed
wavexx opened this issue Dec 3, 2015 · 23 comments · Fixed by #33362
Labels
good first issue Indexing Related to indexing on series/frames, not to indexes themselves Needs Tests Unit test(s) needed to prevent regressions

@wavexx

wavexx commented Dec 3, 2015

# error
In [27]: pd.DataFrame([[1, 1]], columns=['x','x']).iat[0,0]
TypeError: len() of unsized object

# ok
In [26]: pd.DataFrame([[1, 1]], columns=['x','y']).iat[0,0]
Out[26]: 1

I have some weird issue in a DataFrame I'm creating from a row-based array.
Using python3 and pandas 0.17.1 (from debian unstable), I get:

df = pandas.DataFrame(data=data[1:], columns=data[0])
df.iat[0, 0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1555, in __getitem__
    return self.obj.get_value(*key, takeable=self._takeable)
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 1808, in get_value
    series = self._iget_item_cache(col)
  File "/usr/lib/python3/dist-packages/pandas/core/generic.py", line 1116, in _iget_item_cache
    lower = self.take(item, axis=self._info_axis_number, convert=True)
  File "/usr/lib/python3/dist-packages/pandas/core/generic.py", line 1371, in take
    convert=True, verify=True)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3628, in take
    axis=axis, allow_dups=True)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3510, in reindex_indexer
    indexer, fill_tuple=(fill_value,))
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3536, in _slice_take_blocks_ax0
    slice_or_indexer, self.shape[0], allow_fill=allow_fill)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 4865, in _preprocess_slice_or_indexer
    return 'fancy', indexer, len(indexer)
TypeError: len() of unsized object

Interestingly, I can otherwise manage the dataframe just fine.
The same code, running under python2.7 shows no issue.

What could be the cause of such error?

@TomAugspurger
Contributor

data isn't defined. Can you post a copy-pastable example?

@wavexx
Author

wavexx commented Dec 3, 2015

The dataset was too big, and trying to reduce the example proved to be harder than I thought (the dataset is merged over and over in several loops).

Any hint as to what could cause this, given the traceback? Or what to look for in the resulting df so that I can come up with an example?

@jreback
Contributor

jreback commented Dec 3, 2015

pls post pd.show_versions()
show data.info() right before the call to .iat as well as data.head()

@wavexx
Author

wavexx commented Dec 3, 2015

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.2.0-1-amd64
machine: x86_64
processor: 
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.1
nose: 1.3.6
pip: None
setuptools: 18.7
Cython: 0.23.2
numpy: 1.9.2
scipy: 0.16.1
statsmodels: None
IPython: 2.3.0
sphinx: None
patsy: 0.4.1
dateutil: 2.4.2
pytz: 2012c
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.3
matplotlib: 1.5.0rc2
openpyxl: 2.3.0
xlrd: 0.9.4
xlwt: None
xlsxwriter: None
lxml: 3.4.4
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
Jinja2: None

The dataset has too many columns (>2k) for data.head() to be of any use here.
I tried subsetting it while it still triggers the issue, and I now noticed this in data.info():

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1337 entries, 0 to 1336
Data columns (total 265 columns):
TYPE           1337 non-null object
               1337 non-null     object
    object
dtype: object
dtype: object
CODE           1337 non-null object
SAMPLE-TYP     1337 non-null object
C8:0           1337 non-null object
C14:0          1337 non-null object
C16:0          1337 non-null object
C17:0          1337 non-null object
C18:1          1337 non-null object
C18:0          1337 non-null object

Notice the first/second column types, which are pasted here literally (looks like some memory corruption). Indeed, I also noticed now that if I break out of the loop with an exit(), python segfaults...

@jreback
Contributor

jreback commented Dec 3, 2015

it looks like you have an actual pandas object (DataFrame maybe) inside each cell in the TYPE column. not surprised it doesn't work then. This is ok in theory, except that indexing does not work very well with it.

show df.iloc[0] and df.iloc[0,0] (if they don't trigger errors)

@wavexx
Author

wavexx commented Dec 3, 2015

They work:

>>> data.iloc[0]
CODE                  XS05301
SAMPLE-TYP            Analyte
PLPE_PE_28:0%         2.29483
PLPE_16:0/16:1%    0.00195465
PLPE_16:0/16:0%    0.00357268
PLPE_16:0/18:3%    0.00302107
....

>>> data.iloc[0,0]
'XS05301'

@jreback
Contributor

jreback commented Dec 3, 2015

show the TYPE column, as that's the one with the issue (and do a type(data.iloc[....]))

@wavexx
Author

wavexx commented Dec 3, 2015

It's just a constant string:

>>> data.TYPE.describe()
count      1337
unique        1
top       Tirol
freq       1337
Name: TYPE, dtype: object

I actually drop it later. To me it looks like it's the second column that could be potentially doubtful.

But I'll have to dig into this more closely. Now that I've noticed the segfault, and since it's quite reproducible, there are a few modules (such as xlrd) that I can remove from my test program by jumping through a few more hoops.

@wavexx
Author

wavexx commented Dec 3, 2015

Turns out, data[0] (the header list) contains at one step several names which are identical.
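
For anyone trying to spot the same condition in their own frame, here is a minimal sketch using `Index.is_unique` and `Index.duplicated()`. The header and row values are made up for illustration; only the `data[0]` / `data[1:]` layout follows the snippet earlier in this thread.

```python
import pandas as pd

# Hypothetical row-based array: first row is the header, with one repeated name.
data = [["x", "y", "x"], [1, 2, 3]]
df = pd.DataFrame(data=data[1:], columns=data[0])

# is_unique is False as soon as any column label repeats.
print(df.columns.is_unique)  # False

# duplicated() flags every repeat after the first occurrence of a label.
dupes = df.columns[df.columns.duplicated()].unique().tolist()
print(dupes)  # ['x']
```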

@wavexx
Author

wavexx commented Dec 3, 2015

Indeed, that seems to be the only issue I had. By ensuring names are unique, I also don't get any more segmentation faults.

Would it make sense to check the value provided directly in the constructor, or is it too expensive?
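
One way to ensure unique names before calling the constructor is sketched below. `make_unique` is a hypothetical helper, not a pandas API, and the suffix scheme is an assumption; it can still collide if a name such as `x.1` already appears in the header.

```python
from collections import Counter

import pandas as pd


def make_unique(names):
    """Append .1, .2, ... to repeated names (hypothetical helper, not part of pandas)."""
    seen = Counter()
    out = []
    for name in names:
        count = seen[name]
        # The first occurrence keeps its name; later ones get a numeric suffix.
        out.append(name if count == 0 else f"{name}.{count}")
        seen[name] += 1
    return out


header = ["x", "x", "y", "x"]  # made-up header with repeats
df = pd.DataFrame([[1, 2, 3, 4]], columns=make_unique(header))
print(list(df.columns))  # ['x', 'x.1', 'y', 'x.2']
```

With unique labels in place, `df.columns.is_unique` is true and label-based access is no longer ambiguous.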

@jreback
Contributor

jreback commented Dec 4, 2015

so can you post a short repro?

@wavexx
Author

wavexx commented Dec 4, 2015

data = pd.DataFrame([[1, 1]], columns=['x','x'])
data.iat[0,0]

@jreback
Contributor

jreback commented Dec 4, 2015

@wavexx thanks!

# error
In [27]: pd.DataFrame([[1, 1]], columns=['x','x']).iat[0,0]
TypeError: len() of unsized object

# ok
In [26]: pd.DataFrame([[1, 1]], columns=['x','y']).iat[0,0]
Out[26]: 1

@jreback jreback added Bug Indexing Related to indexing on series/frames, not to indexes themselves Difficulty Intermediate labels Dec 4, 2015
@jreback jreback added this to the Next Major Release milestone Dec 4, 2015
@jreback
Contributor

jreback commented Dec 4, 2015

pull-requests are welcome!

@jreback jreback changed the title DataFrame.iat indexing error with python3 DataFrame.iat indexing with duplicate columns Dec 4, 2015
@wavexx
Author

wavexx commented Dec 4, 2015

What should be the desired outcome here? Handling duplicate columns in indexing throughout, or refusing to handle duplicate columns?

@jreback
Contributor

jreback commented Dec 4, 2015

duplicate indexing is handled pretty well, see how .iloc does it. this has prob just not been updated.
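
As a rough illustration of how `.iloc` copes with duplicate labels (a sketch with made-up values, not taken from the pandas test suite): positional access never consults the labels, so the repeated name causes no ambiguity, while label-based selection returns every matching column.

```python
import pandas as pd

df = pd.DataFrame([[1, 2]], columns=["x", "x"])

# Positional scalar access: works regardless of repeated labels.
first = df.iloc[0, 0]
second = df.iloc[0, 1]
print(first, second)  # 1 2

# Selecting by a duplicated label yields a DataFrame holding both columns.
both = df["x"]
print(both.shape)  # (1, 2)
```

On releases where `.iat` still fails, `.iloc[row, col]` can serve as a workaround for scalar access.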

@wavexx
Author

wavexx commented Dec 4, 2015

Ok. But there's probably some other bug which I still didn't figure out.
In my earlier example I showed:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1337 entries, 0 to 1336
Data columns (total 265 columns):
TYPE           1337 non-null object
               1337 non-null     object
    object

That second column turned out to be named as a single space ( ), which was fetched from a hidden column from xlrd (yeah.. I know.) and was duplicated. During a merge, that duplicated column resulted somehow in a nested dataframe as you suggested.

I couldn't reproduce it succinctly yet, but I think pd.merge might also have some other edge cases with duplicated columns.

@jreback
Contributor

jreback commented Dec 4, 2015

@wavexx maybe, but this is a clear easy-to-repro bug. if you can find the source of the other then pls open a new report.

@wavexx
Author

wavexx commented Dec 4, 2015

Since the intention is to handle duplicate columns in indexing, I opened a couple of issues for some cases I think should be improved.

@hayd
Contributor

hayd commented Feb 6, 2017

Related for at:

In [11]: pd.DataFrame([[1, 1]], columns=['x','x']).at[0,'x']
AttributeError: 'BlockManager' object has no attribute 'T'

@wavexx
Author

wavexx commented Feb 23, 2019

Has anybody had a shot at fixing this? I still get bitten by this from time to time. I wouldn't say duplicate column names are "well" handled as long as plain ordinal indexing doesn't even work :(

@simonjayhawkins
Member

This is fixed on master

>>> import pandas as pd
>>>
>>> pd.__version__
'1.1.0.dev0+1029.gbdf969cd6'
>>>
>>> pd.DataFrame([[1, 1]], columns=["x", "x"]).iat[0, 0]
1

@simonjayhawkins simonjayhawkins added Needs Tests Unit test(s) needed to prevent regressions and removed Bug labels Mar 30, 2020
@simonjayhawkins
Member

This is fixed on master

fixed in #32089 which is not yet released, so could also add a whatsnew to 1.1 for this issue.

dafec63 is the first new commit
commit dafec63
Author: jbrockmendel
Date: Sat Feb 22 07:56:03 2020 -0800

BUG: DataFrame.iat incorrectly wrapping datetime objects (#32089)
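
Since the issue is labelled "Needs Tests", a regression test could look roughly like the sketch below. The test name and values are hypothetical and this is not the test that was eventually merged; it assumes a pandas version that includes the fix (1.1 or later, per the comment above).

```python
import pandas as pd


def test_iat_duplicate_columns():
    # Hypothetical test name; mirrors the repro from this issue,
    # with distinct values so each position can be checked separately.
    df = pd.DataFrame([[1, 2]], columns=["x", "x"])
    assert df.iat[0, 0] == 1
    assert df.iat[0, 1] == 2


test_iat_duplicate_columns()
```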
