BUG: set_index screws up the dtypes on empty DataFrames #38419

Closed
2 of 3 tasks
Rufflewind opened this issue Dec 11, 2020 · 8 comments · Fixed by #38430
Labels
Dtype Conversions (unexpected or buggy dtype conversions), Needs Tests (unit test(s) needed to prevent regressions)
Comments

@Rufflewind
Contributor

  • I have checked that this issue has not already been reported.

    DataFrame.set_index() may not preserve dtype #30517 is similar in concept but relates to non-empty DataFrames, whereas the current bug is about empty ones.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas

# Empty DataFrame whose columns 'a' and 'b' have explicit dtypes.
d1 = pandas.DataFrame({'a': pandas.Series(dtype='datetime64[ns]'), 'b': pandas.Series(dtype='int64'), 'c': []})
d2 = d1.set_index(['a', 'b'])

# Fails on pandas 1.1.5: index level 'a' comes back as object.
assert (d1.loc[:, ['a', 'b']].dtypes == d2.index.to_frame().dtypes).all()
>>> d1.loc[:, ['a', 'b']].dtypes
a    datetime64[ns]
b             int64
dtype: object
>>> d2.index.to_frame().dtypes
a    object
b     int64
dtype: object

Problem description

The dtypes of the columns are silently changed when .set_index() is called. This only happens when the DataFrame is empty, which suggests that this is an edge-case bug.

This is problematic because the behavior of .set_index() varies depending on whether the DataFrame has any rows.
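
To make the dependence on row count concrete, here is a minimal sketch that runs the same check on an empty and a one-row DataFrame with identical column dtypes. The helper name `index_dtypes_preserved` is made up for this illustration and is not part of pandas. Based on the report above, the empty case would be expected to print False on 1.1.5, but that depends on the installed version.

```python
import pandas as pd


def index_dtypes_preserved(df, keys):
    """Return True if set_index(keys) keeps the dtypes of the key columns.

    Hypothetical helper for illustration only; not a pandas API.
    """
    before = df.loc[:, keys].dtypes
    after = df.set_index(keys).index.to_frame().dtypes
    return bool((before == after).all())


# Empty frame, built the same way as in the code sample above.
empty = pd.DataFrame(
    {
        "a": pd.Series(dtype="datetime64[ns]"),
        "b": pd.Series(dtype="int64"),
        "c": [],
    }
)

# Same columns and dtypes, but with a single row.
non_empty = pd.DataFrame(
    {
        "a": pd.Series(["2020-12-11"], dtype="datetime64[ns]"),
        "b": pd.Series([1], dtype="int64"),
        "c": [0.5],
    }
)

print("empty preserved:    ", index_dtypes_preserved(empty, ["a", "b"]))
print("non-empty preserved:", index_dtypes_preserved(non_empty, ["a", "b"]))
```

If the two lines disagree, the installed pandas version shows the row-dependent behavior described here.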

Expected Output

The assertion should not fail.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : b5958ee1999e9aead1938c0bba2b674378807b3d
python           : 3.9.0.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.9.13-arch1-1
Version          : #1 SMP PREEMPT Tue, 08 Dec 2020 12:09:55 +0000
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : None
LANG             : en_FYL.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.5
numpy            : 1.19.4
pytz             : 2020.4
dateutil         : 2.8.1
pip              : 20.2.1
setuptools       : 50.3.2
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.6.2
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.19.0
pandas_datareader: None
bs4              : 4.9.3
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.3
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.5.4
sqlalchemy       : 1.3.20
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None
Rufflewind added the Bug and Needs Triage labels Dec 11, 2020
@skvrahul
Contributor

skvrahul commented Dec 12, 2020

@Rufflewind I have tested this code snippet on the master branch, and it does not seem to raise the error there.

@rhshadrach
Member

I get dtypes changing on 1.1.5, but expected behavior on 1.0.x, 1.2, and master. Because this is a regression in 1.1.x and I think we're okay to add tests in the rc phase, I'm going to mark this as Needs Tests and label as 1.2.

@simonjayhawkins @jreback adjust if this is not correct.

rhshadrach added the Dtype Conversions and Needs Tests labels and removed the Bug and Needs Triage labels Dec 12, 2020
rhshadrach added this to the 1.2 milestone Dec 12, 2020
@IngErnestoAlvarez

So it's already fixed in the code? I would like to write the tests and, if necessary, change the code.

@rhshadrach
Member

@IngErnestoAlvarez That's correct; but tests still need to be added. A PR to do so is very much welcome!
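
For anyone picking this up, a regression test could look roughly like the sketch below. It reuses the reproducer from the report and assumes the public `pandas.testing.assert_series_equal` helper. The function name and where it would live in the test suite are placeholders, not something decided in this thread.

```python
import pandas as pd
from pandas.testing import assert_series_equal


def test_set_index_empty_dataframe_keeps_dtypes():
    # GH 38419: the dtypes of the key columns should survive set_index
    # even when the DataFrame has no rows.
    df1 = pd.DataFrame(
        {
            "a": pd.Series(dtype="datetime64[ns]"),
            "b": pd.Series(dtype="int64"),
            "c": [],
        }
    )
    df2 = df1.set_index(["a", "b"])

    # Compare the dtypes of the index levels with the original columns.
    result = df2.index.to_frame().dtypes
    expected = df1[["a", "b"]].dtypes
    assert_series_equal(result, expected)
```

Going by the comments above, a test like this would be expected to fail on 1.1.5 and pass on 1.0.x, 1.2, and master.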

@jordi-crespo
Contributor

I would like to work on this issue.

jordi-crespo added a commit to jordi-crespo/pandas that referenced this issue Dec 12, 2020
@arw2019
Member

arw2019 commented Dec 12, 2020

@IngErnestoAlvarez @jordi-crespo if possible it's good to avoid overlapping PRs (since only one PR will be merged)

jordi-crespo added a commit to jordi-crespo/pandas that referenced this issue Dec 13, 2020
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Dec 13, 2020
@simonjayhawkins
Member

> I get dtypes changing on 1.1.5, but expected behavior on 1.0.x, 1.2, and master.

first bad commit: [4edcc55] CLN: Make Series._values match Index._values (#31182)

need to also determine which commit fixed on master

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Dec 13, 2020
@simonjayhawkins
Member

> need to also determine which commit fixed on master

[462b21d] Fix bug in combine first with string dtype and NA only in one level of index (#37568)

jreback modified the milestones: 1.2, 1.3 Dec 13, 2020
luckyvs1 pushed a commit to luckyvs1/pandas that referenced this issue Jan 20, 2021