-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: Change in default column index dtype #51725
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@pandas-dev/pandas-core Labeling as 2.0 for now. |
The doc string for Series says regarding index in pandas v1.5:
and for DataFrame:
The first 3 examples you provided (a, b, c) all were provided columns indexes while the last one (d) wasn't, so only the last one fell back to use I wrote the PR is question (#49572) so possibly I'm feeling a sense of ownership towards this change, so I'll let other core devs decide on this. |
I totally agree this is more consistent, and there are benefits to that. There were definitely weird aspects to the old behavior. I just think the particular way consistency was achieved could use some more consideration. E.g. defaulting to strings for columns or using some "unititialzed index" value when nothing is provided. AFAICT In [1]: import pandas as pd
In [2]: pd.api.types.infer_dtype(pd.Index(["a"])[0:0])
Out[2]: 'empty'
In [3]: pd.api.types.infer_dtype(pd.Index([1])[0:0])
Out[3]: 'integer' I'm also totally on board with the default I'm just not as on board with the column change. I just saw it didn't get much discussion, think there could be better solutions, and think it's going to break a lot of people's test suites. Selfishly, I would like package authors upgrade path to be as smooth as possible – so other people don't pin pandas, and I don't have to deal with the resulting issues. |
* Add prerelease tests so we're testing against pandas 2.0.rc * Fix for pandas default column dtype change, pandas-dev/pandas#51725
Thanks for raising this @ivirshup. I do agree we should ensure changes like these are improving pandas and not just causing those downstream more work, but in my opinion that is the case here. I agree with @topper-123 - we're now more consistent and we agree with what the documented behavior was. I'm +1 on the change, but happy to consider any further discussion. |
OK to keep this and move it off the 2.0 milestone? |
There hasn't been much discussion, I'd be in favor of closing. If there is more discussion in the future, we can reopen. |
sounds good - closing for now then |
this now gives different results on main - pointing to the fact that this could require more clarity on tests/behavior. on main - the 1st and 4th examples now evaluate equal. 2nd and 3rd examples evaluate equal. In [16]: for k1, k2 in combinations(empty_dataframes, 2):
...: try:
...: pd.testing.assert_frame_equal(empty_dataframes[k1], empty_dataframes[k2])
...: except AssertionError:
...: print(k1, "!=", k2)
...:
a != b
a != c
b != d
c != d
In [25]: pd.__version__
Out[25]: '2.1.0.dev0+547.gc85bbc6913' |
there has been some discussion (mostly with @jorisvandenbossche) of introducing a empty/void dtype that would become (after a deprecation cycle) the default in reindexing and constructing Series/DataFrame without data. I think it would be reasonable to do the same for Index. In the branch I made trying to handle #51363 i think having a void-dtype empty Index would have been helpful. |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
Pre 2.0.0rc0, all the dataframes in the example compared as equal. In 2.0.0rc0:
Caused by:
This is more annoying than fundamentally breaking, as it's a manageable breaking change. However, AFAICT the breaking change to default column dtypes was not made to prevent bugs but for consistency, and without much discussion.
Because I think this will impact a large number of downstream project's test suites, I think it's worth having some discussion about this change. I also think there's probably a more reasonable default dtype for empty columns.
I think this change will break many test suites
As the PR here showed, this can be heavy on the test suite. In the library I was testing this release candidate against (anndata), I believe this change was responsible for 25 test failures.
When I search github, it looks like
assert_frame_equal
is used in about 34,000 python files (search link). I think this change will break many test suites.Semantics are strange, as "empty" is no longer equivalent to "no value provided"
Previously brought up:
As noted in the mentioned issue, the semantics here are strange. I certainly would not expect this to error:
Nor would I expect the example code (above) to error. I would think all of these DataFrame have no columns, so would compare as equal.
Even the docs for this breaking change equates
[]
with "empty-ness":RangeIndex is a reasonable default for
.index
. Probably not for.columns
.I think the popularity of the
tidyverse
andarrow
have shown people can go without/ don't care about row names, so defaulting to the most efficient representation here makes sense. I'm not sure this still holds for columns. RangeIndex would make a ton of sense as a default for an ndimensional library likexarray
.But this is a tabular library, and I think "default
string
dtype" is probably a much more reasonable default dtype for the columns. E.g.:pa.table([]).to_pandas()
.Also quite reasonable is
empty
, which was the "inferred dype" pre 2.0.0rc0.Expected Behavior
I think there's a bunch of different ways this could go.
TBH, I don't think the impact of any particular decision here would be huge, but that's kinda the point of my first possible alternative:
Reverting
Reverting is easy, but I do get that it's not as consistent.
I think it's worth considering the usefulness of this change relative to the impact on the upgrade process.
Columns default to the string data-type
I think this is pretty reasonable and would be quite consistent. It's possible this will also cause issues due to the changes happening around string dtypes at the moment.
Inferred dtype of default index value is "empty"
By which I mean the value returned from
pd.api.types.infer_dtype
, which is returning "empty" forpd.api.types.infer_dtype(pd.DataFrame({}).columns)
I'm not sure about the implications of this solution, so won't advocate for it too strongly. But
Empty indices behave as though their dtype was the bottom type in many operations
For example:
So maybe this could be formalized in some way.
Installed Versions
The text was updated successfully, but these errors were encountered: