Skip to content

BUG: NBC change in reindexing logic of Series.__setitem__ in pandas=1.1.0 #37427

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2 of 3 tasks
kozlov-alexey opened this issue Oct 26, 2020 · 9 comments · Fixed by #37502
Closed
2 of 3 tasks

BUG: NBC change in reindexing logic of Series.__setitem__ in pandas=1.1.0 #37427

kozlov-alexey opened this issue Oct 26, 2020 · 9 comments · Fixed by #37502
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@kozlov-alexey
Copy link

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.

Hi,

The following example of using Series.__setitem__ shows different results on pandas=1.0.5 and pandas=1.1.3:

import pandas as pd
import numpy as np

S1 = pd.Series([0, 1, -12, 3, -10, 5, 6, 7, 8, -11, -13], dtype=np.int32)
idx = pd.Series([4, 9, 2, 10], dtype=np.int64)
value = pd.Series([-10, -11, -12, -13], dtype=np.int32)
S1[idx] = value  
S1

It looks like #33643 changed the semantics of setitem operation when assigned values are pandas.Series, i.e. the old-behavior was to first align value with idx and then assign value items to S1 basing on idx.values as index labels, so result was:

>>> S1
0      0
1      1
2    -12
3      3
4    -10
5      5
6      6
7      7
8      8
9    -11
10   -13
dtype: int32

And now the result is as if S1[idx] took precedence before the actual assignment, so that labels in idx.values which are absent in value.index are assigned NaNs:

>>> S1
0      0.0
1      1.0
2    -12.0
3      3.0
4      NaN
5      5.0
6      6.0
7      7.0
8      8.0
9      NaN
10     NaN
dtype: float64

Can you please confirm this was intentional change (as I don't see any tests updated to reflect that, nor there are mentions in "What’s new in 1.1.0" section of the documentation)? From the user perspective it looks like with new implementation, user will have to ensure that value.index == idx.values to achieve the same result, which is not a big deal of course, but still it's some additional code.

First version where the issue appears:

INSTALLED VERSIONS
------------------
commit           : 4071c3b872e4ef98a2a2d7c752dd5d2030c4df6e
python           : 3.7.7.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.18362
machine          : AMD64
processor        : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.None

pandas           : 1.1.0.dev0+1307.g4071c3b87

Kind Regards,
Alexey.

@kozlov-alexey kozlov-alexey added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 26, 2020
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Oct 26, 2020

@kozlov-alexey Thanks a lot for the detailed report!

I think this wasn't an intentional change, at least it certainly seems so from the PR (as it was only meant as a code clean-up).

It's an unfortunate inconsistency between __getitem__ and loc, but not one we should just change like this I think, as this can easily (and silently) break your code.

cc @jbrockmendel

@jorisvandenbossche jorisvandenbossche added Indexing Related to indexing on series/frames, not to indexes themselves Regression Functionality that used to work in a prior pandas version and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Oct 26, 2020
@jorisvandenbossche jorisvandenbossche added this to the 1.1.4 milestone Oct 26, 2020
@jbrockmendel
Copy link
Member

out fo curiosity, what does NBC stand for in this context?

@kozlov-alexey
Copy link
Author

kozlov-alexey commented Oct 28, 2020

A non-backward compatible change. I thought it's well-spread but it turns it doesn't)
And thanks for quick reply!

@simonjayhawkins simonjayhawkins modified the milestones: 1.1.4, 1.1.5 Oct 29, 2020
@simonjayhawkins
Copy link
Member

moved off 1.1.4 milestone (scheduled for release tomorrow) as no PRs to fix in the pipeline

@jorisvandenbossche
Copy link
Member

I put up a PR reverting part of the cleaning PR that caused this. If it's an acceptable fix, we can still merge by tomorrow morning: #37502

@jorisvandenbossche
Copy link
Member

I opened a follow-up issue to actually discuss which behaviour we want long term (the non-alignment of __getitem__, or the alignment of .loc): #37516

@kozlov-alexey
Copy link
Author

@jorisvandenbossche Hi, it seems the rollback in #37502 left few lines untouched, literally these:
https://github.com/jbrockmendel/pandas/blob/6a5c2429b140d7de5e1e04f6f5ed5d9351418848/pandas/core/series.py#L1034
https://github.com/jbrockmendel/pandas/blob/6a5c2429b140d7de5e1e04f6f5ed5d9351418848/pandas/core/series.py#L1077

and because of this same issue still appears with non-integer dtype keys. For example:

>>> pd.__version__
'1.2.0.dev0+1512.gec8240aa0'
>>> S = pd.Series(np.arange(11), index=['ab', 'ac', 'ad', 'a1', 'a2', 'a3', 'a ', 'ba', 'bc', 'bd', 'b1'])
>>> idx = pd.Series(['a2', 'bd', 'ad', 'b1'])
>>> set_value = pd.Series([-10, -11, -12, -13])
>>> S[idx] = set_value
>>> S
ab    0.0
ac    1.0
ad    NaN
a1    3.0
a2    NaN
a3    5.0
a     6.0
ba    7.0
bc    8.0
bd    NaN
b1    NaN
dtype: float64

Although there's ongoing discussion on what behavior is correct, can you maybe fully revert back to previous behavior for the time being?

Best Regards,
Alexey.

@jbrockmendel
Copy link
Member

any reason not to do S[idx] = np.array(set_value)?

@kozlov-alexey
Copy link
Author

kozlov-alexey commented Dec 11, 2020

@jbrockmendel That's possible of course, but it is still a workaround, and currently it will work with integer keys one way (i.e. old behavior of getitem) and with the rest of keys other way (new behavior of loc). Isn't it better to rollback back to old behavior for all cases?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Indexing Related to indexing on series/frames, not to indexes themselves Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants