API: allow Series comparison ops to align before comparison (GH1134) #6860

Closed
jreback wants to merge 2 commits

@jreback
Contributor

jreback commented Apr 10, 2014

closes #1134

reordered comparisons

In [1]: s1 = Series(index=["A", "B", "C"], data=[1, 2, 3])

In [2]: s1
Out[2]:
A    1
B    2
C    3
dtype: int64

In [3]: s2 = Series(index=["C", "B", "A"], data=[3, 2, 1])

In [4]: s2
Out[4]:
C    3
B    2
A    1
dtype: int64

In [5]: s1 == s2
Out[5]: 
A    True
B    True
C    True
dtype: bool

Here s3 has no value for label A, so after alignment that position is NaN and every comparison against it returns False:

In [6]: s3 = Series(index=["C", "B"], data=[3, 2])

In [7]: s3
Out[7]:
C    3
B    2
dtype: int64

In [8]: s1 == s3
Out[8]:
A    False
B     True
C     True
dtype: bool

In [9]: s1 > s3
Out[9]:
A    False
B    False
C    False
dtype: bool

In [10]: s1 < s3
Out[10]: 
A    False
B    False
C    False
dtype: bool

@jreback jreback added this to the 0.14.0 milestone Apr 10, 2014
@jreback
Contributor Author

jreback commented Apr 10, 2014

IIRC we discussed this ad nauseam before. It's more 'correct' for the missing values to return NaN (so the resulting Series is object dtype rather than boolean), which would then require filling before the result could be used for indexing. So we are effectively filling in False here (when there are NaNs).

Furthermore, a reordered Series is really NOT equal.
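
A minimal sketch of the two options described above, assuming a recent pandas release. The names left, right, filled and masked are illustrative and do not come from the PR; the explicit align call stands in for what an aligning comparison would do internally.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["A", "B", "C"])
s3 = pd.Series([3, 2], index=["C", "B"])

# Align explicitly on the union of labels; s3 gains a NaN at "A".
left, right = s1.align(s3, join="outer")

# Option 1: NaN compares unequal to everything, so "A" comes out as False.
# This is the "effectively filling in False" behaviour.
filled = left == right

# Option 2: keep the gap visible by blanking positions where either side
# is missing. The result is object dtype, not boolean.
both_present = left.notna() & right.notna()
masked = filled.astype(object).where(both_present)

print(filled)
print(masked)
```

Option 2 cannot be used directly as a boolean indexer, which is the trade-off mentioned above: the caller has to decide how to fill the missing positions first.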

@jreback
Contributor Author

jreback commented Apr 12, 2014

cc @Komnomnomnom

This is from test_json/test_ujson/testSeries

This test fails with this PR. Previously the values DID compare correctly, because the indexes were not aligned. Aligning causes the comparison to fail: nothing matches up, since one index is Int64 and the other is object.

This DOES look correct though, as deserializing does not guarantee that something that looks like a numerical index is actually numerical, right? (The exception is DatetimeIndex, for which we have a separate keyword argument.)

In [1]: s = Series([10, 20, 30, 40, 50, 60], name="series", index=[6,7,8,9,10,15])

In [2]: s.sort()

In [3]: import pandas.json as ujson

In [4]: outp = Series(ujson.decode(ujson.encode(s)))

In [6]: outp.sort()

In [7]: outp
Out[7]: 
6     10
7     20
8     30
9     40
10    50
15    60
dtype: int64

In [8]: outp.index
Out[8]: Index([u'6', u'7', u'8', u'9', u'10', u'15'], dtype='object')

@Komnomnomnom
Contributor

Yeah, the problem with JSON is that keys must be strings, so when you read them back you have no idea what type they originally were without some guesswork (which is what read_json does after calling decode / loads).

The lower-level 'decode' method that this test exercises gives you a string index back. That didn't matter during comparison before, as no alignment was happening. Your fix looks good, although I think it would work just as well to change the test Series to have a string index from the start, e.g.

In [20]: s = Series([10, 20, 30, 40, 50, 60], name="series", index=[str(s) for s in [6,7,8,9,10,15]])
In [26]: Series(ujson.decode(ujson.encode(s))).index
Out[26]: Index([u'10', u'15', u'6', u'7', u'8', u'9'], dtype='object')
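
The string-key behaviour can be reproduced without pandas.json. The sketch below uses the standard-library json module as a stand-in for ujson (an assumption made for illustration; the test in question uses ujson.encode / ujson.decode), and the names payload, decoded, common and restored are hypothetical.

```python
import json

import pandas as pd

s = pd.Series([10, 20, 30], index=[6, 7, 8])

# Round-trip through JSON text: integer keys are written out as strings.
payload = json.dumps({int(k): int(v) for k, v in s.items()})
decoded = pd.Series(json.loads(payload))

print(decoded.index.tolist())  # the labels are now strings

# The original and decoded labels share nothing, so an aligning
# comparison has no positions to match up.
common = set(s.index) & set(decoded.index)
print(common)

# Restoring integer labels is an explicit guess made by the caller.
restored = decoded.copy()
restored.index = pd.Index([int(label) for label in decoded.index])
print(s.eq(restored).all())
```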

@jorisvandenbossche
Member

@jreback Just wondering: another option would be to let only Series.eq (and the other named methods) do this flexible comparison with alignment, and keep == non-flexible.

With this change, df == df is non-flexible (it does not align and demands identical indices), while s == s is flexible and does align. Isn't that also a confusing inconsistency, or is there a good reason for it?
Admittedly it is also confusing that s + s aligns while s == s does not, but the same holds for a DataFrame, and I would rather keep consistency within one operator.

@cpcloud
Member

cpcloud commented Apr 12, 2014

+1 on @jorisvandenbossche's suggestion: named methods flexible, corresponding syntax is not. I personally find the unaligned error a useful sanity check.
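
A minimal sketch of the split suggested here, with the named method aligning and the operator staying strict. It assumes a recent pandas release, where to my knowledge == on differently labelled Series raises ValueError; older versions may behave differently, so the operator call is wrapped in try/except. The variable names are illustrative.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["A", "B", "C"])
s2 = pd.Series([3, 2, 1], index=["C", "B", "A"])
s3 = pd.Series([3, 2], index=["C", "B"])

# Named method: aligns on labels before comparing.
flexible = s1.eq(s2)

# fill_value substitutes for a label that is missing on one side only,
# so "A" is compared as 1 == 1 here.
with_fill = s1.eq(s3, fill_value=1)

# Operator: kept strict, so mismatched labels act as a sanity check.
try:
    strict = s1 == s2
except ValueError as exc:
    strict = None
    print("operator refused to compare:", exc)

print(flexible)
print(with_fill)
```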

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 28, 2014
@jreback
Contributor Author

jreback commented Apr 28, 2014

going to bump; can work on in next version

Labels
API Design, Enhancement, Missing-data

Successfully merging this pull request may close these issues.

pandas.Series.__eq__ is broken for series with different index
4 participants