Skip to content

BUG? merging on column of empty frame with index of right frame #15692

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
jorisvandenbossche opened this issue Mar 15, 2017 · 10 comments
Open
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@jorisvandenbossche
Copy link
Member

It is a rather specific corner case, but there has been a change in behaviour when merging an empty frame:

In [1]: pd.__version__
Out[1]: '0.19.2'

In [2]: left = pd.DataFrame(columns=['key', 'col_left'])

In [3]: left
Out[3]: 
Empty DataFrame
Columns: [key, col_left]
Index: []

In [4]: right = pd.DataFrame({'col_right': ['a', 'b', 'c']})

In [5]: right
Out[5]: 
  col_right
0         a
1         b
2         c

In [6]: left.merge(right, left_on='key', right_index=True, how="right")
Out[6]: 
   key col_left col_right
0    0      NaN         a
1    1      NaN         b
2    2      NaN         c

vs

In [10]: pd.__version__
Out[10]: u'0.18.1'

In [11]: left = pd.DataFrame(columns=['key', 'col_left'])

In [12]: left
Out[12]: 
Empty DataFrame
Columns: [key, col_left]
Index: []

In [13]: right = pd.DataFrame({'col_right': ['a', 'b', 'c']})

In [14]: right
Out[14]: 
  col_right
0         a
1         b
2         c

In [15]: left.merge(right, left_on='key', right_index=True, how="right")
Out[15]: 
   key col_left col_right
0  NaN      NaN         a
1  NaN      NaN         b
2  NaN      NaN         c

So with 0.19 the 'key' column has values, in 0.18 this holds NaNs. The key column comes from the empty frame (so it had no values, how can it have values now?), but is merged with the index of the left frame (and this has of course values -> should these end up in the 'key' column of the resulting frame?)
It is such a strange case, that I am actually not sure which of both is the expected behaviour .. (and also not sure if this was an intentional change in behaviour).

Encountered here: geopandas/geopandas#422

@jreback
Copy link
Contributor

jreback commented Mar 15, 2017

changed here: e8d9e79

this tries to coerce keys back to the original dtype, might be buggy

@jorisvandenbossche
Copy link
Member Author

So the question is, when not merging on a common column (in this case left:column and right:index), do we want both in the output, and should both be unique? (or only unique in the overlapping values?)

Some further exploration (with the same example as above):

Settting the right index to other values makes it clear the index and key column duplicates the 'merge' values:

In [7]: right = pd.DataFrame({'col_right': ['a', 'b', 'c']}, index=[2,3,4])

In [8]: left.merge(right, left_on='key', right_index=True, how="right")
Out[8]: 
   key col_left col_right
2    2      NaN         a
3    3      NaN         b
4    4      NaN         c

When the left frame is not fully empty, but just missing values in comparison to right frame, you also get strange behaviour:

In [10]: left = pd.DataFrame({'key':[3], 'col_left':['e']})

In [11]: left
Out[11]: 
  col_left  key
0        e    3

In [12]: left.merge(right, left_on='key', right_index=True, how="right")
Out[12]: 
  col_left  key col_right
0        e    3         b
0      NaN    2         a
0      NaN    4         c

The 'key' column is also filled like before, but now the index is gone (and not even default values, but all 0's)

@jorisvandenbossche
Copy link
Member Author

Behaviour in 0.18 of the above:

In [18]: left = pd.DataFrame({'key':[3], 'col_left':['e']})

In [19]: left.merge(right, left_on='key', right_index=True, how="right")
Out[19]: 
  col_left  key col_right
0        e  3.0         b
0      NaN  2.0         a
0      NaN  4.0         c

So the bug with the 0's in the index is the same, and also here the key column gets filled in (only not retained dtype, that is what was fixed in 0.19). So probably the filling of the key column is then the desired behaviour ?

@jorisvandenbossche
Copy link
Member Author

The behaviour of the resulting index is rather buggy:

In [15]: left
Out[15]: 
  col_left  key
0        e    3

## All zero's

In [16]: left.merge(right, left_on='key', right_index=True, how="right")
Out[16]: 
  col_left  key col_right
0        e    3         b
0      NaN    2         a
0      NaN    4         c

In [17]: left = pd.DataFrame({'key':[2,3,4], 'col_left':['e', 'd', 'g']})

## Incorrect "default" index (or not sure where that values are coming from)

In [18]: left.merge(right, left_on='key', right_index=True, how="right")
Out[18]: 
  col_left  key col_right
0        e    2         a
1        d    3         b
2        g    4         c

## This seems correct?

In [19]: left = pd.DataFrame({'key':[0,1,2,3,4,5], 'col_left':['e', 'd', 'g', 'h
    ...: ', 'b', 'p']})

In [20]: left.merge(right, left_on='key', right_index=True, how="right")
Out[20]: 
  col_left  key col_right
2        g    2         a
3        h    3         b
4        b    4         c

@jreback
Copy link
Contributor

jreback commented Mar 15, 2017

yeah I would say the index of the result is wrong. It should be just a range index. Maybe not getting setup somehow.

In [10]: left
Out[10]: 
  col_left  key
0        e    3

In [11]: right
Out[11]: 
  col_right
2         a
3         b
4         c

In [12]: left.merge(right, left_on='key', right_index=True, how="right", indicator=True)
Out[12]: 
  col_left  key col_right      _merge
0        e    3         b        both
0      NaN    2         a  right_only
0      NaN    4         c  right_only

@jreback jreback added Bug Difficulty Intermediate Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Mar 15, 2017
@jreback jreback added this to the Next Major Release milestone Mar 15, 2017
@jorisvandenbossche
Copy link
Member Author

It should be just a range index.

It is a right join on the index (at least on the right index), so an option is also to preserve the right index. It makes that you have the merge values duplicate (in column and in index), but the same is true when merging on two different named columns.

BTW, when merging on two different columns, the key columns also don't get filled in on non-matching rows:

In [33]: left = pd.DataFrame({'key_left': [3], 'col_left': ['e']})

In [34]: right = pd.DataFrame({'col_right': ['a', 'b', 'c'], 'key_right':[2,3,4]
    ...: })

In [35]: left.merge(right, left_on='key_left', right_on='key_right', how='right'
    ...: )
Out[35]: 
  col_left  key_left col_right  key_right
0        e       3.0         b          3
1      NaN       NaN         a          2
2      NaN       NaN         c          4

So this is a further inconsistency with the left_on='key', right_index=True case.

@randomgambit
Copy link

seems to work correctly when using dates for the keys ... except for the weird index.


left = pd.DataFrame({'key':[pd.to_datetime('2016-01-02')], 'col_left':['e']})
left
Out[78]: 
  col_left        key
0        e 2016-01-02

right = pd.DataFrame({'col_right': ['a', 'b', 'c']}, index=pd.date_range('2016-01-01', periods =3, freq = 'D'))
right
Out[79]: 
           col_right
2016-01-01         a
2016-01-02         b
2016-01-03         c


left.merge(right, how = 'right', left_on = 'key', right_index = True)
Out[80]: 
  col_left        key col_right
0        e 2016-01-02         b
0      NaN 2016-01-01         a
0      NaN 2016-01-03         c

#now works correctly with reset_index
left.merge(right.reset_index(), how = 'right', left_on = 'key', right_on = 'index')
Out[81]: 
  col_left        key      index col_right
0        e 2016-01-02 2016-01-02         b
1      NaN        NaT 2016-01-01         a
2      NaN        NaT 2016-01-03         c

@jorisvandenbossche
Copy link
Member Author

@randomgambit sorry, I don't see how this example is different from the example without dates. The buggy index (all 0's) is exactly one of the issues.

@randomgambit
Copy link

hello cher monsieur @jorisvandenbossche !

yes you are actually right. I misread my console. Sorry about that. Well, I guess my point is that this bug carries over also for datetime variables :) I also noticed this pb a couple of times, and I always do my merges after having reset the index anyways

@jorisvandenbossche
Copy link
Member Author

No problem!
Yes, indeed, the inconsistencies / special cases are mainly caused by mixing merging on column and index.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants