BUG? merging on column of empty frame with index of right frame #15692

jorisvandenbossche · 2017-03-15T14:43:15Z

It is a rather specific corner case, but there has been a change in behaviour when merging an empty frame:

In [1]: pd.__version__
Out[1]: '0.19.2'

In [2]: left = pd.DataFrame(columns=['key', 'col_left'])

In [3]: left
Out[3]: 
Empty DataFrame
Columns: [key, col_left]
Index: []

In [4]: right = pd.DataFrame({'col_right': ['a', 'b', 'c']})

In [5]: right
Out[5]: 
  col_right
0         a
1         b
2         c

In [6]: left.merge(right, left_on='key', right_index=True, how="right")
Out[6]: 
   key col_left col_right
0    0      NaN         a
1    1      NaN         b
2    2      NaN         c

vs

In [10]: pd.__version__
Out[10]: u'0.18.1'

In [11]: left = pd.DataFrame(columns=['key', 'col_left'])

In [12]: left
Out[12]: 
Empty DataFrame
Columns: [key, col_left]
Index: []

In [13]: right = pd.DataFrame({'col_right': ['a', 'b', 'c']})

In [14]: right
Out[14]: 
  col_right
0         a
1         b
2         c

In [15]: left.merge(right, left_on='key', right_index=True, how="right")
Out[15]: 
   key col_left col_right
0  NaN      NaN         a
1  NaN      NaN         b
2  NaN      NaN         c

So with 0.19 the 'key' column has values, in 0.18 this holds NaNs. The key column comes from the empty frame (so it had no values, how can it have values now?), but is merged with the index of the left frame (and this has of course values -> should these end up in the 'key' column of the resulting frame?)
It is such a strange case, that I am actually not sure which of both is the expected behaviour .. (and also not sure if this was an intentional change in behaviour).

Encountered here: geopandas/geopandas#422

The text was updated successfully, but these errors were encountered:

jreback · 2017-03-15T14:54:11Z

changed here: e8d9e79

this tries to coerce keys back to the original dtype, might be buggy

jorisvandenbossche · 2017-03-15T15:01:20Z

So the question is, when not merging on a common column (in this case left:column and right:index), do we want both in the output, and should both be unique? (or only unique in the overlapping values?)

Some further exploration (with the same example as above):

Settting the right index to other values makes it clear the index and key column duplicates the 'merge' values:

In [7]: right = pd.DataFrame({'col_right': ['a', 'b', 'c']}, index=[2,3,4])

In [8]: left.merge(right, left_on='key', right_index=True, how="right")
Out[8]: 
   key col_left col_right
2    2      NaN         a
3    3      NaN         b
4    4      NaN         c

When the left frame is not fully empty, but just missing values in comparison to right frame, you also get strange behaviour:

In [10]: left = pd.DataFrame({'key':[3], 'col_left':['e']})

In [11]: left
Out[11]: 
  col_left  key
0        e    3

In [12]: left.merge(right, left_on='key', right_index=True, how="right")
Out[12]: 
  col_left  key col_right
0        e    3         b
0      NaN    2         a
0      NaN    4         c

The 'key' column is also filled like before, but now the index is gone (and not even default values, but all 0's)

jorisvandenbossche · 2017-03-15T15:07:44Z

Behaviour in 0.18 of the above:

In [18]: left = pd.DataFrame({'key':[3], 'col_left':['e']})

In [19]: left.merge(right, left_on='key', right_index=True, how="right")
Out[19]: 
  col_left  key col_right
0        e  3.0         b
0      NaN  2.0         a
0      NaN  4.0         c

So the bug with the 0's in the index is the same, and also here the key column gets filled in (only not retained dtype, that is what was fixed in 0.19). So probably the filling of the key column is then the desired behaviour ?

jorisvandenbossche · 2017-03-15T15:08:54Z

The behaviour of the resulting index is rather buggy:

In [15]: left
Out[15]: 
  col_left  key
0        e    3

## All zero's

In [16]: left.merge(right, left_on='key', right_index=True, how="right")
Out[16]: 
  col_left  key col_right
0        e    3         b
0      NaN    2         a
0      NaN    4         c

In [17]: left = pd.DataFrame({'key':[2,3,4], 'col_left':['e', 'd', 'g']})

## Incorrect "default" index (or not sure where that values are coming from)

In [18]: left.merge(right, left_on='key', right_index=True, how="right")
Out[18]: 
  col_left  key col_right
0        e    2         a
1        d    3         b
2        g    4         c

## This seems correct?

In [19]: left = pd.DataFrame({'key':[0,1,2,3,4,5], 'col_left':['e', 'd', 'g', 'h
    ...: ', 'b', 'p']})

In [20]: left.merge(right, left_on='key', right_index=True, how="right")
Out[20]: 
  col_left  key col_right
2        g    2         a
3        h    3         b
4        b    4         c

jreback · 2017-03-15T15:10:47Z

yeah I would say the index of the result is wrong. It should be just a range index. Maybe not getting setup somehow.

In [10]: left
Out[10]: 
  col_left  key
0        e    3

In [11]: right
Out[11]: 
  col_right
2         a
3         b
4         c

In [12]: left.merge(right, left_on='key', right_index=True, how="right", indicator=True)
Out[12]: 
  col_left  key col_right      _merge
0        e    3         b        both
0      NaN    2         a  right_only
0      NaN    4         c  right_only

jorisvandenbossche · 2017-03-15T15:20:42Z

It should be just a range index.

It is a right join on the index (at least on the right index), so an option is also to preserve the right index. It makes that you have the merge values duplicate (in column and in index), but the same is true when merging on two different named columns.

BTW, when merging on two different columns, the key columns also don't get filled in on non-matching rows:

In [33]: left = pd.DataFrame({'key_left': [3], 'col_left': ['e']})

In [34]: right = pd.DataFrame({'col_right': ['a', 'b', 'c'], 'key_right':[2,3,4]
    ...: })

In [35]: left.merge(right, left_on='key_left', right_on='key_right', how='right'
    ...: )
Out[35]: 
  col_left  key_left col_right  key_right
0        e       3.0         b          3
1      NaN       NaN         a          2
2      NaN       NaN         c          4

So this is a further inconsistency with the left_on='key', right_index=True case.

randomgambit · 2017-03-21T19:59:05Z

seems to work correctly when using dates for the keys ... except for the weird index.


left = pd.DataFrame({'key':[pd.to_datetime('2016-01-02')], 'col_left':['e']})
left
Out[78]: 
  col_left        key
0        e 2016-01-02

right = pd.DataFrame({'col_right': ['a', 'b', 'c']}, index=pd.date_range('2016-01-01', periods =3, freq = 'D'))
right
Out[79]: 
           col_right
2016-01-01         a
2016-01-02         b
2016-01-03         c


left.merge(right, how = 'right', left_on = 'key', right_index = True)
Out[80]: 
  col_left        key col_right
0        e 2016-01-02         b
0      NaN 2016-01-01         a
0      NaN 2016-01-03         c

#now works correctly with reset_index
left.merge(right.reset_index(), how = 'right', left_on = 'key', right_on = 'index')
Out[81]: 
  col_left        key      index col_right
0        e 2016-01-02 2016-01-02         b
1      NaN        NaT 2016-01-01         a
2      NaN        NaT 2016-01-03         c

jorisvandenbossche · 2017-03-21T22:12:13Z

@randomgambit sorry, I don't see how this example is different from the example without dates. The buggy index (all 0's) is exactly one of the issues.

randomgambit · 2017-03-21T22:21:26Z

hello cher monsieur @jorisvandenbossche !

yes you are actually right. I misread my console. Sorry about that. Well, I guess my point is that this bug carries over also for datetime variables :) I also noticed this pb a couple of times, and I always do my merges after having reset the index anyways

jorisvandenbossche · 2017-03-21T22:44:48Z

No problem!
Yes, indeed, the inconsistencies / special cases are mainly caused by mixing merging on column and index.

jreback added Bug Difficulty Intermediate Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Mar 15, 2017

jreback added this to the Next Major Release milestone Mar 15, 2017

jbrockmendel removed Effort Medium labels Oct 21, 2019

phofl mentioned this issue May 31, 2020

BUG: Fix value setting in case of merging via index on one side and column on other side. #34496

Closed

9 tasks

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

debnathshoham mentioned this issue Oct 28, 2022

BUG: join on column with index #49360

Closed

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG? merging on column of empty frame with index of right frame #15692

BUG? merging on column of empty frame with index of right frame #15692

jorisvandenbossche commented Mar 15, 2017

jreback commented Mar 15, 2017

jorisvandenbossche commented Mar 15, 2017

jorisvandenbossche commented Mar 15, 2017

jorisvandenbossche commented Mar 15, 2017

jreback commented Mar 15, 2017

jorisvandenbossche commented Mar 15, 2017

randomgambit commented Mar 21, 2017

jorisvandenbossche commented Mar 21, 2017

randomgambit commented Mar 21, 2017

jorisvandenbossche commented Mar 21, 2017

BUG? merging on column of empty frame with index of right frame #15692

BUG? merging on column of empty frame with index of right frame #15692

Comments

jorisvandenbossche commented Mar 15, 2017

jreback commented Mar 15, 2017

jorisvandenbossche commented Mar 15, 2017

jorisvandenbossche commented Mar 15, 2017

jorisvandenbossche commented Mar 15, 2017

jreback commented Mar 15, 2017

jorisvandenbossche commented Mar 15, 2017

randomgambit commented Mar 21, 2017

jorisvandenbossche commented Mar 21, 2017

randomgambit commented Mar 21, 2017

jorisvandenbossche commented Mar 21, 2017