BUG: merge left and merge inner produce different index-order #33554

rwijtvliet · 2020-04-14T23:21:35Z

According to the df.merge docstring and documentation, concerning the how parameter:

left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

(emphasis mine.) I understand this to mean, that the order of the left index will be preserved when using how='left' and how='inner' (or omitted). Of course, some keys might not be present in the second case, but that's beside the point here.

If my understanding is incorrect, my apologies and feel free to discard this report.

If my understanding is correct, however, I noticed today that this is not always true. Here is an example, where the order is not preserved.

Code Sample

import pandas as pd
import numpy as np

data = pd.DataFrame({'value':np.random.randint(-10,100,12)}, index=pd.date_range('2020-01-01', periods=12, freq='M'))
data['q'] = data.index.map(lambda ts: ts.quarter)
data['even'] = data.index.map(lambda ts: ts.month % 2 ==0)
cols = ['even', 'q']
av = data.groupby(cols).apply(lambda gr: gr[['value']].mean())
df1 = data.merge(av, how='inner', left_on=cols, suffixes=['', '_av'], right_index=True)
df2 = data.merge(av, how='left',  left_on=cols, suffixes=['', '_av'], right_index=True)

The dataframes:

data:
            value  q   even
2020-01-31     74  1  False
2020-02-29     87  1   True
2020-03-31     79  1  False
2020-04-30     74  2   True
2020-05-31     71  2  False
2020-06-30     80  2   True
2020-07-31     94  3  False
2020-08-31     19  3   True
2020-09-30     58  3  False
2020-10-31     97  4   True
2020-11-30      5  4  False
2020-12-31     16  4   True

av:
         value
even  q       
False 1   76.5
      2   71.0
      3   76.0
      4    5.0
True  1   87.0
      2   77.0
      3   19.0
      4   56.5

Output

df2: #as expected and wanted
            value  q   even  value_av
2020-01-31     74  1  False      76.5
2020-02-29     87  1   True      87.0
2020-03-31     79  1  False      76.5
2020-04-30     74  2   True      77.0
2020-05-31     71  2  False      71.0
2020-06-30     80  2   True      77.0
2020-07-31     94  3  False      76.0
2020-08-31     19  3   True      19.0
2020-09-30     58  3  False      76.0
2020-10-31     97  4   True      56.5
2020-11-30      5  4  False       5.0
2020-12-31     16  4   True      56.5

df1: #not as expected
            value  q   even  value_av
2020-01-31     74  1  False      76.5
2020-03-31     79  1  False      76.5
2020-02-29     87  1   True      87.0
2020-04-30     74  2   True      77.0
2020-06-30     80  2   True      77.0
2020-05-31     71  2  False      71.0
2020-07-31     94  3  False      76.0
2020-09-30     58  3  False      76.0
2020-08-31     19  3   True      19.0
2020-10-31     97  4   True      56.5
2020-12-31     16  4   True      56.5
2020-11-30      5  4  False       5.0

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : None.None

pandas : 1.0.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 42.0.2.post20191203
Cython : 0.29.15
pytest : 5.4.1
hypothesis : 5.8.3
sphinx : 2.4.4
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : None
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.15
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.8
numba : 0.48.0

The text was updated successfully, but these errors were encountered:

dsaxton · 2020-04-15T01:35:17Z

Here's a smaller example:

import pandas as pd

left = pd.DataFrame({"a": [1, 2, 1]})
right = pd.DataFrame({"a": [1, 2]})
print(pd.merge(left, right, how="inner", on="a"))
#    a
# 0  1
# 1  1
# 2  2

Does look like the documentation isn't entirely accurate here.

rwijtvliet · 2020-04-15T07:51:53Z

Hey, thanks for the comment! That's great, as I tried to create a smaller example myself, but in all cases pandas happened to give the expected answer :)

rwijtvliet · 2020-04-15T11:39:18Z

Btw - is it a problem in the documentation, or rather in the implementation?

dsaxton · 2020-04-15T14:11:30Z

I suppose you could call it either, but personally I'd consider it more a documentation issue since it's promising behavior that's hard to define in general. In the example above, suppose the right DataFrame was instead

right = pd.DataFrame({"a": [1, 1, 2]})

so now the join is many-to-many. What does it mean to preserve order on the left? Are both [1, 1, 2, 1, 1] and [1, 1, 1, 2, 1] the "same" order as [1, 2, 1]? It's not as obvious when data is changing and wasn't ordered to begin with.

phofl · 2020-10-25T22:25:56Z

@dsaxton I think you could make a case for [1, 1, 2, 1, 1] in your example above. Taking every left key in the original order and matching it with values from the other df. Expescially since left produces this result (right too, should my pr get pulled). I would not expect [1, 1, 1, 2, 1]

rwijtvliet added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 14, 2020

dsaxton added Docs Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 15, 2020

simonjayhawkins mentioned this issue Jul 23, 2020

BUG: Inconsistent ordering of rows when merging with how=left and how=right #35382

Closed

3 tasks

chla0089 mentioned this issue Aug 10, 2022

BUG: merge: order when merging on index not preserved #48021

Closed

3 tasks

lukemanley mentioned this issue Aug 18, 2023

BUG: merge not always following documented sort behavior #54611

Merged

10 tasks

mroeschke closed this as completed in #54611 Aug 23, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: merge left and merge inner produce different index-order #33554

BUG: merge left and merge inner produce different index-order #33554

rwijtvliet commented Apr 14, 2020

INSTALLED VERSIONS

dsaxton commented Apr 15, 2020

rwijtvliet commented Apr 15, 2020 •

edited

Loading

rwijtvliet commented Apr 15, 2020

dsaxton commented Apr 15, 2020

phofl commented Oct 25, 2020 •

edited

Loading

BUG: merge left and merge inner produce different index-order #33554

BUG: merge left and merge inner produce different index-order #33554

Comments

rwijtvliet commented Apr 14, 2020

Code Sample

Output

Output of pd.show_versions()

INSTALLED VERSIONS

dsaxton commented Apr 15, 2020

rwijtvliet commented Apr 15, 2020 • edited Loading

rwijtvliet commented Apr 15, 2020

dsaxton commented Apr 15, 2020

phofl commented Oct 25, 2020 • edited Loading

Output of `pd.show_versions()`

rwijtvliet commented Apr 15, 2020 •

edited

Loading

phofl commented Oct 25, 2020 •

edited

Loading