Skip to content

BUG: merge left and merge inner produce different index-order #33554

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
2 of 3 tasks
rwijtvliet opened this issue Apr 14, 2020 · 5 comments · Fixed by #54611
Closed
2 of 3 tasks

BUG: merge left and merge inner produce different index-order #33554

rwijtvliet opened this issue Apr 14, 2020 · 5 comments · Fixed by #54611
Labels
Docs Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@rwijtvliet
Copy link

According to the df.merge docstring and documentation, concerning the how parameter:

left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

(emphasis mine.) I understand this to mean, that the order of the left index will be preserved when using how='left' and how='inner' (or omitted). Of course, some keys might not be present in the second case, but that's beside the point here.

If my understanding is incorrect, my apologies and feel free to discard this report.

If my understanding is correct, however, I noticed today that this is not always true. Here is an example, where the order is not preserved.

Code Sample

import pandas as pd
import numpy as np

data = pd.DataFrame({'value':np.random.randint(-10,100,12)}, index=pd.date_range('2020-01-01', periods=12, freq='M'))
data['q'] = data.index.map(lambda ts: ts.quarter)
data['even'] = data.index.map(lambda ts: ts.month % 2 ==0)
cols = ['even', 'q']
av = data.groupby(cols).apply(lambda gr: gr[['value']].mean())
df1 = data.merge(av, how='inner', left_on=cols, suffixes=['', '_av'], right_index=True)
df2 = data.merge(av, how='left',  left_on=cols, suffixes=['', '_av'], right_index=True)

The dataframes:

data:
            value  q   even
2020-01-31     74  1  False
2020-02-29     87  1   True
2020-03-31     79  1  False
2020-04-30     74  2   True
2020-05-31     71  2  False
2020-06-30     80  2   True
2020-07-31     94  3  False
2020-08-31     19  3   True
2020-09-30     58  3  False
2020-10-31     97  4   True
2020-11-30      5  4  False
2020-12-31     16  4   True

av:
         value
even  q       
False 1   76.5
      2   71.0
      3   76.0
      4    5.0
True  1   87.0
      2   77.0
      3   19.0
      4   56.5

Output

df2: #as expected and wanted
            value  q   even  value_av
2020-01-31     74  1  False      76.5
2020-02-29     87  1   True      87.0
2020-03-31     79  1  False      76.5
2020-04-30     74  2   True      77.0
2020-05-31     71  2  False      71.0
2020-06-30     80  2   True      77.0
2020-07-31     94  3  False      76.0
2020-08-31     19  3   True      19.0
2020-09-30     58  3  False      76.0
2020-10-31     97  4   True      56.5
2020-11-30      5  4  False       5.0
2020-12-31     16  4   True      56.5

df1: #not as expected
            value  q   even  value_av
2020-01-31     74  1  False      76.5
2020-03-31     79  1  False      76.5
2020-02-29     87  1   True      87.0
2020-04-30     74  2   True      77.0
2020-06-30     80  2   True      77.0
2020-05-31     71  2  False      71.0
2020-07-31     94  3  False      76.0
2020-09-30     58  3  False      76.0
2020-08-31     19  3   True      19.0
2020-10-31     97  4   True      56.5
2020-12-31     16  4   True      56.5
2020-11-30      5  4  False       5.0
  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : None.None

pandas : 1.0.3
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 42.0.2.post20191203
Cython : 0.29.15
pytest : 5.4.1
hypothesis : 5.8.3
sphinx : 2.4.4
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.2
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : None
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.15
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.8
numba : 0.48.0

@rwijtvliet rwijtvliet added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 14, 2020
@dsaxton
Copy link
Member

dsaxton commented Apr 15, 2020

Here's a smaller example:

import pandas as pd

left = pd.DataFrame({"a": [1, 2, 1]})
right = pd.DataFrame({"a": [1, 2]})
print(pd.merge(left, right, how="inner", on="a"))
#    a
# 0  1
# 1  1
# 2  2

Does look like the documentation isn't entirely accurate here.

@dsaxton dsaxton added Docs Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 15, 2020
@rwijtvliet
Copy link
Author

rwijtvliet commented Apr 15, 2020

Hey, thanks for the comment! That's great, as I tried to create a smaller example myself, but in all cases pandas happened to give the expected answer :)

@rwijtvliet
Copy link
Author

Btw - is it a problem in the documentation, or rather in the implementation?

@dsaxton
Copy link
Member

dsaxton commented Apr 15, 2020

I suppose you could call it either, but personally I'd consider it more a documentation issue since it's promising behavior that's hard to define in general. In the example above, suppose the right DataFrame was instead

right = pd.DataFrame({"a": [1, 1, 2]})

so now the join is many-to-many. What does it mean to preserve order on the left? Are both [1, 1, 2, 1, 1] and [1, 1, 1, 2, 1] the "same" order as [1, 2, 1]? It's not as obvious when data is changing and wasn't ordered to begin with.

@phofl
Copy link
Member

phofl commented Oct 25, 2020

@dsaxton I think you could make a case for [1, 1, 2, 1, 1] in your example above. Taking every left key in the original order and matching it with values from the other df. Expescially since left produces this result (right too, should my pr get pulled). I would not expect [1, 1, 1, 2, 1]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants