Fix pd.merge to preserve ExtensionArrays dtypes #20745

jorisvandenbossche · 2018-04-19T14:09:48Z

Closes #20743

jorisvandenbossche · 2018-04-19T14:11:00Z

pandas/core/internals.py

@@ -5541,7 +5541,7 @@ def concatenate_join_units(join_units, concat_axis, copy):
    if len(to_concat) == 1:
        # Only one block, nothing to concatenate.
        concat_values = to_concat[0]
-        if copy and concat_values.base is not None:
+        if copy and getattr(concat_values, 'base', 1) is not None:


This is kind of a hack; I would like to have a better solution.

(Related to the discussion earlier today in #20721 about deprecating Index.base.)

you should only check base if concat_values is an ndarray

@jreback updated
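
A minimal sketch of the ndarray-only check suggested above. `copy_if_needed` is a hypothetical helper, not the code in pandas/core/internals.py, and it assumes the body of the original `if` makes a copy (the diff shows only the condition).

```python
import numpy as np


def copy_if_needed(values, copy=True):
    """Return `values`, copied when `copy` is requested.

    Hypothetical helper for illustration, not the pandas implementation.
    """
    if not copy:
        return values
    if isinstance(values, np.ndarray):
        # For an ndarray, `.base` is None when the array owns its memory.
        # Only a view (base is not None) is copied, mirroring the
        # condition in the diff.
        return values.copy() if values.base is not None else values
    # Anything else (for example an extension array) may not expose
    # `.base`, so rely on its own `copy()` method instead.
    return values.copy()
```

This avoids the `getattr(..., 'base', 1)` sentinel: the `.base` attribute is only consulted for objects that are known to have it.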

jorisvandenbossche · 2018-04-19T14:58:54Z

Note to self: 1 actual failing test (rest are network failures):

___________________________ TestReshaping.test_merge ___________________________
[gw1] linux -- Python 3.6.5 /home/travis/miniconda3/envs/pandas/bin/python
self = <pandas.tests.extension.json.test_json.TestReshaping object at 0x7fb92d208dd8>
data = JSONArary([{'D': 91, 'k': 84, 'L': 34, 'm': 63, 'C': 8, 'O': 1, 'R': 5}, {'T': 49, 'o': 46}, {'g': 81, 'O': 8, 'p': 54...': 32, 'p': 0, 'J': 89, 'd': 32, 's': 22}, {'W': 70, 'Y': 68, 'D': 87, 'k': 52}, {'F': 29, 'h': 10, 'S': 41, 'y': 24}])
na_value = {}
    def test_merge(self, data, na_value):
    
        df1 = pd.DataFrame({'int1': [1, 2, 3], 'key': [0, 1, 2],
                            'ext': data[:3]})
        df2 = pd.DataFrame({'int2': [1, 2, 3, 4], 'key': [0, 0, 1, 3]})
    
        res = pd.merge(df1, df2)
        exp = pd.DataFrame(
            {'int1': [1, 1, 2], 'int2': [1, 2, 3], 'key': [0, 0, 1],
             'ext': data._constructor_from_sequence(
                 [data[0], data[0], data[1]])})
>       self.assert_frame_equal(res, exp[['ext', 'int1', 'key', 'int2']])
pandas/tests/extension/base/reshaping.py:110: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

E       AssertionError: DataFrame.columns are different
E       
E       DataFrame.columns values are different (75.0 %)
E       [left]:  Index(['int1', 'key', 'ext', 'int2'], dtype='object')
E       [right]: Index(['ext', 'int1', 'key', 'int2'], dtype='object')

Not seeing this locally, and also not failing for Decimal, so the column order seems to depend on the extension dtype? (which is a bit strange)

jreback · 2018-04-19T15:53:34Z

pandas/core/internals.py

@@ -5541,7 +5541,7 @@ def concatenate_join_units(join_units, concat_axis, copy):
    if len(to_concat) == 1:
        # Only one block, nothing to concatenate.
        concat_values = to_concat[0]
-        if copy and concat_values.base is not None:
+        if copy and getattr(concat_values, 'base', 1) is not None:


you should only check base if concat_values is an ndarray

jreback · 2018-04-19T15:53:47Z

pandas/tests/extension/base/reshaping.py

@@ -95,3 +95,24 @@ def test_set_frame_overwrite_object(self, data):
        df = pd.DataFrame({"A": [1] * len(data)}, dtype=object)
        df['A'] = data
        assert df.dtypes['A'] == data.dtype
+
+    def test_merge(self, data, na_value):
+


issue number

TomAugspurger · 2018-04-20T11:33:42Z

I can reproduce your failure locally. Looking into it now.

TomAugspurger · 2018-04-20T11:54:29Z

Ah, there is a bug in BaseDecimal.assert_frame_equal: it does not check the order of all the columns. It only compared extension and non-extension columns independently.

So the test is broken for all types, not just JSON.

The Decimal base class failed to check the global order of columns. Fixed that as well.
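
A rough sketch of the kind of check described above: compare the column labels of both frames, in order, before comparing any values. `assert_columns_in_same_order` is a hypothetical name, and the plain integer `ext` column is only a stand-in for an extension array; this is not the actual test base class.

```python
import pandas as pd


def assert_columns_in_same_order(left, right):
    """Compare column labels position by position before anything else."""
    # assert_index_equal raises AssertionError when the labels differ,
    # including when the same labels appear in a different order.
    pd.testing.assert_index_equal(left.columns, right.columns)


left = pd.DataFrame({'int1': [1], 'key': [0], 'ext': [9], 'int2': [1]},
                    columns=['int1', 'key', 'ext', 'int2'])
# Same columns and data, reordered as in the failing test's expected frame.
right = left[['ext', 'int1', 'key', 'int2']]

try:
    assert_columns_in_same_order(left, right)
    order_mismatch_detected = False
except AssertionError:
    order_mismatch_detected = True
```

A comparison that looks at extension and non-extension columns separately would accept these two frames, because each subset matches on its own.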

TomAugspurger · 2018-04-20T12:04:26Z

Pushed a commit fixing the Decimal tests base class and adjusting the expected order. Hopefully this will make things pass.

codecov · 2018-04-20T15:28:32Z

Codecov Report

Merging #20745 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #20745      +/-   ##
==========================================
+ Coverage   91.84%   91.85%   +<.01%     
==========================================
  Files         153      153              
  Lines       49303    49307       +4     
==========================================
+ Hits        45282    45289       +7     
+ Misses       4021     4018       -3

Flag	Coverage Δ
#multiple	`90.24% <100%> (ø)`	⬆️
#single	`41.89% <66.66%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/internals.py	`95.57% <100%> (+0.03%)`	⬆️
pandas/core/dtypes/common.py	`94.64% <100%> (ø)`	⬆️
pandas/core/indexes/datetimelike.py	`96.72% <0%> (ø)`	⬆️
pandas/core/frame.py	`97.16% <0%> (ø)`	⬆️
pandas/util/testing.py	`84.79% <0%> (+0.2%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 78fee04...9cf8cfe.

jorisvandenbossche · 2018-04-20T15:33:51Z

Thanks for the fix! The order was indeed the culprit, but I changed it again to be alphabetical in the original test frame, so it behaves the same for Python < 3.6.
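
To illustrate the point about ordering, here is a small sketch assuming that a DataFrame built from a dict keeps insertion order on Python >= 3.6 and falls back to sorted keys on older versions. Writing the keys alphabetically makes both behaviours agree. The integer `ext` column is a stand-in for an extension array, not the test's real data.

```python
import pandas as pd

# Keys are written in alphabetical order, so sorting them and keeping
# insertion order produce the same column layout.
df1 = pd.DataFrame({'ext': [10, 20, 30], 'int1': [1, 2, 3], 'key': [0, 1, 2]})
df2 = pd.DataFrame({'int2': [1, 2, 3, 4], 'key': [0, 0, 1, 3]})

# Default merge: inner join on the shared 'key' column.
res = pd.merge(df1, df2)
result_columns = list(res.columns)
```

An alternative with the same effect would be to pass an explicit `columns=` list to the constructor.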

jorisvandenbossche · 2018-04-20T16:34:11Z

For Travis, the remaining failure is an s3 one, so this seems to be passing now.

jreback · 2018-04-21T16:43:30Z

pandas/tests/extension/base/reshaping.py

@@ -95,3 +95,24 @@ def test_set_frame_overwrite_object(self, data):
        df = pd.DataFrame({"A": [1] * len(data)}, dtype=object)
        df['A'] = data
        assert df.dtypes['A'] == data.dtype
+
+    def test_merge(self, data, na_value):


you prob should test with the how=join_type fixture.

Each join type would need a different expected result, so I don't think it is really worth it in this case.
(The test is also not meant to fully cover the merge function (we already have other tests for that), just to check that the basic use cases of concatenating work with extension arrays.)

OK, it's worth doing way more tests than a single use case, but OK here I guess.
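
For context on why each join type would need its own expected frame, this sketch runs the test's two integer frames (without the extension column) through every `how` value and records the number of result rows. It is an illustration only, not a proposed test.

```python
import pandas as pd

df1 = pd.DataFrame({'int1': [1, 2, 3], 'key': [0, 1, 2]})
df2 = pd.DataFrame({'int2': [1, 2, 3, 4], 'key': [0, 0, 1, 3]})

# Key 2 exists only in df1 and key 3 only in df2, so the joins differ
# in which unmatched rows they keep.
rows_per_join = {
    how: len(pd.merge(df1, df2, how=how))
    for how in ['inner', 'left', 'right', 'outer']
}
```

The unmatched rows are also where missing values would appear in the extension column, so a parametrized version would have to spell out the expected missing-value entries for each join type.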

Fix pd.merge to preserve ExtensionArrays dtypes

73d64eb

jorisvandenbossche added the ExtensionArray Extending pandas with custom dtypes or arrays. label Apr 19, 2018

jorisvandenbossche added this to the 0.23.0 milestone Apr 19, 2018

jorisvandenbossche commented Apr 19, 2018

jreback requested changes Apr 19, 2018

jorisvandenbossche mentioned this pull request Apr 19, 2018

DEPR: Series ndarray properties (strides, data, base, itemsize, flags) #20721

Merged

4 tasks

TomAugspurger added 2 commits April 20, 2018 07:02

Fixed order.

716e928

The Decimal base class failed to check the global order of columns. Fixed that as well.

Added issue number

8824a47

jorisvandenbossche added 2 commits April 20, 2018 17:24

copy: check for arrays

884510c

change order in tests for python < 3.6

9cf8cfe

TomAugspurger approved these changes Apr 20, 2018

jreback requested changes Apr 21, 2018

jreback approved these changes Apr 22, 2018

jreback merged commit 0ae7e90 into pandas-dev:master Apr 22, 2018
