Fix pd.merge to preserve ExtensionArrays dtypes #20745

Merged: 5 commits, Apr 22, 2018

jorisvandenbossche
Member

Closes #20743

jorisvandenbossche added the ExtensionArray label (Extending pandas with custom dtypes or arrays) on Apr 19, 2018
jorisvandenbossche added this to the 0.23.0 milestone on Apr 19, 2018
@@ -5541,7 +5541,7 @@ def concatenate_join_units(join_units, concat_axis, copy):
     if len(to_concat) == 1:
         # Only one block, nothing to concatenate.
         concat_values = to_concat[0]
-        if copy and concat_values.base is not None:
+        if copy and getattr(concat_values, 'base', 1) is not None:

Member Author

jorisvandenbossche, Apr 19, 2018

This is kind of a hack; I would like to have a better solution.

(Related to the discussion earlier today in #20721 (comment) about deprecating Index.base.)

Contributor

you should only check base if concat_values is an ndarray

Member Author

@jreback updated
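
The suggestion in this thread can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `maybe_copy` and `FakeExtensionArray` are hypothetical names invented here. The idea is to inspect `.base` only when the values are a NumPy ndarray, and to copy unconditionally for anything else.

```python
import numpy as np


class FakeExtensionArray:
    """Hypothetical stand-in for an array-like without a ``base`` attribute."""

    def __init__(self, values):
        self.values = list(values)

    def copy(self):
        return FakeExtensionArray(self.values)


def maybe_copy(concat_values, copy):
    """Hypothetical helper: copy a single block's values when requested."""
    if copy:
        if isinstance(concat_values, np.ndarray):
            # For an ndarray, a copy is only needed when it is a view
            # onto another array (``base`` is not None).
            if concat_values.base is not None:
                concat_values = concat_values.copy()
        else:
            # Non-ndarray values may not expose ``base``; always copy them.
            concat_values = concat_values.copy()
    return concat_values


owner = np.arange(3)
view = owner[:2]
ext = FakeExtensionArray([1, 2])

same = maybe_copy(owner, copy=True)     # owns its data, returned as-is
copied = maybe_copy(view, copy=True)    # a view, so a fresh array is made
ext_copy = maybe_copy(ext, copy=True)   # no ``base`` lookup at all
```

Compared with `getattr(concat_values, 'base', 1)`, the explicit `isinstance` check avoids relying on a sentinel default for objects that have no `base`.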

@jorisvandenbossche
Member Author

Note to self: 1 actual failing test (rest are network failures):

___________________________ TestReshaping.test_merge ___________________________
[gw1] linux -- Python 3.6.5 /home/travis/miniconda3/envs/pandas/bin/python
self = <pandas.tests.extension.json.test_json.TestReshaping object at 0x7fb92d208dd8>
data = JSONArary([{'D': 91, 'k': 84, 'L': 34, 'm': 63, 'C': 8, 'O': 1, 'R': 5}, {'T': 49, 'o': 46}, {'g': 81, 'O': 8, 'p': 54...': 32, 'p': 0, 'J': 89, 'd': 32, 's': 22}, {'W': 70, 'Y': 68, 'D': 87, 'k': 52}, {'F': 29, 'h': 10, 'S': 41, 'y': 24}])
na_value = {}
    def test_merge(self, data, na_value):
    
        df1 = pd.DataFrame({'int1': [1, 2, 3], 'key': [0, 1, 2],
                            'ext': data[:3]})
        df2 = pd.DataFrame({'int2': [1, 2, 3, 4], 'key': [0, 0, 1, 3]})
    
        res = pd.merge(df1, df2)
        exp = pd.DataFrame(
            {'int1': [1, 1, 2], 'int2': [1, 2, 3], 'key': [0, 0, 1],
             'ext': data._constructor_from_sequence(
                 [data[0], data[0], data[1]])})
>       self.assert_frame_equal(res, exp[['ext', 'int1', 'key', 'int2']])
pandas/tests/extension/base/reshaping.py:110: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

E       AssertionError: DataFrame.columns are different
E       
E       DataFrame.columns values are different (75.0 %)
E       [left]:  Index(['int1', 'key', 'ext', 'int2'], dtype='object')
E       [right]: Index(['ext', 'int1', 'key', 'int2'], dtype='object')

Not seeing this locally, and also not failing for Decimal, so the column order seems to depend on the extension dtype? (which is a bit strange)
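
For readers without the extension test fixtures, the scenario in the failing test can be approximated as below. This is only a sketch under assumptions: it uses the nullable `Int64` dtype (which was added to pandas after this PR) as a stand-in for the `data` fixture, and writes the dict keys in the order the test expects the columns.

```python
import pandas as pd

# "ext" uses an extension dtype; Int64 is an assumption standing in
# for the JSON/Decimal arrays used by the real test suite.
df1 = pd.DataFrame({
    "ext": pd.array([10, 20, 30], dtype="Int64"),
    "int1": [1, 2, 3],
    "key": [0, 1, 2],
})
df2 = pd.DataFrame({"int2": [1, 2, 3, 4], "key": [0, 0, 1, 3]})

# Default merge: inner join on the shared "key" column.
res = pd.merge(df1, df2)

columns = list(res.columns)
ext_dtype = str(res["ext"].dtype)
ext_values = res["ext"].tolist()
```

On a pandas version with this fix, the extension dtype of `ext` should survive the merge instead of being converted to object.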

@@ -95,3 +95,24 @@ def test_set_frame_overwrite_object(self, data):
         df = pd.DataFrame({"A": [1] * len(data)}, dtype=object)
         df['A'] = data
         assert df.dtypes['A'] == data.dtype
+
+    def test_merge(self, data, na_value):

Contributor

Add the issue number here.

@TomAugspurger
Contributor

I can reproduce your failure locally. Looking into it now.

@TomAugspurger
Contributor

TomAugspurger commented Apr 20, 2018

Ah, there is a bug in BaseDecimal.assert_frame_equal: it does not check the order of all the columns. It only compared the extension and non-extension columns independently.

So the test is broken for all types, not just JSON.
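
The kind of gap described here can be illustrated with a small sketch. The two helper functions below are hypothetical and written for this example only (they are not the pandas test helpers): one compares the "special" columns and the remaining columns independently, the other additionally checks the overall column order first.

```python
import pandas as pd


def frames_equal_per_group(left, right, special):
    """Hypothetical: compare special columns one by one, then the rest."""
    for col in special:
        if not left[col].equals(right[col]):
            return False
    # The remaining columns are compared as a sub-frame, so only their
    # relative order matters, not their position in the full frame.
    return left.drop(columns=special).equals(right.drop(columns=special))


def frames_equal_with_order(left, right, special):
    """Hypothetical: same comparison, plus a global column-order check."""
    if list(left.columns) != list(right.columns):
        return False
    return frames_equal_per_group(left, right, special)


left = pd.DataFrame({
    "int1": [1, 1, 2],
    "key": [0, 0, 1],
    "ext": ["a", "a", "b"],   # plain strings stand in for extension values
    "int2": [1, 2, 3],
})
right = left[["ext", "int1", "key", "int2"]]

per_group = frames_equal_per_group(left, right, ["ext"])
with_order = frames_equal_with_order(left, right, ["ext"])
```

Here the per-group comparison accepts the two frames even though `ext` sits in a different position, while the version with the global order check rejects them.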

The Decimal base class failed to check the global order of columns. Fixed
that as well.
@TomAugspurger
Copy link
Contributor

Pushed a commit fixing the Decimal tests base class and adjusting the expected order. Hopefully this will make things pass.

@codecov

codecov bot commented Apr 20, 2018

Codecov Report

Merging #20745 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #20745      +/-   ##
==========================================
+ Coverage   91.84%   91.85%   +<.01%     
==========================================
  Files         153      153              
  Lines       49303    49307       +4     
==========================================
+ Hits        45282    45289       +7     
+ Misses       4021     4018       -3
| Flag | Coverage Δ |
|---|---|
| #multiple | 90.24% <100%> (ø) ⬆️ |
| #single | 41.89% <66.66%> (-0.01%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/core/internals.py | 95.57% <100%> (+0.03%) ⬆️ |
| pandas/core/dtypes/common.py | 94.64% <100%> (ø) ⬆️ |
| pandas/core/indexes/datetimelike.py | 96.72% <0%> (ø) ⬆️ |
| pandas/core/frame.py | 97.16% <0%> (ø) ⬆️ |
| pandas/util/testing.py | 84.79% <0%> (+0.2%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 78fee04...9cf8cfe.

@jorisvandenbossche
Member Author

Thanks for the fix! The order was indeed the culprit, but I changed it again so that the columns in the original test frame are in alphabetical order, so that it behaves the same on Python < 3.6.
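
A small sketch of why alphabetical keys sidestep the problem, assuming (as I understand the behaviour at the time) that a DataFrame built from a dict had its columns sorted on Python < 3.6, while newer Python and pandas versions keep insertion order. The frame below is illustrative, with plain strings in place of the extension data.

```python
import pandas as pd

# Keys are written in alphabetical order, so "sorted keys" and
# "insertion order" produce the same column layout.
df1 = pd.DataFrame({
    "ext": ["a", "b", "c"],
    "int1": [1, 2, 3],
    "key": [0, 1, 2],
})

columns = list(df1.columns)
```

With this layout the expected frame in the test no longer depends on which ordering rule is in effect.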

@jorisvandenbossche
Member Author

For Travis, the remaining failure is an s3 one, so this seems to be passing now.

@@ -95,3 +95,24 @@ def test_set_frame_overwrite_object(self, data):
         df = pd.DataFrame({"A": [1] * len(data)}, dtype=object)
         df['A'] = data
         assert df.dtypes['A'] == data.dtype
+
+    def test_merge(self, data, na_value):

Contributor

You should probably test with the how=join_type fixture.

Member Author

The different join types each need a different expected result, so I don't think it is really worth it in this case.
(The test is also not meant to fully cover the merge function, since we already have other tests for that; it only checks that the basic use cases of concatenation work with extension arrays.)

Contributor

OK, it's worth doing way more tests than a single use case, but OK here I guess.
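
To make the point in this thread concrete, here is a rough sketch of how the result changes with `how`. It reuses the integer frames from the test with the extension column left out for brevity, and only counts rows; a full test would need a separate expected frame per join type.

```python
import pandas as pd

df1 = pd.DataFrame({"int1": [1, 2, 3], "key": [0, 1, 2]})
df2 = pd.DataFrame({"int2": [1, 2, 3, 4], "key": [0, 0, 1, 3]})

# Each join type keeps a different set of keys, so the expected
# result differs for every value of ``how``.
row_counts = {
    how: len(pd.merge(df1, df2, how=how))
    for how in ["inner", "left", "right", "outer"]
}
```

The inner join keeps only keys 0 and 1, the left join adds the unmatched key 2, the right join adds the unmatched key 3, and the outer join keeps both.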

jreback merged commit 0ae7e90 into pandas-dev:master on Apr 22, 2018