Check array index fix #320

bmreiniger · 2021-10-24T21:33:05Z

Closes #280.
Fixes #272, probably also #290, and supersedes #304.

Proposed Changes

Replaces consecutive calls to convert_input (on X) and convert_input_vector (on y) by a single convert_inputs to ensure that the indexes of the results match. This is necessary for proper functioning of encoders that group y by values of X, and convenient otherwise.

I don't like that convert_inputs is one character away from convert_input; other suggestions welcomed. One could convert all remaining convert_input calls to convert_inputs with the default y=None, so that convert_input would join convert_input_vector as only used inside convert_inputs.

I've also reduced the places where y gets cast to float, including that casting only when needed (in glmm where statsmodels would complain otherwise, and quantile where numpy.quantile would complain otherwise).

And since convert_input has a deep-copy option, I've consolidated a few of the copies into the convert_inputs; there are others that I've not consolidated, mostly because the copy happens further away in the code.

I'm not sure what needs to be done for a repository to "participate" in Hacktoberfest, but if it's as simple as a maintainer adding a label hacktoberfest-approved to the PR, I'd appreciate that.

PaulWestenthanner

Thanks for your pull request. It looks really nice!
Apart from my minor comments I would indeed suggest to only keep the convert_inputs function and also use it for the encoder that do not rely on a target y. In my opinion this makes the code more streamlined and hence easier to understand across different encoders. Do you agree or do you see any downside to this approach?
Regarding the hacktoberfest, I don't have the rights to add topics to a the project. I'd definitely see the requirements for a quality commit fulfilled. Maybe @wdm0006 could add us to the hacktoberfest.

PaulWestenthanner · 2021-10-26T16:22:38Z

category_encoders/utils.py

+        if any(X.index != y.index):
+            raise ValueError("`X` and `y` both have indexes, but they do not match.")
+        if X.shape[0] != y.shape[0]:
+            raise ValueError(f"The length of X is {X.shape[0]} but length of y is {y.shape[0]}.")


we still support python 3.5 (although we should probably change that). However, because of this we cannot work with f-strings at the moment

PaulWestenthanner · 2021-10-26T16:32:50Z

tests/test_utils.py

+        self.assertEqual(3, len(y))
+        self.assertTrue(list(X.index) == list(y.index) == bindex)
+
+        self.assertRaises(ValueError, convert_input_vector, barray, aseries)


why are you testing convert_input_vector here? shouldn't you rather test if an error is thrown by convert_inputs if indices of X and y differ?

oh, oops, absolutely

bmreiniger · 2021-10-27T02:11:39Z

I would indeed suggest to only keep the convert_inputs function and also use it for the encoder that do not rely on a target y. In my opinion this makes the code more streamlined and hence easier to understand across different encoders. Do you agree or do you see any downside to this approach?

I started working toward this and found one downside: to do this, we need to allow y=None in convert_inputs, but then we don't get the free check that y was provided in the supervised encoders. I could get around that with another parameter, or using y=False to indicate that y doesn't matter, but those seem rather less elegant.

bmreiniger · 2021-10-27T13:18:02Z

Oh, and I need to check the two wrapper classes: they use convert_input on y to get a frame, but probably suffer from mismatched indexes as well.

...and the failed test seems to be a required cast to float, though I'm not sure how I missed that locally.

PaulWestenthanner · 2021-10-27T19:34:37Z

Ok I see the problem with using convert_inputs also for the unsupervised case. I agree that adding parameters is not a nice solution. I think the proper solution would be to create an abstract base class BaseEncoder and move the convert functions there. The BaseEncoder could have an abstract attribute supervised and then you just call self.convert_inputs(X, y=None). That should work right? Would you mind implementing this as well? Having a base class with this information was already suggested in #166 but never implemented.
Nice catch to point out the wrappers. I think the PolynomialWrapper would definitely benefit from converting X and y together (hence checking indices) and only then convert the series y to a dataframe.
The tests somehow only fail for python 3.5. Maybe that's some old numpy version as well. Did you check that locally with python 3.5 as well?

bmreiniger · 2021-10-28T20:49:36Z

I think there's going to be a bit to discuss around setting up base class(es) and attributes. I'm happy to give it a go, but I'd rather merge this fix and take my time with that sort of refactoring.

I'll get a fix into the wrappers and check out the py3.5 issue as soon as I can.

PaulWestenthanner · 2021-10-29T07:09:23Z

Sounds like a good plan! Let's get this merged as soon as the python 3.5 tests succeed and proceed from there.

PaulWestenthanner · 2021-10-29T15:40:19Z

Great work! Thank you. Looks good to me and merging :)

bmreiniger added 3 commits October 24, 2021 14:01

Introduce simultaneous pandas-conversion of X,y

607bcb1

Use convert_inputs; refactor shape checks into it and allow y=None

51f54e1

Refactor some copies into convert_inputs

1c175d6

PaulWestenthanner reviewed Oct 26, 2021

View reviewed changes

Fix tests for check_inputs errors

dec84d7

Remove f-strings for python compatibility

35efc92

wdm0006 added the Hacktoberfest label Oct 27, 2021

Put back another float-cast for older dependencies

6af444d

bmreiniger added 3 commits October 29, 2021 08:53

Extend convert_inputs to wrapper classes

63603b8

Don't cast (multiclass, object) targets to float

daa3030

Add more-global test for pandas/numpy index mismatch

f925d29

bmreiniger mentioned this pull request Oct 29, 2021

Fix ohe nan col #322

Merged

PaulWestenthanner merged commit cc0c4b9 into scikit-learn-contrib:master Oct 29, 2021

PaulWestenthanner added the hacktoberfest-accepted label Oct 29, 2021

PaulWestenthanner mentioned this pull request Nov 3, 2021

Fix bad WOE scores. #304

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Check array index fix #320

Check array index fix #320

Uh oh!

bmreiniger commented Oct 24, 2021 •

edited

Loading

Uh oh!

PaulWestenthanner left a comment

Uh oh!

PaulWestenthanner Oct 26, 2021

Uh oh!

PaulWestenthanner Oct 26, 2021

Uh oh!

bmreiniger Oct 27, 2021

Uh oh!

bmreiniger commented Oct 27, 2021

Uh oh!

bmreiniger commented Oct 27, 2021 •

edited

Loading

Uh oh!

PaulWestenthanner commented Oct 27, 2021

Uh oh!

bmreiniger commented Oct 28, 2021

Uh oh!

PaulWestenthanner commented Oct 29, 2021

Uh oh!

PaulWestenthanner commented Oct 29, 2021

Uh oh!

Uh oh!

Check array index fix #320

Check array index fix #320

Uh oh!

Conversation

bmreiniger commented Oct 24, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Proposed Changes

Uh oh!

PaulWestenthanner left a comment

Choose a reason for hiding this comment

Uh oh!

PaulWestenthanner Oct 26, 2021

Choose a reason for hiding this comment

Uh oh!

PaulWestenthanner Oct 26, 2021

Choose a reason for hiding this comment

Uh oh!

bmreiniger Oct 27, 2021

Choose a reason for hiding this comment

Uh oh!

bmreiniger commented Oct 27, 2021

Uh oh!

bmreiniger commented Oct 27, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

PaulWestenthanner commented Oct 27, 2021

Uh oh!

bmreiniger commented Oct 28, 2021

Uh oh!

PaulWestenthanner commented Oct 29, 2021

Uh oh!

PaulWestenthanner commented Oct 29, 2021

Uh oh!

Uh oh!

bmreiniger commented Oct 24, 2021 •

edited

Loading

bmreiniger commented Oct 27, 2021 •

edited

Loading