Skip to content

Check array index fix #320

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

bmreiniger
Copy link
Contributor

@bmreiniger bmreiniger commented Oct 24, 2021

Closes #280.
Fixes #272, probably also #290, and supersedes #304.

Proposed Changes

Replaces consecutive calls to convert_input (on X) and convert_input_vector (on y) by a single convert_inputs to ensure that the indexes of the results match. This is necessary for proper functioning of encoders that group y by values of X, and convenient otherwise.

I don't like that convert_inputs is one character away from convert_input; other suggestions welcomed. One could convert all remaining convert_input calls to convert_inputs with the default y=None, so that convert_input would join convert_input_vector as only used inside convert_inputs.

I've also reduced the places where y gets cast to float, including that casting only when needed (in glmm where statsmodels would complain otherwise, and quantile where numpy.quantile would complain otherwise).

And since convert_input has a deep-copy option, I've consolidated a few of the copies into the convert_inputs; there are others that I've not consolidated, mostly because the copy happens further away in the code.

I'm not sure what needs to be done for a repository to "participate" in Hacktoberfest, but if it's as simple as a maintainer adding a label hacktoberfest-approved to the PR, I'd appreciate that.

Copy link
Collaborator

@PaulWestenthanner PaulWestenthanner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your pull request. It looks really nice!
Apart from my minor comments I would indeed suggest to only keep the convert_inputs function and also use it for the encoder that do not rely on a target y. In my opinion this makes the code more streamlined and hence easier to understand across different encoders. Do you agree or do you see any downside to this approach?
Regarding the hacktoberfest, I don't have the rights to add topics to a the project. I'd definitely see the requirements for a quality commit fulfilled. Maybe @wdm0006 could add us to the hacktoberfest.

if any(X.index != y.index):
raise ValueError("`X` and `y` both have indexes, but they do not match.")
if X.shape[0] != y.shape[0]:
raise ValueError(f"The length of X is {X.shape[0]} but length of y is {y.shape[0]}.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we still support python 3.5 (although we should probably change that). However, because of this we cannot work with f-strings at the moment

self.assertEqual(3, len(y))
self.assertTrue(list(X.index) == list(y.index) == bindex)

self.assertRaises(ValueError, convert_input_vector, barray, aseries)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you testing convert_input_vector here? shouldn't you rather test if an error is thrown by convert_inputs if indices of X and y differ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, oops, absolutely

@bmreiniger
Copy link
Contributor Author

I would indeed suggest to only keep the convert_inputs function and also use it for the encoder that do not rely on a target y. In my opinion this makes the code more streamlined and hence easier to understand across different encoders. Do you agree or do you see any downside to this approach?

I started working toward this and found one downside: to do this, we need to allow y=None in convert_inputs, but then we don't get the free check that y was provided in the supervised encoders. I could get around that with another parameter, or using y=False to indicate that y doesn't matter, but those seem rather less elegant.

@bmreiniger
Copy link
Contributor Author

bmreiniger commented Oct 27, 2021

Oh, and I need to check the two wrapper classes: they use convert_input on y to get a frame, but probably suffer from mismatched indexes as well.

...and the failed test seems to be a required cast to float, though I'm not sure how I missed that locally.

@PaulWestenthanner
Copy link
Collaborator

Ok I see the problem with using convert_inputs also for the unsupervised case. I agree that adding parameters is not a nice solution. I think the proper solution would be to create an abstract base class BaseEncoder and move the convert functions there. The BaseEncoder could have an abstract attribute supervised and then you just call self.convert_inputs(X, y=None). That should work right? Would you mind implementing this as well? Having a base class with this information was already suggested in #166 but never implemented.
Nice catch to point out the wrappers. I think the PolynomialWrapper would definitely benefit from converting X and y together (hence checking indices) and only then convert the series y to a dataframe.
The tests somehow only fail for python 3.5. Maybe that's some old numpy version as well. Did you check that locally with python 3.5 as well?

@bmreiniger
Copy link
Contributor Author

I think there's going to be a bit to discuss around setting up base class(es) and attributes. I'm happy to give it a go, but I'd rather merge this fix and take my time with that sort of refactoring.

I'll get a fix into the wrappers and check out the py3.5 issue as soon as I can.

@PaulWestenthanner
Copy link
Collaborator

Sounds like a good plan! Let's get this merged as soon as the python 3.5 tests succeed and proceed from there.

@bmreiniger bmreiniger mentioned this pull request Oct 29, 2021
@PaulWestenthanner
Copy link
Collaborator

Great work! Thank you. Looks good to me and merging :)

@PaulWestenthanner PaulWestenthanner merged commit cc0c4b9 into scikit-learn-contrib:master Oct 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Mismatched indexes in X,y esp. with sklearn pipelines Target Encoder outputs nan in Pipeline (or SimpleImputer+TargetEncoder)
3 participants