
Encoders not compatible with sklearn pipelines now #265


Closed
nassimtaleb opened this issue Aug 11, 2020 · 6 comments

Comments

@nassimtaleb

Hi, while looking at the code I realized that the encoders use the 'y' argument of transform to switch between the 'train' behaviour and the 'test' behaviour. This does not seem correct, since calling fit_transform on a sklearn pipeline first calls fit and then transform without the 'y' parameter. https://github.com/scikit-learn/scikit-learn/blob/0fb307bf39bbdacd6ed713c00724f8f871d60370/sklearn/pipeline.py#L742

An easy fix would be to define fit_transform directly and use 'y' there to get the 'train' behaviour, and keep only the 'test' behaviour in transform.
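
For illustration, a minimal sketch of that split on a toy mean-target encoder (this is a hypothetical class written for this issue, not the actual category_encoders implementation):

```python
# Hypothetical sketch, not the actual category_encoders code.
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class SketchTargetEncoder(BaseEstimator, TransformerMixin):
    """Toy mean-target encoder illustrating the fit_transform / transform split."""

    def fit(self, X, y):
        # Learn the per-category target mean, i.e. the 'test'-time mapping.
        y = pd.Series(np.asarray(y), index=X.index)
        self.mapping_ = y.groupby(X.iloc[:, 0]).mean()
        self.global_mean_ = y.mean()
        return self

    def transform(self, X):
        # 'Test' behaviour only: apply the learned mapping, no access to y.
        col = X.iloc[:, 0].map(self.mapping_).fillna(self.global_mean_)
        return col.to_frame(X.columns[0])

    def fit_transform(self, X, y=None, **fit_params):
        # 'Train' behaviour lives here, where y is guaranteed to be passed:
        # encode each row by the mean of the *other* rows in its category.
        self.fit(X, y)
        y = pd.Series(np.asarray(y), index=X.index)
        cats = X.iloc[:, 0]
        sums = y.groupby(cats).transform("sum")
        counts = y.groupby(cats).transform("count")
        loo = ((sums - y) / (counts - 1)).where(counts > 1, self.global_mean_)
        return loo.to_frame(X.columns[0])
```

The idea is that Pipeline.fit picks up the encoder's own fit_transform (so the leave-one-out style 'train' behaviour sees y), while Pipeline.transform and Pipeline.predict only ever call transform, which no longer needs y.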

@janmotl
Collaborator

janmotl commented Aug 11, 2020

Which encoders are we talking about?

@nassimtaleb
Author

LeaveOneOut for sure, but I think all of them have the same issue. The only way out is by calling cvwrapper, which does apply fit_transform.

@datacubeR

Are there any updates on this? I have no experience resolving issues, but is there any way I can contribute to solving this one? I'm having a hard time dealing with pipelines combined with TargetEncoder(), and I would like to get this fixed.

Any guidelines will be deeply appreciated

@janmotl
Collaborator

janmotl commented Sep 27, 2020

The transform method in LeaveOneOut is supposed to behave differently on training data and testing data, and it is known that this causes issues with sklearn pipelines. Nevertheless, unsupervised encoders can behave like any other encoder in sklearn, and possibly some supervised encoders can behave like that as well (I do not know which ones, if any).
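
For context, a minimal sketch of that asymmetry (column names made up; the exact numbers depend on the category_encoders version and its smoothing/noise parameters):

```python
import pandas as pd
from category_encoders import LeaveOneOutEncoder

X = pd.DataFrame({"city": ["a", "a", "b", "b", "b"]})
y = pd.Series([1, 0, 1, 1, 0])

enc = LeaveOneOutEncoder(cols=["city"])

# Training-time behaviour: y is available, so each row is encoded by the
# target mean of the *other* rows in its category (leave-one-out).
train_encoded = enc.fit_transform(X, y)

# Test-time behaviour: no y, so every row gets the plain per-category mean.
test_encoded = enc.transform(X)

print(train_encoded)
print(test_encoded)
```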

What can be done to make the situation better:

  1. Improve the documentation (e.g.: explicitly mark encoders that violate sklearn's expectations and mention possible workarounds).
  2. Write unit tests to ensure that encoders, which should work with sklearn pipelines, actually work (and will continue to work) with sklearn pipelines.
  3. If you modify the encoders, perform tests on real data to make sure that the generalization ability of the downstream classifiers/regressors does not degrade (e.g.: look in examples/benchmarking_large). One of the big issues with supervised encoders is that they may cause severe overfitting of the downstream models.

For contributing, check CONTRIBUTING.md. The issue is important and should not be left unfixed. However, I am not active in the project anymore. See #248.

@bmreiniger
Contributor

I think this is handled by #246, which adds a custom fit_transform that then uses transform(X, y). I suggest closing, unless @nassimtaleb or someone else has a minimal example demonstrating there's still an issue?
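
For reference, a minimal check along those lines might look like the sketch below (TargetEncoder chosen arbitrarily; assumes recent category_encoders and scikit-learn releases):

```python
import pandas as pd
from category_encoders import TargetEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X = pd.DataFrame({"city": ["a", "a", "b", "b", "b", "a"]})
y = pd.Series([1, 0, 1, 1, 0, 1])

pipe = Pipeline([
    ("encode", TargetEncoder(cols=["city"])),
    ("clf", LogisticRegression()),
])

# Pipeline.fit should route y to the encoder's custom fit_transform,
# while Pipeline.predict calls transform without y.
pipe.fit(X, y)
print(pipe.predict(X))
```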

@PaulWestenthanner
Collaborator

Closing as suggested by @bmreiniger.
