
Encoders not compatible with sklearn pipelines now #265


Closed
nassimtaleb opened this issue Aug 11, 2020 · 6 comments

Comments

@nassimtaleb

Hi, while looking at the code I realized that the encoders use the 'y' argument of transform to switch between the 'train' behaviour and the 'test' behaviour. This does not seem correct, since calling fit_transform on a sklearn pipeline first calls fit and then transform without the 'y' parameter. https://github.com/scikit-learn/scikit-learn/blob/0fb307bf39bbdacd6ed713c00724f8f871d60370/sklearn/pipeline.py#L742

An easy fix would be to define fit_transform directly and use 'y' there to get the 'train' behaviour, and keep only the 'test' behaviour in transform.
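
For illustration, a minimal sketch of that split on a toy mean-target encoder (this is a hypothetical class written for this issue, not the actual category_encoders implementation):

```python
# Hypothetical sketch, not the actual category_encoders code.
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class SketchTargetEncoder(BaseEstimator, TransformerMixin):
    """Toy mean-target encoder illustrating the fit_transform / transform split."""

    def fit(self, X, y):
        # Learn the per-category target mean, i.e. the 'test'-time mapping.
        y = pd.Series(np.asarray(y), index=X.index)
        self.mapping_ = y.groupby(X.iloc[:, 0]).mean()
        self.global_mean_ = y.mean()
        return self

    def transform(self, X):
        # 'Test' behaviour only: apply the learned mapping, no access to y.
        col = X.iloc[:, 0].map(self.mapping_).fillna(self.global_mean_)
        return col.to_frame(X.columns[0])

    def fit_transform(self, X, y=None, **fit_params):
        # 'Train' behaviour lives here, where y is guaranteed to be passed:
        # encode each row by the mean of the *other* rows in its category.
        self.fit(X, y)
        y = pd.Series(np.asarray(y), index=X.index)
        cats = X.iloc[:, 0]
        sums = y.groupby(cats).transform("sum")
        counts = y.groupby(cats).transform("count")
        loo = ((sums - y) / (counts - 1)).where(counts > 1, self.global_mean_)
        return loo.to_frame(X.columns[0])
```

The idea is that Pipeline.fit picks up the encoder's own fit_transform (so the leave-one-out style 'train' behaviour sees y), while Pipeline.transform and Pipeline.predict only ever call transform, which no longer needs y.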

@janmotl
Collaborator

janmotl commented Aug 11, 2020

Which encoders are we talking about?

@nassimtaleb
Author

LeaveOneOut for sure, but I think all of them have the same issue. The only way out is by calling cvwrapper, which does apply fit_transform.

@datacubeR

Are there any updates on this? I have no experience resolving issues, but is there any way I can contribute to solving this one? I'm having a hard time dealing with pipelines combined with TargetEncoder(), and I would like to get this fixed.

Any guidelines will be deeply appreciated

@janmotl
Collaborator

janmotl commented Sep 27, 2020

The transform method in LeaveOneOut is supposed to behave differently on training data and testing data, and it is known that this causes issues with sklearn pipelines. Nevertheless, unsupervised encoders can behave like any other encoder in sklearn, and possibly some supervised encoders can behave like that as well (I do not know which ones, if any).
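
For context, a minimal sketch of that asymmetry (column names made up; the exact numbers depend on the category_encoders version and its smoothing/noise parameters):

```python
import pandas as pd
from category_encoders import LeaveOneOutEncoder

X = pd.DataFrame({"city": ["a", "a", "b", "b", "b"]})
y = pd.Series([1, 0, 1, 1, 0])

enc = LeaveOneOutEncoder(cols=["city"])

# Training-time behaviour: y is available, so each row is encoded by the
# target mean of the *other* rows in its category (leave-one-out).
train_encoded = enc.fit_transform(X, y)

# Test-time behaviour: no y, so every row gets the plain per-category mean.
test_encoded = enc.transform(X)

print(train_encoded)
print(test_encoded)
```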

What can be done to make the situation better:

  1. Improve the documentation (e.g.: explicitly mark encoders that violate sklearn's expectations and mention possible workarounds).
  2. Write unit tests to ensure that encoders, which should work with sklearn pipelines, actually work (and will continue to work) with sklearn pipelines.
  3. If you modify the encoders, perform tests on real data to make sure that the generalization ability of the downstream classifiers/regressors does not degrade (e.g.: look in examples/benchmarking_large). One of the big issues with supervised encoders is that they may cause severe overfitting of the downstream models.

For contributing, check CONTRIBUTING.md. The issue is important and should not be left unfixed. However, I am not active in the project anymore. See #248.

@bmreiniger
Contributor

I think this is handled by #246, which adds a custom fit_transform that then uses transform(X, y). I suggest closing, unless @nassimtaleb or someone else has a minimal example demonstrating there's still an issue?
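
For reference, a minimal check along those lines might look like the sketch below (TargetEncoder chosen arbitrarily; assumes recent category_encoders and scikit-learn releases):

```python
import pandas as pd
from category_encoders import TargetEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X = pd.DataFrame({"city": ["a", "a", "b", "b", "b", "a"]})
y = pd.Series([1, 0, 1, 1, 0, 1])

pipe = Pipeline([
    ("encode", TargetEncoder(cols=["city"])),
    ("clf", LogisticRegression()),
])

# Pipeline.fit should route y to the encoder's custom fit_transform,
# while Pipeline.predict calls transform without y.
pipe.fit(X, y)
print(pipe.predict(X))
```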

@PaulWestenthanner
Collaborator

Closing as suggested by @bmreiniger.
