Target Encoder outputs nan in Pipeline (or SimpleImputer+TargetEncoder) #272

Closed
datacubeR opened this issue Sep 26, 2020 · 3 comments · Fixed by #320
@datacubeR

I have been modelling using the ames_housing dataset with the code attached in the following zip file.

rep_example.zip

The weird thing is that, on a dataset with no nulls, adding a SimpleImputer along with a TargetEncoder() makes several null values appear.

I'm not sure if I'm doing something wrong, but running SimpleImputer on data with no null values should change nothing. In fact, I ran each step separately and SimpleImputer does not output any null values. But once its NumPy array goes into TargetEncoder(), the encoder outputs more than 2000 nulls.

Why is that?

Expected Behavior

If no nulls are provided, no nulls should be output. See the attached notebook, which shows this when running the TargetEncoder on its own.

image

Actual Behavior

image

image

Steps to Reproduce the Problem

Refer to attached notebook with example code.

Specifications

  • Version: 2.2.2
  • Platform: Windows 10
  • Subsystem: Python 3.7.7

Thanks Guys,

Alfonso

@datacubeR
Author

Just to add something extra: I've noticed that running the encoder inside a Pipeline and running it on its own calculates completely different values:

image

image

You can see some NaN coming up in columns 2 and 3. I think it is because they are numbers, and since they come from a NumPy array the encoder has no way to determine the incoming dtype.
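To illustrate the dtype point, here is a minimal pandas-only sketch of what happens to column dtypes when mixed data passes through a NumPy array. The column names and values are made up for illustration, and it does not show that dtype loss is what produces the NaN values.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type frame: one string column, one integer column.
df = pd.DataFrame({"street": ["Pave", "Grvl", "Pave"], "year": [2006, 2007, 2008]})

# A mixed-type frame collapses to a single object-dtype array.
arr = df.to_numpy()

# Rebuilding a DataFrame from that array keeps every column as object.
rebuilt = pd.DataFrame(arr, columns=df.columns)

# infer_objects() asks pandas to re-detect a better dtype per column.
recovered = rebuilt.infer_objects()

print(df.dtypes, rebuilt.dtypes, recovered.dtypes, sep="\n\n")
```

If the numeric columns need their dtype downstream, `infer_objects()` (or an explicit `astype`) after the array round trip is one way to restore it.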

I updated to pandas 1.1.2 to check if that helped but I had no luck.

Not sure if this has to do with #266 or #265.

Will rolling back to a previous version help?

Thanks,

@datacubeR
Author

I've noticed that this only happens when the input is, or used to be, a NumPy array. So if some kind of underlying metadata tells the encoder that the data comes from NumPy, it gets tangled up and produces null values.

Reading some other issues out there, I noticed that this could be related. Sorry to post so much; I'm just trying to contribute as much as I can to solve this issue.

@salmanea

Just use reset_index before encoding. It will solve the problem.
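One plausible reason resetting the index helps is index misalignment: an array-based step returns data without the original index, while the target keeps it, and pandas aligns on index labels rather than row position. The sketch below is pandas-only and is not the category_encoders implementation. `mean_encode`, the column name `cat`, and all values are made up for illustration, and the thread does not confirm this as the root cause.

```python
import numpy as np
import pandas as pd

# Target with a non-default index, e.g. left over from a shuffle or a row filter.
y = pd.Series([10.0, 20.0, 30.0, 40.0], index=[3, 7, 9, 0], name="target")

# Data rebuilt from a NumPy array gets a fresh RangeIndex 0..3.
X = pd.DataFrame({"cat": np.array(["a", "b", "a", "c"], dtype=object)})

def mean_encode(X, y):
    """Naive per-category target mean (hypothetical helper, not the library's code)."""
    joined = X.assign(target=y)  # pandas aligns y to X by index label
    means = joined.groupby("cat")["target"].mean()
    return X["cat"].map(means)

# Labels only partly overlap, so some rows get no target value.
misaligned = mean_encode(X, y)

# Dropping the old index makes both objects share 0..3 again.
aligned = mean_encode(X, y.reset_index(drop=True))

print(misaligned.tolist())
print(aligned.tolist())
```

In this toy case the misaligned call yields both a NaN and different means for the other categories, which matches the two symptoms reported above. Calling `reset_index(drop=True)` on the target (and on the features, if they are still a DataFrame) before encoding removes the mismatch.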
