-
Notifications
You must be signed in to change notification settings - Fork 400
OrdinalEncoder fails when the input is entirely numeric or can be converted to numeric values #229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
So this happens because on the call to X = X.apply(lambda x: pd.to_numeric(x, errors='ignore')) Then runs if not list(self.cols):
return X if self.return_df else X.values I don't really see why that silent conversion to numeric values is done. I think that the ideal solution is to remove it, but I suspect that that will be a little bit less than ideally backwards compatible. It will mean that people who fail to explicitly specify columns to encode OR convert intended-to-be-numeric object columns to numeric values prior to fitting an OrdinalEncoder will accidentally encode features that are intended to be numeric, which probably means a warning should be raised. Alternatively, I think that we could edit ordinal.py to change how X is returned when if not list(self.cols):
return X if self.return_df or override_return_df else X.values Which is how it's handled at the end of the if self.return_df or override_return_df:
return X
else:
return X.values Thoughts? |
Nice bug report. I removed the silent conversion. All the current tests + some new tests seem to work fine. Do you have a code where the removal of the silent conversion would change the behaviour? I didn't find anything. |
I think so yeah. So for example an array containing numeric data formatted as string that is not intended to be encoded as categories. Under the current behaviour, this is basically just converted to numbers, which will work fine: >>> import numpy as np
>>> from category_encoders import OrdinalEncoder
>>> X = np.array(
... [
... ['A','1'],
... ['2','9'],
... ['3','3.5'],
... ['4','4'],
... ['5','5']
... ]
...)
>>> oe = OrdinalEncoder(return_df=False)
>>> oe.fit_transform(X)
array([[1. , 1. ],
[2. , 9. ],
[3. , 3.5],
[4. , 4. ],
[5. , 5. ]]) So say the first column is categorical, and that's the column the user intends to encode. The second is numeric data, and the user does not intend to encode it. It's simply converted as is into a numeric format, and so all is hunky dory. After this change though it will be treated as another categorical column and encoded, so I would expect the output to change to this: array([[1. , 1. ],
[2. , 2. ],
[3. , 3. ],
[4. , 4. ],
[5. , 5. ]]) It will still work downstream in an sklearn estimator or whatever, but that's kind of A Bad Thing™ because it means the code will still function (and thus the user isn't alerted by a failure to the fact that something is wrong) but the output of a model trained on that data is likely to suffer, since the meaning of that feature will be ruined. I still think that that's the right approach, but it probably needs a warning or something. |
I get it now. I think it deserves a warning in the release notes like:
If you want a warning in the code, propose the code (with a description how to silent it). |
Sure, I can do that. I'll probably reference sklearn's ChangedBehaviorWarning from sklearn.exceptions, unless you'd rather I don't do that. I'll make the warning message be something like:
|
Sounds good to me. |
The example given works now. Probably this was fixed when the whole |
Uh oh!
There was an error while loading. Please reload this page.
Summary
OrdinalEncoder.fit()
throws an exception when the input values are entirely numeric (I.E.[1, 2, 3, 4, 5]
) or can be converted to be numeric (I.E.['001', '002', '003', '004', '005']
, or even worse strings containing whitespace like[' 01 ', ' 02 ', ' 03 ']
).Version of category_encoders
2.1.0
Reproducible code
Expected Outcome
The encoder is fit and returns itself.
Actual Outcome
The following exception is thrown:
This was preventing me from using OrdinalEncoder to encode some Item ID numbers, since they are comprised entirely of numeric values (albeit stored as text). By way of explanation as to why I was doing it in the first place; the Item IDs themselves are usually some long and unwieldy number like 00311684374, and moreover there's a bunch of stuff that specifically wants to know the number of unique categories in a feature, for which a simple
.max()
on the encoded column can be employed. There are other things in a machine learning Pipeline (Keras Embedding layers being the example that have triggered this) that expect all values in a feature to be within the range 0:max_encoded_value, and that's harder to code for without actually encoding the values. Finally, although they're currently all numeric, the field in the source of the data is very much _alpha_numeric, and I definitely can't guarantee that they will remain numeric only in the future. Currently, I can't future-proof that possibility using this class.The text was updated successfully, but these errors were encountered: