Set correct missing value indicator in astype for categorical #45012
phofl commented Dec 22, 2021
- tests added / passed
- Ensure all linting tests pass, see here for how to run them
The failure comes from:
The question is whether we want to support np.str in the first place...
pandas/core/dtypes/missing.py
Outdated
@@ -640,6 +640,9 @@ def is_valid_na_for_dtype(obj, dtype: DtypeObj) -> bool:
        # Numeric
        return obj is not NaT and not isinstance(obj, (np.datetime64, np.timedelta64))

    elif dtype == np.dtype(str):
hmm
what is this about?
do any tests rely on this?
Yep, unfortunately:
FAILED pandas/tests/frame/methods/test_astype.py::TestAstypeCategorical::test_astype_categorical_to_string_missing
It checks that df.astype(str) is equal to df.astype("category").astype(str).
If we do not check this here, the float np.nan is considered a valid missing value for numpy string dtypes instead of the object np.nan, which is what is used when df.astype(str) is called directly.
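For reference, the comparison that the failing test makes can be sketched roughly as follows. This is only an illustration that assumes a small frame with two strings and one missing value; it is not the test body itself, and whether the two results match depends on the installed pandas version.

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption): two strings and one missing value.
df = pd.DataFrame(["a", "b", np.nan])

# Cast straight to str.
direct = df.astype(str)

# Cast to category first, then to str.
via_category = df.astype("category").astype(str)

# The test expects these two frames to be equal. The values printed here
# depend on whether the installed pandas includes the fix.
print(direct.equals(via_category))
print(repr(direct.iloc[2, 0]), repr(via_category.iloc[2, 0]))
```

Printing the repr of the third row makes it easy to see which missing value indicator each path ends up with.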
got it, can you add a comment (like we do below), also prob should define _dtype_str (and _dtype_object) at the module level and use here (can be followup)
Added the comment and moved the creation of the dtypes to the module level
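A minimal sketch of what module-level dtype constants could look like, assuming the names suggested in the review (_dtype_str, _dtype_object). The helper is_numpy_str_dtype is hypothetical and exists only to show the comparison; it is not the code in pandas/core/dtypes/missing.py.

```python
import numpy as np

# Created once at import time instead of on every call.
# Names follow the reviewer's suggestion; the actual pandas names may differ.
_dtype_object = np.dtype(object)
_dtype_str = np.dtype(str)


def is_numpy_str_dtype(dtype: np.dtype) -> bool:
    """Hypothetical helper: True if dtype is the generic numpy str dtype."""
    return dtype == _dtype_str


print(is_numpy_str_dtype(np.dtype(str)))
print(is_numpy_str_dtype(_dtype_object))
```

Building the dtype objects once avoids constructing np.dtype(str) again each time the branch is evaluated.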
also should this add onto another whatsnew note? or is this separate?
This is a follow-up to #44930 and goes with the whatsnew there
nice!