Skip to content

Set correct missing value indicator in astype for categorical #45012

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Dec 28, 2021

Conversation

phofl
Copy link
Member

@phofl phofl commented Dec 22, 2021

  • tests added / passed
  • Ensure all linting tests pass, see here for how to run them

@phofl phofl added Categorical Categorical Data Type Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Dec 22, 2021
@phofl
Copy link
Member Author

phofl commented Dec 22, 2021

The failure comes from:

np.nan as float is considered as valid na type for str dtype. I think this should return False instead of True?

@phofl
Copy link
Member Author

phofl commented Dec 22, 2021

Question is if we want to support np.str in the first place...

@@ -640,6 +640,9 @@ def is_valid_na_for_dtype(obj, dtype: DtypeObj) -> bool:
# Numeric
return obj is not NaT and not isinstance(obj, (np.datetime64, np.timedelta64))

elif dtype == np.dtype(str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this about?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do any tests rely on this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, unfortunately

FAILED pandas/tests/frame/methods/test_astype.py::TestAstypeCategorical::test_astype_categorical_to_string_missing

It checks if df.astype(str) is equal to df.astype("category").astype(str)

If we do not check this here, the float np.nan is considered as a correct missing value for numpy string dtypes instead of the object np.nan, which is used when df.astype(str) is called immediately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, can you add a comment (like we do below), also prob should define _dtype_str (and _dtype_object) at the module level and use here (can be followup)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the comment and moved the creation of the dtypes to the module level

@@ -640,6 +640,9 @@ def is_valid_na_for_dtype(obj, dtype: DtypeObj) -> bool:
# Numeric
return obj is not NaT and not isinstance(obj, (np.datetime64, np.timedelta64))

elif dtype == np.dtype(str):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, can you add a comment (like we do below), also prob should define _dtype_str (and _dtype_object) at the module level and use here (can be followup)

@jreback jreback added this to the 1.4 milestone Dec 28, 2021
@jreback
Copy link
Contributor

jreback commented Dec 28, 2021

also this should add onto another whatsnew note? or this is separate?

@phofl
Copy link
Member Author

phofl commented Dec 28, 2021

This is a follow up of #44930 and goes with the whatsnews there

@jreback jreback merged commit 3aed81c into pandas-dev:master Dec 28, 2021
@jreback
Copy link
Contributor

jreback commented Dec 28, 2021

nice!

@phofl phofl deleted the cln_na_value_astype_cat branch December 28, 2021 22:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants