BUG: Fix is_categorical_dtype for Sparse[category] #35797
pls perf test - this is a hot spot
Ran the categoricals.py asvs and somehow performance improved (?):
mypy error seems wrong, dtype.subtype comes after a hasattr(dtype, "subtype") check:
Any ideas on how best to fix this? We could explicitly ignore it, or maybe use something like isinstance(getattr(dtype, "subtype", None), cls).
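The two options floated in this comment can be compared in a minimal sketch. `CategoricalLike` and `SparseLike` are hypothetical stand-ins, not pandas classes, and the claim that the checker does not narrow `object` after `hasattr` reflects mypy's behavior at the time of this discussion (see the linked mypy issue); newer versions may differ.

```python
class CategoricalLike:
    """Hypothetical stand-in for a categorical dtype class."""


class SparseLike:
    """Hypothetical stand-in for a wrapper dtype exposing ``subtype``."""

    def __init__(self, subtype: object) -> None:
        self.subtype = subtype


def wraps_categorical_hasattr(dtype: object) -> bool:
    # Correct at runtime, but a checker that does not narrow on hasattr()
    # still sees ``dtype`` as plain ``object`` and may flag ``.subtype``,
    # hence the explicit ignore.
    return hasattr(dtype, "subtype") and isinstance(
        dtype.subtype, CategoricalLike  # type: ignore[attr-defined]
    )


def wraps_categorical_getattr(dtype: object) -> bool:
    # getattr() with a default avoids the attribute access on ``object``,
    # so no ignore comment is needed.
    return isinstance(getattr(dtype, "subtype", None), CategoricalLike)
```

Both functions return the same result for every input; the `getattr` form simply moves the "does the attribute exist" question into a call the type checker already understands.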
Is this change actually something we necessarily want?
For example, one use case for checking that something is a categorical is to know that you can then use certain APIs specific to categoricals. But if a Sparse type also returns True, that will no longer hold.
Hmm, interesting point. I think the argument from the original issue seems valid though, that if Sparse[int] is an int, then Sparse[categorical] should be a categorical (making something sparse shouldn't change that). I suppose adopting that view would mean that there's really less of an API specific to categoricals, so some special casing might have to happen as a result. I can see the point though, that this could just make things more confusing in other ways.
Yes, but as I noted on the issue: if there is an inconsistency, there are always two ways this can be solved. Meaning: we can also change Sparse[int] to no longer be int, or at least check what the use cases are for it being considered int (there quite possibly are, but I think we should try to be concrete first before deciding how to solve the inconsistency)
The more I think about it, the more this seems like the better option over changing this for categorical; is_categorical_dtype loses most of its usefulness if it doesn't guarantee access to the API specific to categoricals. So then, to be consistent, sparse dtypes should really only register as sparse (not int, float, whatever)? I would be curious whether that's actually easier internally, or what might break if it were changed (or whether you even bother changing it at all).
see python/mypy#1424 use |
@jorisvandenbossche One possible counterpoint is that is_categorical_dtype already doesn't guarantee access to the categorical API:
[ins] In [1]: import pandas as pd
[ins] In [2]: from pandas.core.dtypes.common import is_categorical_dtype
[ins] In [3]: s = pd.Series(pd.Categorical(["a", "b", "c"]))
[ins] In [4]: is_categorical_dtype(s)
Out[4]: True
[ins] In [5]: s.codes
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-5-c9d8be9c51ce> in <module>
----> 1 s.codes
~/opt/miniconda3/envs/pandas-1.1.2/lib/python3.8/site-packages/pandas/core/generic.py in __getattr__(self, name)
5134 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5135 return self[name]
-> 5136 return object.__getattribute__(self, name)
5137
5138 def __setattr__(self, name: str, value) -> None:
AttributeError: 'Series' object has no attribute 'codes'
That assumption caused a bug here: #36383
@@ -287,6 +287,10 @@ def is_dtype(cls, dtype: object) -> bool:
            return False
        elif isinstance(dtype, cls):
            return True
        elif hasattr(dtype, "subtype") and isinstance(
Do we really need to add this? This is a perf-sensitive piece.
It's needed for test_is_categorical_sparse_categorical to pass; I'm not sure if there's another way to make this work. I ran the dtype asvs and didn't see a degradation.
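To make the shape of the added branch concrete, here is a toy model of the check, not the pandas implementation. `ToyCategoricalDtype` and `ToySparseDtype` are hypothetical names, and the string branch is a simplification introduced for this sketch. It illustrates why inputs that match an earlier branch never reach the new one; whether the extra branch costs anything measurable in the real code path is what the asv run has to answer.

```python
class ToyCategoricalDtype:
    """Hypothetical stand-in; not the pandas implementation."""

    name = "category"

    @classmethod
    def is_dtype(cls, dtype: object) -> bool:
        if isinstance(dtype, str):
            # Cheap string comparison first (simplified for this sketch).
            return dtype == cls.name
        if isinstance(dtype, cls):
            # Direct instances return here and never reach the new branch.
            return True
        # Proposed extra branch: unwrap a dtype that carries a ``subtype``.
        return isinstance(getattr(dtype, "subtype", None), cls)


class ToySparseDtype:
    """Hypothetical wrapper dtype that stores the wrapped dtype as ``subtype``."""

    def __init__(self, subtype: object) -> None:
        self.subtype = subtype
```

Only inputs that are neither strings nor direct instances fall through to the `subtype` lookup, so the added work in this sketch is limited to the cases that previously returned False.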
code change looks ok. but I think @jorisvandenbossche is right here; if |
But it does guarantee that on a Series
Also, do we actually support Sparse[category] properly?
Some time ago I opened a general issue about the question to support Sparse backed by other ExtensionArrays or not: #26407
Co-authored-by: Joris Van den Bossche <[email protected]>
Closing since it seems there isn't a desire to change this
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff