Skip to content

BUG: Fix is_categorical_dtype for Sparse[category] #35797

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 12 commits into from
Closed

BUG: Fix is_categorical_dtype for Sparse[category] #35797

wants to merge 12 commits into from

Conversation

dsaxton
Copy link
Member

@dsaxton dsaxton commented Aug 19, 2020

@dsaxton dsaxton changed the title Fix is_categorical_dtype for categorical SparseArray Fix is_categorical_dtype for Sparse[category] Aug 19, 2020
@dsaxton dsaxton added Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions Sparse Sparse Data Type labels Aug 19, 2020
@dsaxton dsaxton added this to the 1.2 milestone Aug 19, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls perf test - this is a hot spot

@dsaxton
Copy link
Member Author

dsaxton commented Aug 19, 2020

Ran the categoricals.py asvs and somehow performance improved (?):

       before           after         ratio
     [97ec0627]       [b52a7eab]
     <master>         <sparse-categorical-dtype>
-      10.0±0.3ms       9.05±0.2ms     0.90  categoricals.Isin.time_isin_categorical('int64')
-      13.5±0.2ms         11.9±1ms     0.88  categoricals.Isin.time_isin_categorical('object')
-        5.75±2μs       4.00±0.1μs     0.69  categoricals.Indexing.time_get_loc

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@dsaxton
Copy link
Member Author

dsaxton commented Aug 19, 2020

mypy error seems wrong, dtype.subtype comes after a hasattr(dtype, "subtype") check:

pandas/core/dtypes/base.py:292: error: "object" has no attribute "subtype"  [attr-defined]

Any ideas on how best to fix? Could explicitly ignore or maybe use something like isinstance(getattr(dtype, "subtype", None), cls)

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change actually something we necessarily want?

For example one use case where you might check that something is a categorical, is to know that you can then use certain APIs specific to categoricals. But if a Sparse type also gives true, that won't longer be true.

@dsaxton
Copy link
Member Author

dsaxton commented Aug 19, 2020

Is this change actually something we necessarily want?

For example one use case where you might check that something is a categorical, is to know that you can then use certain APIs specific to categoricals. But if a Sparse type also gives true, that won't longer be true.

Hmm, interesting point. I think the argument from the original issue seems valid though, that if Sparse[int] is an int, then Sparse[categorical] should be a categorical (making something sparse shouldn't change that). I suppose adopting that view would mean that there's really less of an API specific to categoricals, so some special casing might have to happen as a result.

I can see the point though, that this could just make things more confusing in other ways.

@jorisvandenbossche
Copy link
Member

Yes, but as I noted on the issue: if there is an inconsistency, there are always two ways this can be solved. Meaning: we can also change Sparse[int] to no longer be int, or at least check what the use cases are for it being considered int (there quite possibly are, but I think we should try to be concrete first before deciding how to solve the inconsistency)

@dsaxton
Copy link
Member Author

dsaxton commented Aug 20, 2020

Meaning: we can also change Sparse[int] to no longer be int, or at least check what the use cases are for it being considered int (there quite possibly are, but I think we should try to be concrete first before deciding how to solve the inconsistency)

The more I think about it the more this seems like the better option over changing this for categorical; is_categorical_dtype loses most of it's usefulness if it doesn't guarantee access to the API specific to categoricals.

So then to be consistent sparse dtypes should really only register as sparse (not int, float, whatever)? I would be curious if that's actually easier internally, or what that might break if changed (or if you even bother changing it at all).

@simonjayhawkins
Copy link
Member

mypy error seems wrong, dtype.subtype comes after a hasattr(dtype, "subtype") check:

pandas/core/dtypes/base.py:292: error: "object" has no attribute "subtype"  [attr-defined]

Any ideas on how best to fix? Could explicitly ignore or maybe use something like isinstance(getattr(dtype, "subtype", None), cls)

see python/mypy#1424

use #type: ignore and add a comment referencing the mypy issue

@dsaxton dsaxton removed this from the 1.2 milestone Aug 22, 2020
@dsaxton
Copy link
Member Author

dsaxton commented Sep 15, 2020

@jorisvandenbossche One possible counterpoint is that is_categorical_dtype already doesn't guarantee access to the categorical API:

[ins] In [1]: import pandas as pd

[ins] In [2]: from pandas.core.dtypes.common import is_categorical_dtype

[ins] In [3]: s = pd.Series(pd.Categorical(["a", "b", "c"]))

[ins] In [4]: is_categorical_dtype(s)
Out[4]: True

[ins] In [5]: s.codes
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-c9d8be9c51ce> in <module>
----> 1 s.codes

~/opt/miniconda3/envs/pandas-1.1.2/lib/python3.8/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5134             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5135                 return self[name]
-> 5136             return object.__getattribute__(self, name)
   5137
   5138     def __setattr__(self, name: str, value) -> None:

AttributeError: 'Series' object has no attribute 'codes'

That assumption caused a bug here: #36383

@pep8speaks
Copy link

pep8speaks commented Sep 15, 2020

Hello @dsaxton! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-09-19 13:01:13 UTC

@@ -287,6 +287,10 @@ def is_dtype(cls, dtype: object) -> bool:
return False
elif isinstance(dtype, cls):
return True
elif hasattr(dtype, "subtype") and isinstance(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need to add this, this is a perf sensitive piece.

Copy link
Member Author

@dsaxton dsaxton Sep 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's needed for test_is_categorical_sparse_categorical to pass; I'm not sure if there's another way to make this work. I ran the dtype asvs and didn't see a degradation.

@dsaxton dsaxton changed the title Fix is_categorical_dtype for Sparse[category] BUG: Fix is_categorical_dtype for Sparse[category] Sep 16, 2020
@jreback jreback added this to the 1.2 milestone Sep 19, 2020
@jreback
Copy link
Contributor

jreback commented Sep 19, 2020

code change looks ok. but I think @jorisvandenbossche is right here; if is_categorical_dtype is true then this should act like a categorical, which we currently don't do.

@jreback jreback removed this from the 1.2 milestone Sep 19, 2020
@jorisvandenbossche
Copy link
Member

One possible counterpoint is that is_categorical_dtype already doesn't guarantee access to the categorical API:

But it does guarantee that on a Series .cat.codes is available (it's true you still need to know if it's a Series[Categorical] or Categorical).

Also, do we actually support Sparse[category] properly?
Eg I just took the example from the test you added, and the repr fails:

In [1]: s = pd.Series(["a", "b", "c"], dtype=pd.SparseDtype(pd.CategoricalDtype(["a", "b", "c"])))                                                                                                                 

In [2]: s                                                                                                                                                                                                          
...
ValueError: object __array__ method not producing an array

Some time ago I opened a general issue about the question to support Sparse backed by other ExtensionArrays or not: #26407

Co-authored-by: Joris Van den Bossche <[email protected]>
@dsaxton
Copy link
Member Author

dsaxton commented Sep 19, 2020

Closing since it seems there isn't a desire to change this

@dsaxton dsaxton closed this Sep 19, 2020
@dsaxton dsaxton deleted the sparse-categorical-dtype branch September 19, 2020 16:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions Sparse Sparse Data Type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: is_categorical_dtype returns False for Sparse[category, nan]
6 participants