BUG: Fix is_categorical_dtype for Sparse[category] #35797

dsaxton · 2020-08-19T00:45:44Z

closes BUG: is_categorical_dtype returns False for Sparse[category, nan] #35793
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

jreback

pls perf test - this is a hot spot

pandas/core/dtypes/common.py

dsaxton · 2020-08-19T02:35:42Z

Ran the categoricals.py asvs and somehow performance improved (?):

       before           after         ratio
     [97ec0627]       [b52a7eab]
     <master>         <sparse-categorical-dtype>
-      10.0±0.3ms       9.05±0.2ms     0.90  categoricals.Isin.time_isin_categorical('int64')
-      13.5±0.2ms         11.9±1ms     0.88  categoricals.Isin.time_isin_categorical('object')
-        5.75±2μs       4.00±0.1μs     0.69  categoricals.Indexing.time_get_loc

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
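
For anyone reading the asv table above: the ratio column is the "after" time divided by the "before" time, so a value below 1 means the branch ran faster. Below is a small sanity check using the rounded figures as printed. The `rows` dictionary and `ratio` helper are illustrative names only, and asv computes the ratio from unrounded timings, so the last digit may differ slightly.

```python
# Rounded timings copied from the asv table above:
# benchmark name -> (before, after, ratio reported by asv)
rows = {
    "categoricals.Isin.time_isin_categorical('int64')": (10.0, 9.05, 0.90),   # ms
    "categoricals.Isin.time_isin_categorical('object')": (13.5, 11.9, 0.88),  # ms
    "categoricals.Indexing.time_get_loc": (5.75, 4.00, 0.69),                 # μs
}


def ratio(before: float, after: float) -> float:
    """Return after / before; values below 1.0 indicate a speed-up."""
    return after / before


for name, (before, after, reported) in rows.items():
    print(f"{name}: computed {ratio(before, after):.3f}, reported {reported:.2f}")
```

Note the error bars in the table (for example 5.75±2μs), which suggest the smallest benchmark is noisy enough that the improvement may not be meaningful.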

dsaxton · 2020-08-19T02:51:56Z

mypy error seems wrong, dtype.subtype comes after a hasattr(dtype, "subtype") check:

pandas/core/dtypes/base.py:292: error: "object" has no attribute "subtype"  [attr-defined]

Any ideas on how best to fix? Could explicitly ignore or maybe use something like isinstance(getattr(dtype, "subtype", None), cls)
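
A minimal sketch of the getattr alternative mentioned above. `ToyCategoricalDtype`, `ToySparseDtype`, and `matches_or_wraps` are hypothetical stand-ins, not pandas classes or functions. Since `getattr` with a default does not require the attribute to be declared on `object`, this form should type-check without an ignore comment.

```python
class ToyCategoricalDtype:
    """Hypothetical stand-in for a categorical dtype."""


class ToySparseDtype:
    """Hypothetical stand-in for a sparse dtype that wraps another dtype."""

    def __init__(self, subtype: object) -> None:
        self.subtype = subtype


def matches_or_wraps(dtype: object, cls: type) -> bool:
    """True if dtype is an instance of cls, or has a subtype that is."""
    if isinstance(dtype, cls):
        return True
    # getattr returns None when there is no subtype attribute,
    # and isinstance(None, cls) is False for these classes.
    return isinstance(getattr(dtype, "subtype", None), cls)


print(matches_or_wraps(ToySparseDtype(ToyCategoricalDtype()), ToyCategoricalDtype))
print(matches_or_wraps(ToySparseDtype(int), ToyCategoricalDtype))
```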

pandas/core/dtypes/base.py

jorisvandenbossche

Is this change actually something we necessarily want?

For example, one reason you might check that something is a categorical is to know that you can then use certain APIs specific to categoricals. But if a Sparse type also gives true, that will no longer be true.

dsaxton · 2020-08-19T15:17:58Z

> Is this change actually something we necessarily want?
>
> For example, one reason you might check that something is a categorical is to know that you can then use certain APIs specific to categoricals. But if a Sparse type also gives true, that will no longer be true.

Hmm, interesting point. I think the argument from the original issue seems valid though, that if Sparse[int] is an int, then Sparse[categorical] should be a categorical (making something sparse shouldn't change that). I suppose adopting that view would mean that there's really less of an API specific to categoricals, so some special casing might have to happen as a result.

I can see the point though, that this could just make things more confusing in other ways.

jorisvandenbossche · 2020-08-20T07:29:12Z

Yes, but as I noted on the issue: if there is an inconsistency, there are always two ways this can be solved. Meaning: we can also change Sparse[int] to no longer be int, or at least check what the use cases are for it being considered int (there quite possibly are, but I think we should try to be concrete first before deciding how to solve the inconsistency)

dsaxton · 2020-08-20T21:19:41Z

> Meaning: we can also change Sparse[int] to no longer be int, or at least check what the use cases are for it being considered int (there quite possibly are, but I think we should try to be concrete first before deciding how to solve the inconsistency)

The more I think about it, the more this seems like the better option over changing this for categorical; is_categorical_dtype loses most of its usefulness if it doesn't guarantee access to the API specific to categoricals.

So then, to be consistent, sparse dtypes should really only register as sparse (not int, float, whatever)? I would be curious whether that's actually easier internally, what it might break if changed, and whether it's worth changing at all.

simonjayhawkins · 2020-08-21T12:48:34Z

> mypy error seems wrong, dtype.subtype comes after a hasattr(dtype, "subtype") check:
> pandas/core/dtypes/base.py:292: error: "object" has no attribute "subtype"  [attr-defined]
> Any ideas on how best to fix? Could explicitly ignore or maybe use something like isinstance(getattr(dtype, "subtype", None), cls)

see python/mypy#1424

use # type: ignore and add a comment referencing the mypy issue
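
A sketch of what that suggestion could look like, using a hypothetical `Wrapper` class and `wraps_instance_of` function rather than the actual code in pandas/core/dtypes/base.py. The error code in the ignore comment matches the one mypy reported earlier in this thread; whether the ignore is still needed may depend on the mypy version in use.

```python
class Wrapper:
    """Hypothetical stand-in for a dtype that wraps another dtype."""

    def __init__(self, subtype: object) -> None:
        self.subtype = subtype


def wraps_instance_of(dtype: object, cls: type) -> bool:
    """True if dtype has a subtype attribute that is an instance of cls."""
    # mypy does not narrow `object` after hasattr(); see python/mypy#1424
    return hasattr(dtype, "subtype") and isinstance(
        dtype.subtype, cls  # type: ignore[attr-defined]
    )


print(wraps_instance_of(Wrapper("abc"), str))
print(wraps_instance_of("abc", str))
```

Compared with the getattr form, this keeps the runtime check identical to the line mypy complained about and documents why the ignore is there.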

dsaxton · 2020-09-15T19:04:13Z

@jorisvandenbossche One possible counterpoint is that is_categorical_dtype already doesn't guarantee access to the categorical API:

In [1]: import pandas as pd

In [2]: from pandas.core.dtypes.common import is_categorical_dtype

In [3]: s = pd.Series(pd.Categorical(["a", "b", "c"]))

In [4]: is_categorical_dtype(s)
Out[4]: True

In [5]: s.codes
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-c9d8be9c51ce> in <module>
----> 1 s.codes

~/opt/miniconda3/envs/pandas-1.1.2/lib/python3.8/site-packages/pandas/core/generic.py in __getattr__(self, name)
   5134             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5135                 return self[name]
-> 5136             return object.__getattribute__(self, name)
   5137
   5138     def __setattr__(self, name: str, value) -> None:

AttributeError: 'Series' object has no attribute 'codes'

That assumption caused a bug here: #36383

pep8speaks · 2020-09-15T19:24:49Z

Hello @dsaxton! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-09-19 13:01:13 UTC

doc/source/whatsnew/v1.2.0.rst

pandas/core/dtypes/common.py

jreback · 2020-09-15T22:39:37Z

pandas/core/dtypes/base.py

@@ -287,6 +287,10 @@ def is_dtype(cls, dtype: object) -> bool:
            return False
        elif isinstance(dtype, cls):
            return True
+        elif hasattr(dtype, "subtype") and isinstance(


Do we really need to add this? This is a perf-sensitive piece.

dsaxton · 2020-09-16

It's needed for test_is_categorical_sparse_categorical to pass; I'm not sure if there's another way to make this work. I ran the dtype asvs and didn't see a degradation.
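
Independently of the asv suite, one rough way to gauge the cost of the extra branch is a timeit micro-benchmark on a simplified model of the check. Everything below is a hypothetical stand-in (`ToyDtype`, `ToySparse`, `is_dtype_before`, `is_dtype_after`, `time_call`); the real `is_dtype` does more work than this, so the numbers would only indicate the relative overhead of a `hasattr` call on the miss path.

```python
import timeit


class ToyDtype:
    """Hypothetical stand-in for the dtype class being matched."""


class ToySparse:
    """Hypothetical stand-in for a sparse dtype wrapping a subtype."""

    def __init__(self, subtype):
        self.subtype = subtype


def is_dtype_before(dtype, cls):
    # Simplified model of the check without the new branch.
    if dtype is None:
        return False
    elif isinstance(dtype, cls):
        return True
    return False


def is_dtype_after(dtype, cls):
    # Same model with the extra subtype branch added.
    if dtype is None:
        return False
    elif isinstance(dtype, cls):
        return True
    elif hasattr(dtype, "subtype") and isinstance(dtype.subtype, cls):
        return True
    return False


def time_call(func, dtype, number=100_000):
    """Best-of-three total time in seconds for `number` calls."""
    return min(timeit.repeat(lambda: func(dtype, ToyDtype), number=number, repeat=3))


if __name__ == "__main__":
    cases = [("hit", ToyDtype()), ("miss", object()), ("wrapped", ToySparse(ToyDtype()))]
    for label, dtype in cases:
        before = time_call(is_dtype_before, dtype)
        after = time_call(is_dtype_after, dtype)
        print(f"{label}: before={before:.4f}s after={after:.4f}s")
```

In this model the "hit" path is unaffected, since the new branch is only reached when the isinstance check fails.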

pandas/tests/dtypes/test_common.py

jreback · 2020-09-19T02:09:55Z

code change looks ok. but I think @jorisvandenbossche is right here; if is_categorical_dtype is true then this should act like a categorical, which we currently don't do.

doc/source/whatsnew/v1.2.0.rst

jorisvandenbossche · 2020-09-19T08:10:58Z

> One possible counterpoint is that is_categorical_dtype already doesn't guarantee access to the categorical API:

But it does guarantee that on a Series .cat.codes is available (it's true you still need to know if it's a Series[Categorical] or Categorical).

Also, do we actually support Sparse[category] properly?
E.g., I just took the example from the test you added, and the repr fails:

In [1]: s = pd.Series(["a", "b", "c"], dtype=pd.SparseDtype(pd.CategoricalDtype(["a", "b", "c"])))

In [2]: s
...
ValueError: object __array__ method not producing an array

Some time ago I opened a general issue about whether or not to support Sparse backed by other ExtensionArrays: #26407
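
On the Series versus Categorical point raised at the start of this comment: a caller that only knows "this is categorical" still has to pick between `.cat.codes` and `.codes`. A duck-typed sketch of that dispatch follows. `categorical_codes` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only mimic the attribute layout described in this thread, not real pandas objects.

```python
from types import SimpleNamespace


def categorical_codes(obj):
    """Return category codes from a Series-like or Categorical-like object."""
    # Series-like objects expose codes through the .cat accessor.
    accessor = getattr(obj, "cat", None)
    if accessor is not None:
        return accessor.codes
    # Categorical-like objects expose codes directly.
    return obj.codes


# Stand-ins that mimic the two attribute layouts (assumptions, not pandas objects).
series_like = SimpleNamespace(cat=SimpleNamespace(codes=[0, 1, 2]))
categorical_like = SimpleNamespace(codes=[0, 1, 2])

print(categorical_codes(series_like))
print(categorical_codes(categorical_like))
```

An object with neither attribute raises AttributeError here, which is the failure mode a Sparse[category] value would presumably hit if is_categorical_dtype returned True for it.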

dsaxton · 2020-09-19T16:39:38Z

Closing since it seems there isn't a desire to change this

dsaxton changed the title ~~Fix is_categorical_dtype for categorical SparseArray~~ Fix is_categorical_dtype for Sparse[category] Aug 19, 2020

Fix is_categorical_dtype for Sparse[category]

a86a7ff

dsaxton added the Categorical, Dtype Conversions, and Sparse labels Aug 19, 2020

dsaxton added this to the 1.2 milestone Aug 19, 2020

Fixup

29241d2

jreback requested changes Aug 19, 2020

pandas/core/dtypes/common.py (outdated)

pandas/core/dtypes/common.py (outdated)

Fixes

b52a7ea

arw2019 reviewed Aug 19, 2020

pandas/core/dtypes/base.py (outdated)

jorisvandenbossche requested changes Aug 19, 2020

dsaxton removed this from the 1.2 milestone Aug 22, 2020

dsaxton added 3 commits September 15, 2020 14:05

Merge remote-tracking branch 'upstream/master' into sparse-categorical-dtype

8d51429

mypy

41d58a0

Type

6c482bc

Lint

582e6e4

jreback requested changes Sep 15, 2020

dsaxton added 3 commits September 15, 2020 18:23

Update

19b0aae

Merge remote-tracking branch 'upstream/master' into sparse-categorical-dtype

1343a4a

Fix

5a031aa

dsaxton changed the title ~~Fix is_categorical_dtype for Sparse[category]~~ BUG: Fix is_categorical_dtype for Sparse[category] Sep 16, 2020

Merge remote-tracking branch 'upstream/master' into sparse-categorical-dtype

a799d73

jreback added this to the 1.2 milestone Sep 19, 2020

jreback removed this from the 1.2 milestone Sep 19, 2020

jorisvandenbossche reviewed Sep 19, 2020

doc/source/whatsnew/v1.2.0.rst (outdated)

Update doc/source/whatsnew/v1.2.0.rst

a4388d3

Co-authored-by: Joris Van den Bossche <[email protected]>

dsaxton closed this Sep 19, 2020

dsaxton deleted the sparse-categorical-dtype branch September 19, 2020 16:39
