np.issubdtype does not work well with pandas #9474

joshthoward · 2017-07-27T14:30:01Z

For example, below I am trying to use np.issubdtype to check if a pandas categorical is a subtype of a np.number. IMO this should definitely be false rather than throwing an error.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-99-4ba76ca46d4a> in <module>()
     32 
---> 33 np.issubdtype(df.dtypes[1], np.number)

/Users/jhoward/homebrew/lib/python2.7/site-packages/numpy/core/numerictypes.pyc in issubdtype(arg1, arg2)
    753     """
    754     if issubclass_(arg2, generic):
--> 755         return issubclass(dtype(arg1).type, arg2)
    756     mro = dtype(arg2).type.mro()
    757     if len(mro) > 1:

TypeError: data type not understood

This issue should be fixed on numpy's side because external types should by default compare as false unless the external package developers have explicitly used the numpy type system.

If this is a reasonable update, I would be glad to fix this myself.

The text was updated successfully, but these errors were encountered:

shoyer · 2017-07-27T15:43:07Z

Please see #5329 and put your proposal in the context of the concerns raised there. Thanks!

eric-wieser · 2017-07-27T15:55:19Z

Is df.dtypes[1] an instance of np.dtype? If not, then expecting this to work is like expecting issubclass("I'm a string", int) to not raise

joshthoward · 2017-07-27T19:25:09Z

In my example, df.dtypes[1] is an instance of pandas.core.dtypes.dtypes.CategoricalDtype, which extends the python built in type. So my argument is not that issubclass("I'm a string", int) should work, but that issubclass(str, int) should work. The former seems to be what is discussed in #5329 while the latter returns false as it should rather than throwing an error.

edit: I understand that the other issue was actually about parsing strings of dtypes, not somehow knowing to compare the type of the object passed. My phrasing was inaccurate.

eric-wieser · 2017-07-27T19:42:41Z

pandas.core.dtypes.dtypes.CategoricalDtype, which extends the python built in type

This doesn't sound correct to me - are you sure it's not an instance of the built in type?

I maintain that the behaviour is consistent with issubclass:

issubclass(x, y) throws an exception unless isinstance(x, type) and isinstance(y, type)
issubdtype(x, y) throws an exception unless isinstance(x, dtype) and isinstance(y, dtype)

The issue here, then, is that pandas.core.dtypes.dtypes.CategoricalDtype does not subclass np.dtype (am I right in saying this is the case?). This is probably because numpy makes that hard, which is something we should work to fix.

joshthoward · 2017-07-27T19:57:28Z

Actually we are both wrong.. From https://github.com/pandas-dev/pandas/blob/master/pandas/core/dtypes/dtypes.py, CategoricalDtype extends ExtensionDtype which is as they say made to mimic a numpy dtype. The type of CategoricalDtype is explicitly set to CategoricalDtypeType which extends from the python built in type.

My issue is still that when using issubdtype to compare a valid built in type with a np.dtype an exception should not be thrown. Or at the very least something other than TypeError. A type error is also thrown when the user does something like issubdtype(1, np.int).

eric-wieser · 2017-07-27T20:19:51Z

to compare a valid build in type with a np.dtype and exception should not be thrown

Sorry, what was your example for this?

eric-wieser · 2017-07-27T20:27:24Z

ExtensionDtype which is as they say made to mimic a numpy dtype

I think the model here is wrong. They shouldn't attempt to me mimicing a dtype, because in numpy there are no dtype subclasses. Instead, they should be mimicing / inheriting from np.generic. Unfortunately, this is currently impossible.

It seems the way forward might be for numpy to provide a np.pygeneric, which is a python-subclassable version of np.generic, and registers a dtype somehow

njsmith · 2017-07-27T21:10:24Z

@Jo8hua: CategoricalDtype objects are not dtype objects, despite the name – they aren't implemented as dtypes, and their semantics are incompatible with dtype semantics (i.e. they're not even "duck dtypes"). That's exactly what the discussion in #5329 is about :-)

@eric-weiser: categoricals just don't fit well into the current dtype design because they're parametric (to have a complete categorical you need the list of levels), and there's no particular reason for them to have a scalar type at all except for the weird way issubdtype works. I think the goal is to refactor how dtypes work so that you actually can subclass dtype to make a new type of dtype.

joshthoward · 2017-07-28T14:23:44Z

@eric-wieser: an example would be issubdtype(str, np.int32) evaluating to false. I don't see this as an error as str is a valid python type, but not a subtype of np.int. I am advocating that issubdtype(x, y) should throw an exception unless isinstance(x, type) and isinstance(y, dtype). I don't know enough about dtype internals to discuss whether they need to be refactored as a whole, but my issue is far removed from that.

I am not a pandas fan, and as so I don't really care about CategoricalDtypes... However, it brings up cases where numpy dtypes are used in conjunction with the built in python type system. If a programmer asks the question "is int a subtype of int32?", the answer is an unambiguous no. The question "is str a subtype of int32?", is also an unambiguous no. The question "is 'Here is a string' a subtype of int32?", is a dumb question that the programmer didn't mean to ask and the answer should be TypeError.

@njsmith: After rereading #5329 I am not seeing how duck dtypes are the point of that discussion.

njsmith · 2017-07-28T14:58:20Z

I am advocating that issubdtype(x, y) should throw an exception unless isinstance(x, type) and isinstance(y, dtype).

This doesn't make any sense, because types and dtypes are totally different things. What issubdtype requires is that both objects should be instances of dtype, or else coerceable into instances of dtype. issubdtype(str, np.int32) is shorthand for issubdtype(dtype(str), dtype(np.int32)). str-the-python-type has almost nothing to do with this – it's just being used as a kind of flag to pick (one of the) the numpy string dtype(s).

joshthoward · 2017-07-28T17:56:58Z

@njsmith Could you avoid your comments such as "This doesn't make any sense...". That is abrasive and wrong. My statement makes perfect sense even though you might not agree that it is a beneficial change to the code.

To answer your rebuttal, it seems like you are using the way that the code currently works to explain why my API change shouldn't be accepted. I don't believe that this is a valid argument because IMO the way the code currently works is not user friendly.

shoyer · 2017-07-28T18:25:10Z

To answer your rebuttal, it seems like you are using the way that the code currently works to explain why my API change shouldn't be accepted

In the context of NumPy, this is actually totally valid. We take backwards compatibility very seriously.

I agree with @njsmith that your proposal "issubdtype(x, y) should throw an exception unless isinstance(x, type) and isinstance(y, dtype)" is a non-starter. It would be a reasonable proposal if we were starting from scratch, but this function has always coerced its arguments to np.dtype objects and this is clearly documented. I don't see how we could possibly change this without breaking lots of code.

If you're interested in a function that works in the way you describe it will need to have a new name. np.issubdtype is taken, for better or worse.

njsmith · 2017-07-28T18:25:59Z

I'm sorry for my tone there; I was responding in haste.

What I mean though is that this is the same as the issubclass("string", int) case, which (as you point out) doesn't make sense as a piece of code because strings are not the type of thing that can have a subclass relationship. Similarly, types are not the sort of thing that can have a subdtype relationship; only dtypes can have that.

eric-wieser · 2017-07-28T18:27:07Z

Really though, issubdtype is just a thin (and broken #9480) wrapper around issubclass that does some dtype coercion. If you don't want the coercion, then just use issubclass(some_type, some_dtype.type).

eric-wieser · 2017-08-01T14:39:44Z

Hang on a second, this doesn't help:

issubdtype(x, y) should throw an exception unless isinstance(x, type) and isinstance(y, dtype)

In your case, x is not an instance of type, it is an instance of pandas.types.dtypes.CategoricalDtype. So under your proposal, it would continue to error.

iliatimofeev mentioned this issue Jun 9, 2018

Fix tz datetime vega/altair#928

Merged

mattip added 15 - Discussion component: numpy.dtype labels Aug 14, 2018

chrismilleruk mentioned this issue Jan 13, 2021

cont_cat_split does not support pandas types / fastbook 09_tabular.ipynb fails fastai/fastai#3156

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

np.issubdtype does not work well with pandas #9474

np.issubdtype does not work well with pandas #9474

joshthoward commented Jul 27, 2017

shoyer commented Jul 27, 2017

eric-wieser commented Jul 27, 2017

joshthoward commented Jul 27, 2017 •

edited

Loading

eric-wieser commented Jul 27, 2017 •

edited

Loading

joshthoward commented Jul 27, 2017 •

edited

Loading

eric-wieser commented Jul 27, 2017

eric-wieser commented Jul 27, 2017

njsmith commented Jul 27, 2017

joshthoward commented Jul 28, 2017

njsmith commented Jul 28, 2017

joshthoward commented Jul 28, 2017

shoyer commented Jul 28, 2017

njsmith commented Jul 28, 2017

eric-wieser commented Jul 28, 2017

eric-wieser commented Aug 1, 2017 •

edited

Loading

np.issubdtype does not work well with pandas #9474

np.issubdtype does not work well with pandas #9474

Comments

joshthoward commented Jul 27, 2017

shoyer commented Jul 27, 2017

eric-wieser commented Jul 27, 2017

joshthoward commented Jul 27, 2017 • edited Loading

eric-wieser commented Jul 27, 2017 • edited Loading

joshthoward commented Jul 27, 2017 • edited Loading

eric-wieser commented Jul 27, 2017

eric-wieser commented Jul 27, 2017

njsmith commented Jul 27, 2017

joshthoward commented Jul 28, 2017

njsmith commented Jul 28, 2017

joshthoward commented Jul 28, 2017

shoyer commented Jul 28, 2017

njsmith commented Jul 28, 2017

eric-wieser commented Jul 28, 2017

eric-wieser commented Aug 1, 2017 • edited Loading

joshthoward commented Jul 27, 2017 •

edited

Loading

eric-wieser commented Jul 27, 2017 •

edited

Loading

joshthoward commented Jul 27, 2017 •

edited

Loading

eric-wieser commented Aug 1, 2017 •

edited

Loading