Skip to content

np.issubdtype does not work well with pandas #9474

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
joshthoward opened this issue Jul 27, 2017 · 15 comments
Open

np.issubdtype does not work well with pandas #9474

joshthoward opened this issue Jul 27, 2017 · 15 comments

Comments

@joshthoward
Copy link

For example, below I am trying to use np.issubdtype to check if a pandas categorical is a subtype of a np.number. IMO this should definitely be false rather than throwing an error.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-99-4ba76ca46d4a> in <module>()
     32 
---> 33 np.issubdtype(df.dtypes[1], np.number)

/Users/jhoward/homebrew/lib/python2.7/site-packages/numpy/core/numerictypes.pyc in issubdtype(arg1, arg2)
    753     """
    754     if issubclass_(arg2, generic):
--> 755         return issubclass(dtype(arg1).type, arg2)
    756     mro = dtype(arg2).type.mro()
    757     if len(mro) > 1:

TypeError: data type not understood

This issue should be fixed on numpy's side because external types should by default compare as false unless the external package developers have explicitly used the numpy type system.

If this is a reasonable update, I would be glad to fix this myself.

@shoyer
Copy link
Member

shoyer commented Jul 27, 2017

Please see #5329 and put your proposal in the context of the concerns raised there. Thanks!

@eric-wieser
Copy link
Member

Is df.dtypes[1] an instance of np.dtype? If not, then expecting this to work is like expecting issubclass("I'm a string", int) to not raise

@joshthoward
Copy link
Author

joshthoward commented Jul 27, 2017

In my example, df.dtypes[1] is an instance of pandas.core.dtypes.dtypes.CategoricalDtype, which extends the python built in type. So my argument is not that issubclass("I'm a string", int) should work, but that issubclass(str, int) should work. The former seems to be what is discussed in #5329 while the latter returns false as it should rather than throwing an error.

edit: I understand that the other issue was actually about parsing strings of dtypes, not somehow knowing to compare the type of the object passed. My phrasing was inaccurate.

@eric-wieser
Copy link
Member

eric-wieser commented Jul 27, 2017

pandas.core.dtypes.dtypes.CategoricalDtype, which extends the python built in type

This doesn't sound correct to me - are you sure it's not an instance of the built in type?


I maintain that the behaviour is consistent with issubclass:

  • issubclass(x, y) throws an exception unless isinstance(x, type) and isinstance(y, type)
  • issubdtype(x, y) throws an exception unless isinstance(x, dtype) and isinstance(y, dtype)

The issue here, then, is that pandas.core.dtypes.dtypes.CategoricalDtype does not subclass np.dtype (am I right in saying this is the case?). This is probably because numpy makes that hard, which is something we should work to fix.

@joshthoward
Copy link
Author

joshthoward commented Jul 27, 2017

Actually we are both wrong.. From https://github.com/pandas-dev/pandas/blob/master/pandas/core/dtypes/dtypes.py, CategoricalDtype extends ExtensionDtype which is as they say made to mimic a numpy dtype. The type of CategoricalDtype is explicitly set to CategoricalDtypeType which extends from the python built in type.

My issue is still that when using issubdtype to compare a valid built in type with a np.dtype an exception should not be thrown. Or at the very least something other than TypeError. A type error is also thrown when the user does something like issubdtype(1, np.int).

@eric-wieser
Copy link
Member

to compare a valid build in type with a np.dtype and exception should not be thrown

Sorry, what was your example for this?

@eric-wieser
Copy link
Member

ExtensionDtype which is as they say made to mimic a numpy dtype

I think the model here is wrong. They shouldn't attempt to me mimicing a dtype, because in numpy there are no dtype subclasses. Instead, they should be mimicing / inheriting from np.generic. Unfortunately, this is currently impossible.

It seems the way forward might be for numpy to provide a np.pygeneric, which is a python-subclassable version of np.generic, and registers a dtype somehow

@njsmith
Copy link
Member

njsmith commented Jul 27, 2017

@Jo8hua: CategoricalDtype objects are not dtype objects, despite the name – they aren't implemented as dtypes, and their semantics are incompatible with dtype semantics (i.e. they're not even "duck dtypes"). That's exactly what the discussion in #5329 is about :-)

@eric-weiser: categoricals just don't fit well into the current dtype design because they're parametric (to have a complete categorical you need the list of levels), and there's no particular reason for them to have a scalar type at all except for the weird way issubdtype works. I think the goal is to refactor how dtypes work so that you actually can subclass dtype to make a new type of dtype.

@joshthoward
Copy link
Author

@eric-wieser: an example would be issubdtype(str, np.int32) evaluating to false. I don't see this as an error as str is a valid python type, but not a subtype of np.int. I am advocating that issubdtype(x, y) should throw an exception unless isinstance(x, type) and isinstance(y, dtype). I don't know enough about dtype internals to discuss whether they need to be refactored as a whole, but my issue is far removed from that.

I am not a pandas fan, and as so I don't really care about CategoricalDtypes... However, it brings up cases where numpy dtypes are used in conjunction with the built in python type system. If a programmer asks the question "is int a subtype of int32?", the answer is an unambiguous no. The question "is str a subtype of int32?", is also an unambiguous no. The question "is 'Here is a string' a subtype of int32?", is a dumb question that the programmer didn't mean to ask and the answer should be TypeError.

@njsmith: After rereading #5329 I am not seeing how duck dtypes are the point of that discussion.

@njsmith
Copy link
Member

njsmith commented Jul 28, 2017

I am advocating that issubdtype(x, y) should throw an exception unless isinstance(x, type) and isinstance(y, dtype).

This doesn't make any sense, because types and dtypes are totally different things. What issubdtype requires is that both objects should be instances of dtype, or else coerceable into instances of dtype. issubdtype(str, np.int32) is shorthand for issubdtype(dtype(str), dtype(np.int32)). str-the-python-type has almost nothing to do with this – it's just being used as a kind of flag to pick (one of the) the numpy string dtype(s).

@joshthoward
Copy link
Author

@njsmith Could you avoid your comments such as "This doesn't make any sense...". That is abrasive and wrong. My statement makes perfect sense even though you might not agree that it is a beneficial change to the code.

To answer your rebuttal, it seems like you are using the way that the code currently works to explain why my API change shouldn't be accepted. I don't believe that this is a valid argument because IMO the way the code currently works is not user friendly.

@shoyer
Copy link
Member

shoyer commented Jul 28, 2017

To answer your rebuttal, it seems like you are using the way that the code currently works to explain why my API change shouldn't be accepted

In the context of NumPy, this is actually totally valid. We take backwards compatibility very seriously.

I agree with @njsmith that your proposal "issubdtype(x, y) should throw an exception unless isinstance(x, type) and isinstance(y, dtype)" is a non-starter. It would be a reasonable proposal if we were starting from scratch, but this function has always coerced its arguments to np.dtype objects and this is clearly documented. I don't see how we could possibly change this without breaking lots of code.

If you're interested in a function that works in the way you describe it will need to have a new name. np.issubdtype is taken, for better or worse.

@njsmith
Copy link
Member

njsmith commented Jul 28, 2017

I'm sorry for my tone there; I was responding in haste.

What I mean though is that this is the same as the issubclass("string", int) case, which (as you point out) doesn't make sense as a piece of code because strings are not the type of thing that can have a subclass relationship. Similarly, types are not the sort of thing that can have a subdtype relationship; only dtypes can have that.

@eric-wieser
Copy link
Member

Really though, issubdtype is just a thin (and broken #9480) wrapper around issubclass that does some dtype coercion. If you don't want the coercion, then just use issubclass(some_type, some_dtype.type).

@eric-wieser
Copy link
Member

eric-wieser commented Aug 1, 2017

Hang on a second, this doesn't help:

issubdtype(x, y) should throw an exception unless isinstance(x, type) and isinstance(y, dtype)

In your case, x is not an instance of type, it is an instance of pandas.types.dtypes.CategoricalDtype. So under your proposal, it would continue to error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants