
TRACKING: Review and possibly address changes regarding new dtypes #16624

Open
@seberg

Description

More such changes will inevitably be added over time, but there are a few changes that come with my array-coercion branch right now, and just as many will come with other things later. These are things that we should review before 1.20 to decide on the exact course of action.

  1. np.array([np.float64(np.inf)], dtype=np.int64), as opposed to np.array(np.float64(np.inf), dtype=np.int64), used to use float(val) to assign the item, which meant that bad float values (including out-of-bounds ones) would lead to errors. The new code uses the normal casting logic in all cases, which relies on C casting and typically ends up at the minimum integer. The ideal solution is probably to also add warnings/errors for casting in general, although we still need to decide how best to do that. (See sketch 1 after this list.)

  2. The truthiness (and casting) of strings to bools is badly defined right now; see also BUG: truthiness of strings is arbitrary, context-dependent, and inconsistent #9875. The array-coercion changes align some of these cases, but not necessarily for the better. We should ensure that the new behaviour is no worse than the old, and generally push forward with fixing the situation. (The question is how slowly we have to take it.) See sketch 2 after this list.

  3. The dtype discovery (mainly with respect to string length) is now improved. This means that object arrays being coerced are always inspected correctly. At the same time, the string length is now consistent, but for numpy scalars (e.g. a float64) it is always the length that normal casting would produce. This only affects numpy scalars, and only array coercion. There is probably no need to do anything here: the behaviour was never consistent, and the new behaviour errs on the safe side and is generally better. (See sketch 3 after this list.)

  4. Not an issue as such (low priority): the coercion cache is also used in some cases where the array is not coerced immediately. This wastes some memory, and the code could be improved to skip caching in those cases.

  5. We should possibly anticipate dtypes which are not conserved during array object creation (my preferred solution is an error). These should only be dtypes with subarrays (and no structure). That should be pretty much impossible, but... See also BUG: Ensure PyArray_FromScalar always returns the requested dtype #15471. (See sketch 4 after this list.)

  6. Consider simply making all DType classes HeapTypes. This would also allow giving them a repr, which may be copy-pastable in the future (if we stick with the square-bracket notation), since it allows a period in the repr, making it: np.dtype[np.float64]. (See sketch 5 after this list.)

  7. Consider implementing dtype.type to look up the class attribute DType.type instead. Right now these must always match, but we must retain the C-side dtype->scalar_type slot for backward compatibility. (See sketch 6 after this list.)

  8. The mapping pytype -> DType is currently strong. In theory this is an issue for supporting dynamically created pytype <-> DType pairs. The solution is to have a weak mapping, but set a pytype.__associated_numpy_dtype__ slot when registering, which should be unproblematic for HeapTypes. For static types everything has to be immortal, but we can allow that path for NumPy itself. Anyone wanting to add a DType for another pytype (with automatic mapping/dtype discovery) has to make sure that NumPy can fill that attribute on the pytype (which is possible also for static types, of course). (See sketch 7 after this list.)
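
Sketch 1, for item 1 (a minimal illustration; the out-of-range result comes from implementation-defined C casting, and newer versions may warn, so don't rely on the exact output):

```python
import numpy as np

# Scalar path: previously went through float(val) and therefore
# raised for non-finite/out-of-bounds values; with the new coercion
# code it uses the same casting logic as the list path below.
try:
    a = np.array(np.float64(np.inf), dtype=np.int64)
except (OverflowError, ValueError) as e:
    print("old behaviour: raised", type(e).__name__)
else:
    print("new behaviour: cast result", a)

# List path: C-level casting, typically ending at the minimum int64
# for out-of-range values (implementation-defined).
print(np.array([np.float64(np.inf)], dtype=np.int64))
```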
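
Sketch 2, for item 2 (probes for the string-truthiness problem of gh-9875; no expected outputs are given, since the results vary by version and code path, which is exactly the inconsistency being tracked):

```python
import numpy as np

def probe(label, fn):
    # Report either the result or the exception; both occur in
    # practice depending on version and code path.
    try:
        print(label, "->", fn())
    except Exception as e:
        print(label, "-> raised", type(e).__name__)

probe("bool(np.array('0'))", lambda: bool(np.array("0")))
probe("np.array(['0'], dtype=bool)", lambda: np.array(["0"], dtype=bool))
probe("np.array(['0']).astype(bool)", lambda: np.array(["0"]).astype(bool))
```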
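
Sketch 3, for item 3 (string-length discovery; under normal casting rules a float64 casts to a 32-character string, so per item 3 a numpy scalar should now discover that same length, while a Python float may not):

```python
import numpy as np

# The length a plain cast produces for float64:
print(np.zeros(1, dtype=np.float64).astype("S").dtype)  # dtype('S32')

# Discovery during coercion: the numpy-scalar case now follows the
# casting rule above; the Python-float case goes through str().
print(np.array([np.float64(0.5)], dtype="S").dtype)
print(np.array([0.5], dtype="S").dtype)
```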
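
Sketch 4, for item 5 (subarray dtypes are the known case of a dtype that is not conserved, because the subarray dimensions are absorbed into the array shape):

```python
import numpy as np

# A subarray dtype without structure:
dt = np.dtype(("f8", (2,)))

# The requested dtype is not conserved: the (2,) is folded into the
# shape and the resulting dtype is plain float64.
a = np.zeros(3, dtype=dt)
print(a.shape)  # (3, 2)
print(a.dtype)  # float64, not the subarray dtype that was requested
```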
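
Sketch 5, for item 6 (the square-bracket notation as it appears in the repr of DType classes on recent development versions; the exact output is illustrative):

```python
import numpy as np

# The class of a dtype instance is a DType class; with the proposed
# notation its repr reads like an expression one could paste back:
print(type(np.dtype(np.float64)))  # e.g. <class 'numpy.dtype[float64]'>
```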
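
Sketch 6, for item 7 (where dtype.type comes from today, and what the proposal would change; nothing here is new API):

```python
import numpy as np

dt = np.dtype(np.float64)

# Today this attribute is served by the C-side dtype->scalar_type
# slot; the proposal is to resolve it via a class attribute on the
# DType class instead, keeping the slot for backward compatibility.
print(dt.type is np.float64)  # True
```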
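
Sketch 7, for item 8 (a hypothetical pure-Python model of the proposed registration scheme; register_dtype, discover_dtype and the __associated_numpy_dtype__ slot are illustrative names taken from the issue text, not actual numpy API):

```python
import weakref

# Keep the pytype -> DType mapping weak so dynamically created
# pairs can be garbage collected together.
_pytype_to_dtype = weakref.WeakKeyDictionary()

def register_dtype(pytype, dtype_class):
    # Pin the association on the pytype itself. Setting an attribute
    # is unproblematic for heap types; static types would need to be
    # immortal (a path that can be reserved for NumPy's own types).
    pytype.__associated_numpy_dtype__ = dtype_class
    _pytype_to_dtype[pytype] = dtype_class

def discover_dtype(obj):
    # Discovery consults the slot first, then the weak mapping.
    dt = getattr(type(obj), "__associated_numpy_dtype__", None)
    if dt is None:
        dt = _pytype_to_dtype.get(type(obj))
    return dt
```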

EDIT: The more important issues have been cleared out. The boolean one is still there, but it will only affect numpy strings, and I think that is fine. (Of course we still need to clean up strings, but it's so strange...)
