
TRACKING: Review and possibly address changes regarding new dtypes #16624


Open
seberg opened this issue Jun 16, 2020 · 2 comments

@seberg
Member

seberg commented Jun 16, 2020

Since more such changes will inevitably be added over time, this issue tracks a few behaviour changes that will happen with my array-coercion branch now, and just as many that will come with other work later. These are things that we should review before 1.20 and decide on the exact course of action.

  1. np.array([np.float64(np.inf)], dtype=np.int64), as opposed to np.array(np.float64(np.inf), dtype=np.int64), used to use float(val) to assign the item, which meant that bad float values (including out-of-bounds ones) would lead to errors. The new code uses the normal casting logic in all cases, which relies on a C-cast and typically ends up at the minimum integer. The ideal solution is probably to add warnings/errors for casting in general as well, although we still need to decide how best to do that. (See the first sketch after this list.)

  2. The truthiness (and casting) of strings to bools is badly defined right now; see also BUG: truthiness of strings is arbitrary, context-dependent, and inconsistent #9875. The array-coercion changes will have aligned some of these cases, but not necessarily for the better. We should ensure that the new behaviour is no worse than the old, and generally push forward with fixing the situation. (The question is how slowly we have to take it.)

  3. Dtype discovery (mainly with respect to string length) is now improved. This means that object arrays being coerced are always inspected correctly. At the same time, the string length is now consistent, but for numpy scalars (e.g. a float64) it is always the length that the normal cast would produce. This only affects numpy scalars and only array coercion. There is probably no need to do anything here: the old behaviour was never consistent, and the new behaviour errs on the safe side and is generally better. (See the second sketch after this list.)

  4. Not an issue as such (low priority): the coercion cache is also used in some cases where the array is not coerced immediately. This wastes some memory, and the code could be improved to allow skipping the caching.

  5. We should possibly anticipate dtypes which are not preserved during array object creation (my preferred solution is an error). These should only be dtypes with subarrays (and no structure). That should be pretty much impossible to hit, but... See also BUG: Ensure PyArray_FromScalar always returns the requested dtype #15471.

  6. Consider simply making all DType classes HeapTypes. This would also allow giving them a repr, which may become copy-pastable in the future (if we stick with the square-bracket notation), since a HeapType allows a period in the repr, making it np.dtype[np.float64].

  7. Consider implementing dtype.type to look up the class attribute DType.type instead. Right now these must always match, but we must retain the C-side dtype->scalar_type slot for backward compatibility.

  8. The mapping pytype -> DType is currently strong. In theory this is an issue for supporting dynamically created pytype <-> DType pairs. The solution is to have a weak mapping but set a pytype.__associated_numpy_dtype__ slot when registering, which should be unproblematic for HeapTypes. For static types everything has to be immortal, but we can allow that path for NumPy itself. Anyone wanting to add a DType for another pytype (with automatic mapping/dtype discovery) has to make sure that NumPy can fill that attribute on the pytype (which is of course also possible for static types).
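
A minimal sketch of the divergence from item 1; the exact exception and the C-cast result are platform- and version-dependent, so the comments are illustrative rather than exact:

```python
import numpy as np

# 0-d case: goes through the casting logic; the C-cast of inf to int64
# typically lands on the minimum integer (platform-dependent).
np.array(np.float64(np.inf), dtype=np.int64)

# 1-d case: historically assigned via float(val), so a bad float value
# raised (e.g. OverflowError for inf). After the array-coercion branch,
# both cases go through the same casting logic.
np.array([np.float64(np.inf)], dtype=np.int64)
```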
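
And a sketch of the string-length consistency from item 3; the '<U32' itemsize for float64 is an assumption based on the casting rule and may differ between versions:

```python
import numpy as np

# Coercing a float64 scalar to string now picks the same itemsize that
# casting a float64 array to str would pick (e.g. dtype('<U32')),
# rather than a length derived from the particular value.
np.array(np.float64(0.5), dtype=str).dtype
np.array([0.5]).astype(str).dtype  # consistent with the scalar case
```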

EDIT: The more important issues have been cleared out. The boolean one is still there, but it will only affect numpy strings, and I think that is fine. (Of course we still need to clean up strings, but it's all so strange...)

@seberg
Member Author

seberg commented Sep 17, 2020

One thing to discuss, maybe, mainly because pandas is having issues with it in pandas-dev/pandas#35481:

Basically, before the updates here, we had the following behaviour (which also affects np.float64(np.nan) and the datetime NaT):

```python
np.array(np.uint64(2**63 + 1), dtype="int64")    # works
np.array([np.uint64(2**63 + 1)], dtype="int64")  # raises an error
```

Current master picks the "works" path in all cases. This makes sense, because the scalars are typed and there is no reason for them to behave differently from an array.
Now, we could always make this an error (for arrays as well), but that may be a big step; I assume we want to warn first, or even go with only a warning to begin with.

On current master we can very easily change this to raise the error in all cases if that helps pandas, but that might also create issues. Retaining both (conflicting) behaviours, though, would probably require a lot of acrobatics.
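
For downstream code that needs the error in both cases today, an explicit range check is one possible workaround. This is only a sketch: np.iinfo is real NumPy API, but checked_int64 is a hypothetical helper, not anything NumPy provides:

```python
import numpy as np

def checked_int64(value):
    # Compare as exact Python ints to avoid uint64/int64 comparison
    # pitfalls, then cast only if the value fits.
    info = np.iinfo(np.int64)
    if not (info.min <= int(value) <= info.max):
        raise OverflowError(f"{value!r} does not fit in int64")
    return np.int64(value)

checked_int64(np.uint64(2**63 + 1))  # raises OverflowError in this sketch
```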

@seberg seberg added and then removed the triage review label Sep 17, 2020
@charris charris removed this from the 1.20.0 release milestone Nov 25, 2020
@charris
Member

charris commented Nov 25, 2020

Cleared the milestone.
