Skip to content

Add specifications for set functions #25

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Sep 14, 2020
Merged

Add specifications for set functions #25

merged 5 commits into from
Sep 14, 2020

Conversation

kgryte
Copy link
Contributor

@kgryte kgryte commented Aug 26, 2020

This PR

  • adds specifications for functions for creating and operating on sets.
  • is derived from comparing API signatures across array libraries.

Notes

  • NumPy includes additional set functions (e.g., in1d, intersect1d, isin, setdiff1d, union1d, setxor1d); however, these functions are not widely implemented by other analyzed array libraries ( see here) and, thus, were not included in this initial specification. Should additional set functions be necessary, they can be proposed in follow-up proposals.

Questions (and notes)

  • unique

    • support for the axis keyword, while present in API signatures, is not common (currently, NumPy, MXNet, and Torch support; while CuPy, Dask, JAX, and TensorFlow do not). Furthermore, downstream usage of the axis keyword is uncommon (e.g., in the record data, we only see one invocation in which axis is specified and that was by SciPy). Accordingly, the proposed spec does not currently accommodate non-flattened multi-dimensional arrays. If a mult-dimensional array is provided, the proposed spec states that implementations should flatten the array before determining unique values. However, the proposed spec does not preclude the axis keyword from being added in a future spec revision.

    • should the unique elements be returned in sorted order (NumPy et al sort by default, while Torch has a sorted keyword and TensorFlow does not support sorting, instead choosing to preserve order of occurrence)? Or should this be implementation defined? Or should there be a keyword argument to require the array containing the unique elements to be sorted?

      An argument for an optional keyword argument is that some implementations may choose an alternative data structure to simultaneously sort while determining unique elements, so pushing sort order userland may be undesirable. Note, however, that if an optional keyword argument indicating whether to sort the output is desired, we'd most likely need to support a direction keyword argument for reasons discussed here.

      An additional argument for returning unsorted unique elements is that preserving the order of occurrence is sometimes desirable (e.g., take the first n unique elements from an array x).

@rgommers
Copy link
Member

Thanks @kgryte. I agree with leaving out axis.

Regarding sorting, that's an annoying one. Even if TensorFlow would add a sorting option, it probably couldn't easily default to sorted=True. The tf.experimental.numpy namespace doesn't have unique yet, maybe that would have sorted return values for compatibility?

NumPy et al. may add sorted=False at some point, I know there's interest in that - the implementation can then be significantly faster.

The sort order is important though, it's not uncommon to use the returned values for plotting for example. And I imagine there's other code as well where it matters. I'd lean towards requiring sorted output, but would be good to get input from @alextp on if this is okay for TensorFlow.

In terms of signatures, PyTorch misses return_index (feature request at pytorch/pytorch#36748). TensorFlow missed all three return_xxx keywords; feature request for return_index at tensorflow/tensorflow#4614 was auto-closed, but people do want it it seems - and given that there's a unique_with_counts already, should be easy to do.

@rgommers
Copy link
Member

@kgryte to move this PR forward in the absence of more feedback now, I propose:

  • add sorted=True keyword
  • open an issue saying that this needs a closer look
  • add a TODO in the specification pointing to the issue

That way we can land it, and get input from a broader group later on.

@kgryte
Copy link
Contributor Author

kgryte commented Sep 14, 2020

@rgommers Added keyword, opened issue, and added a TODO.

Copy link
Member

@rgommers rgommers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, in it goes then.

@rgommers rgommers merged commit f14aebe into master Sep 14, 2020
@rgommers rgommers deleted the set-functions branch September 14, 2020 09:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants