
[Python] Expose the device interface through the Arrow PyCapsule protocol #38325

Closed
@jorisvandenbossche


We added a new protocol exposing the C Data Interface (schema, array and stream) in Python through PyCapsule objects and new dunder methods __arrow_c_schema/array/stream__ (#35531 / #37797).

We recently also expanded the C Data Interface with device capabilities: https://arrow.apache.org/docs/dev/format/CDeviceDataInterface.html (#34972).

The currently merged PyCapsule protocol uses the stable non-device interface, so the question is how to integrate the device version into the protocol in order to expose the C Device Data Interface in Python as well. Some options:

  1. Only support the device versions going forward, i.e. the returned capsules always contain a device array/stream (similar to how only the CPU version is supported at the moment).
    (This is a backwards-incompatible change, but given that we labeled the protocol as experimental, we can still make such changes if we think this is the best long-term option. The capsule names would reflect this change, so a consumer or producer that has not yet been updated would get a proper Python error, and we can first deprecate the non-device support in pyarrow before removing it. All to say that, AFAIU, this is perfectly possible if we want it.)

  2. Add separate dunder methods __arrow_c_device_array__ and __arrow_c_device_stream__, and then it is up to the producer to implement those dunders if they can (and we can strongly recommend doing that, also for CPU-only libraries), and to consumers to check which ones are present.

  3. Allow the consumer to request a device array with some keyword (e.g. __arrow_c_array__(device=True)), which gives the consumer the option to request it while still giving the producer the possibility to raise an error if they don't (yet) support the device version.

  4. Support both options in the current methods without a keyword, i.e. allow __arrow_c_array__ to return either an "arrow_array" or an "arrow_device_array" capsule (the capsule name distinguishes the two), with the recommendation to always return a device version if you can, while still allowing producers to return a CPU version if they don't support the device one. This only gives some flexibility to the producer, and no control to the consumer to request the CPU version (so this essentially expects that all consumers will handle the device version).
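
To make option 2 a bit more concrete, here is a rough sketch of what the consumer-side check could look like. This is only an illustration: `__arrow_c_device_array__` is the dunder proposed above and its signature is assumed to mirror `__arrow_c_array__(requested_schema=None)`; `export_array`, `DeviceProducer` and `CpuProducer` are hypothetical names, and the producers return placeholder strings instead of real PyCapsules.

```python
# Hypothetical consumer-side helper for option 2: prefer the device dunder,
# fall back to the existing CPU-only dunder.
def export_array(obj, requested_schema=None):
    """Return (kind, result), where kind is "device" or "cpu"."""
    if hasattr(obj, "__arrow_c_device_array__"):
        # Assumed signature: same as __arrow_c_array__.
        return "device", obj.__arrow_c_device_array__(requested_schema)
    if hasattr(obj, "__arrow_c_array__"):
        return "cpu", obj.__arrow_c_array__(requested_schema)
    raise TypeError(
        f"{type(obj).__name__} does not implement the Arrow PyCapsule protocol"
    )


class CpuProducer:
    """Stand-in producer that only implements the existing dunder."""

    def __arrow_c_array__(self, requested_schema=None):
        # A real producer returns (schema_capsule, array_capsule).
        return ("<schema capsule>", "<array capsule>")


class DeviceProducer(CpuProducer):
    """Stand-in producer that implements both dunders."""

    def __arrow_c_device_array__(self, requested_schema=None):
        return ("<schema capsule>", "<device array capsule>")


print(export_array(DeviceProducer())[0])
print(export_array(CpuProducer())[0])
```

A consumer that cannot handle device data would simply skip the first branch, which is the kind of control options 1 and 4 don't give.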

Options 2/3/4 are probably just variants of how to expose both interfaces, and thus the main initial question is whether we want to, long term, move towards an ecosystem where everyone uses the C Device Data Interface, or keep using both interfaces side by side (as the main interchange mechanism, I mean; the device interface of course still embeds the standard struct).
