Commit 64145ba

Merge pull request #2 from VoVAllen/patch-1
Update 20191016-dlpack-support.md with VoVAllen's changes
2 parents a767ccc + 1b0a94d

1 file changed: +22 −6 lines

rfcs/20191016-dlpack-support.md

Lines changed: 22 additions & 6 deletions
@@ -86,7 +86,7 @@ Proposed code of converting a Tensorflow tensor to Torch tensor using DLPack nat
 ```python
 import numpy as np
 import tensorflow as tf
-import tensorflow.???.dlpack as tfdlpack
+import tensorflow.experimental.dlpack as tfdlpack
 import torch.utils.dlpack as thdlpack
 
 
@@ -99,7 +99,27 @@ t3 = tfdlpack.from_dlpack(dlpack) # dlpack -> tf tensor
 print(t3)
 ```
 
-Package implementation details:
+Proposed API implementation details:
+There are two critical parts to this API:
+1. Memory usability on an async device (to_dlpack)
+As mentioned by @alextp:
+> TF does not use cudamalloc to allocate memory but its own allocator whose internal state is stored on the CPU and matches the head of TF's compute stream, so we need to sync TF's stream before the memory is usable from dlpack and similarly sync other cuda streams before memory is made usable by TF tensors (and similarly we need to sync the streams when trying to free the buffers).
+Here we decide to manually sync the device when exporting a TF tensor to DLPack. The sync is performed in the `TFE_TensorHandleDevicePointer` API, which returns a pointer to the underlying memory.
+
+2. Memory management (avoiding leaks) (to_dlpack/from_dlpack)
+By the design of DLPack, the framework constructing a tensor from a DLPack capsule is responsible for calling the DLPack deleter, which usually dereferences the underlying buffer, when the constructed tensor is destroyed.
+For `from_dlpack`, a deleter function is registered when constructing the TF tensor and is called upon its destruction.
+For `to_dlpack`, the DLPack data structure holds a reference (via `TensorReference`) to the underlying buffer and `unref`s it in the DLPack deleter function; a toy sketch of this ownership contract follows below.
+
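To make the ownership contract in item 2 concrete, here is a toy, framework-free Python sketch. All names in it (`FakeDLManagedTensor`, `refcount`) are illustrative stand-ins of ours, not TensorFlow or DLPack APIs: the producer packs a buffer together with a deleter, and the consumer must call that deleter exactly once when it is done with the memory.

```python
# Toy model of the DLPack ownership contract described above.
# FakeDLManagedTensor and refcount are illustrative stand-ins,
# not TensorFlow or DLPack APIs.
class FakeDLManagedTensor:
    def __init__(self, data, deleter):
        self.data = data        # stands in for DLTensor.data
        self.deleter = deleter  # stands in for DLManagedTensor.deleter

# Producer side (to_dlpack): take a reference so the buffer outlives the
# export, and release it in the deleter (TF would use TensorReference here).
refcount = {"buf": 1}
dlm = FakeDLManagedTensor(
    bytearray(24),
    lambda: refcount.update(buf=refcount["buf"] - 1))

# Consumer side (from_dlpack): use the memory, then call the deleter exactly
# once when the consuming tensor is destroyed.
view = memoryview(dlm.data)
del view
dlm.deleter()
assert refcount["buf"] == 0  # buffer handed back to TF's memory management
```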
+
+## Questions and Discussion Topics
+
+https://github.com/tensorflow/tensorflow/issues/29039#issuecomment-527520270 outlines the key issues that need to be addressed, namely that a sync is required to ensure the tensor information is valid. Supporting \_\_cuda_array_interface\_\_ is another option as well, although cuPy and cuDF have opted to support both, and ideally TensorFlow would as well.
+
+## Reference
+
+### tfdlpack package implementation detail
+
 The first design consideration is that we want to avoid any modification to the main TensorFlow library, to get around the potentially long delay of the PR, code review, and release cycle of the main TensorFlow package. Inspired by the solution from https://github.com/tobegit3hub/tftvm, we decide to implement the functionality as two custom tensor ops: to_dlpack and from_dlpack.
 
 Besides, we want this feature to plug into other projects easily. For example, any project that relies on this feature should be able to run without compiling against TensorFlow's header files. An extra dependency not only means extra effort; such maintenance is also repetitive and should be handled by the feature developers (i.e., us) alone. To this end, we release the functionality as a Python package. The question is then how to invoke the two custom tensor ops from Python. The challenge is that TensorFlow's custom op interface has limited support for argument and return types, while to_dlpack and from_dlpack need to accept/return a DLPack object. We work around this by encoding the address of the DLPack object as an integer, so it can be accepted/returned through the custom op interface. We then decode it in Python or C depending on whether we return it (to_dlpack) or consume it (from_dlpack).
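The pointer-encoding workaround described above could look roughly like the sketch below. This is our illustration, not the tfdlpack package's actual code: the helper names are hypothetical, the custom ops are assumed to exchange the `DLManagedTensor` address as a plain integer, and a real implementation would also attach a capsule destructor so the DLPack deleter still runs if the capsule is never consumed.

```python
import ctypes

# CPython's PyCapsule C API, reached via ctypes.
ctypes.pythonapi.PyCapsule_New.restype = ctypes.py_object
ctypes.pythonapi.PyCapsule_New.argtypes = [
    ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
ctypes.pythonapi.PyCapsule_GetPointer.restype = ctypes.c_void_p
ctypes.pythonapi.PyCapsule_GetPointer.argtypes = [
    ctypes.py_object, ctypes.c_char_p]

def address_to_capsule(address):
    # Wrap the integer address returned by the to_dlpack custom op in a
    # PyCapsule named "dltensor", the name DLPack consumers look for.
    # (No destructor here for brevity; a real one is needed to avoid leaks.)
    return ctypes.pythonapi.PyCapsule_New(address, b"dltensor", None)

def capsule_to_address(capsule):
    # Recover the DLManagedTensor address so the from_dlpack custom op can
    # take it as a plain integer input.
    return ctypes.pythonapi.PyCapsule_GetPointer(capsule, b"dltensor")

# Demo with a fake (never-dereferenced) address, just to show the round trip.
addr = 0x7F00DEADBEE0
capsule = address_to_capsule(addr)
assert capsule_to_address(capsule) == addr
```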
@@ -108,7 +128,3 @@ Finally, to achieve the maximal efficiency, we want the conversion happens witho
 
 For to_dlpack, the returned DLPack tensor shares the same memory address as the input TensorFlow tensor and holds a reference to it. Upon the destruction of the DLPack tensor, it dereferences the TensorFlow tensor so the memory can be collected by TensorFlow's memory management (inspired by PyTorch's DLPack implementation).
 For from_dlpack, it first creates an allocator object (subclassing TensorFlow's allocator interface) that holds a reference to the DLPack tensor. Its AllocateRaw function directly returns the memory it holds without creating any new buffer; upon destruction, its DeallocateRaw function simply calls the deleter of the DLPack tensor (inspired by TensorFlow's immutable_constant_op).
-
-## Questions and Discussion Topics
-
-https://github.com/tensorflow/tensorflow/issues/29039#issuecomment-527520270 Outlines the key issues that need to be addressed, namely that a synch is required to ensure the tensor information is valid. Supporting \_\_cuda_array_interface\_\_ is another option as well, although cuPy and cuDF have opted to support both and ideally Tensorflow would as well.
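For reference, a round trip under the proposed API might look as follows. This sketch is based on the example earlier in this RFC; the comments about buffer lifetime and single-use capsules restate the design described above (and the DLPack PyCapsule convention), not tested behavior.

```python
import numpy as np
import tensorflow as tf
import tensorflow.experimental.dlpack as tfdlpack
import torch.utils.dlpack as thdlpack

# Produce a TF tensor and export it; per the to_dlpack design above, the
# capsule holds a TensorReference to the underlying buffer.
t1 = tf.constant(np.arange(6, dtype=np.float32))
dlpack = tfdlpack.to_dlpack(t1)

# Consume the capsule in PyTorch: t2 aliases the same memory, no copy.
t2 = thdlpack.from_dlpack(dlpack)

# t1 can now be dropped; the buffer stays alive until t2's destruction runs
# the DLPack deleter, which unrefs the TF buffer.
del t1
print(t2)

# Note: a DLPack capsule is single-use. Its consumer takes ownership and
# calls the deleter, so passing `dlpack` to from_dlpack again is an error.
```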
