StreamFlags::NON_BLOCKING is unsound because of fringe asynchronous memory copy behavior in CUDA

Streams created with `NON_BLOCKING` behave in a confusing and dangerous way with regard to memcpy because of odd CUDA semantics. Per the [driver API docs](https://docs.nvidia.com/cuda/cuda-driver-api/api-sync-behavior.html#api-sync-behavior):

> For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.

Because `NON_BLOCKING` streams do not synchronize with the null (default) stream, work launched on them can start before the DMA to the final destination has finished, which is a potential race condition. NVIDIA appears to be aware of this issue, but in the meantime it may be beneficial to implicitly disable `NON_BLOCKING`, especially since cust does not expose stream-ordered memory allocation.

This appears to be why the `add` example sometimes does nothing on certain systems.
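
To make the ordering problem concrete, here is a minimal toy model in plain Rust. It is not CUDA or cust code: `ToyDevice`, `StreamKind`, and `kernel_sees` are hypothetical names invented for illustration, and the model assumes the in-flight DMA is ordered only against the null stream, as described above. It also makes the race deterministic (worst case), whereas on real hardware the outcome depends on timing.

```rust
/// How a stream relates to the null (default) stream in this toy model.
#[derive(Clone, Copy, Debug, PartialEq)]
enum StreamKind {
    /// Waits for earlier null-stream work before running.
    Blocking,
    /// Does not wait for the null stream (models `NON_BLOCKING`).
    NonBlocking,
}

/// Hypothetical stand-in for one device memory cell plus the DMA staging area.
#[derive(Default)]
struct ToyDevice {
    memory: Option<u32>, // None = the copy has not landed yet
    staged: Option<u32>, // copied to staging, DMA still in flight
}

impl ToyDevice {
    /// Models a pageable host-to-device copy: it returns once the value is
    /// staged, not once it has reached device memory.
    fn copy_from_pageable(&mut self, value: u32) {
        self.staged = Some(value);
    }

    /// Models the in-flight DMA landing, e.g. after an explicit synchronize.
    fn finish_dma(&mut self) {
        if let Some(v) = self.staged.take() {
            self.memory = Some(v);
        }
    }

    /// Models a kernel that reads the cell. Only a blocking stream waits
    /// for the pending null-stream work first.
    fn launch_read(&mut self, kind: StreamKind) -> Option<u32> {
        if kind == StreamKind::Blocking {
            self.finish_dma();
        }
        self.memory
    }
}

/// Copies 42 to the device, optionally synchronizes, then launches a read.
fn kernel_sees(kind: StreamKind, sync_first: bool) -> Option<u32> {
    let mut dev = ToyDevice::default();
    dev.copy_from_pageable(42);
    if sync_first {
        dev.finish_dma();
    }
    dev.launch_read(kind)
}
```

In this model, `kernel_sees(StreamKind::NonBlocking, false)` returns `None` (the kernel ran before the data arrived), while a blocking stream, or a non-blocking stream preceded by an explicit synchronize, returns `Some(42)`.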
