Closed
Description
In the `ForRange` struct, the block size (threads per block) is set to `limit_` whenever `limit_` is at most 1024, so it can be an arbitrary value that is not a multiple of the warp size.
From what I have read and heard, the number of threads assigned to a block should always be a multiple of the warp size (32). Otherwise the remaining lanes of the last warp go unused, and performance may also drop because of poor memory coalescing. However, I have not found a comparative experiment on this.
Paddle/paddle/platform/for_range.h
Lines 65 to 75 in 7bf47ea
    constexpr size_t num_threads = 1024;
    int block_size = limit_ <= num_threads ? limit_ : num_threads;
    int grid_size = (limit_ + num_threads - 1) / num_threads;
    if (grid_size == 1) {
      ForRangeElemwiseOpGridIsOne<<<1, block_size, 0, dev_ctx_.stream()>>>(
          func);
    } else {
      ForRangeElemwiseOp<<<grid_size, block_size, 0, dev_ctx_.stream()>>>(
          func, limit_);
    }
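To make the suggestion concrete, here is a minimal host-side sketch of how the block size could be rounded up to a multiple of the warp size. The names `kWarpSize`, `kMaxThreads`, `RoundUpBlockSize`, and `GridSize` are hypothetical and do not exist in Paddle. The sketch assumes a warp size of 32 and a maximum of 1024 threads per block, and it is not meant as the actual fix.

```cpp
#include <cstddef>

// Assumed constants (hypothetical names, not taken from Paddle).
constexpr std::size_t kWarpSize = 32;     // warp size assumed in this issue
constexpr std::size_t kMaxThreads = 1024; // same cap as num_threads above

// Round the requested thread count up to the next multiple of the warp
// size, capped at kMaxThreads. A limit of 0 still yields one full warp.
inline int RoundUpBlockSize(std::size_t limit) {
  if (limit == 0) return static_cast<int>(kWarpSize);
  if (limit >= kMaxThreads) return static_cast<int>(kMaxThreads);
  return static_cast<int>((limit + kWarpSize - 1) / kWarpSize * kWarpSize);
}

// Same ceiling division as the grid_size computation quoted above.
inline int GridSize(std::size_t limit) {
  return static_cast<int>((limit + kMaxThreads - 1) / kMaxThreads);
}
```

With this sketch, a `limit_` of 100 would give a block size of 128 instead of 100. Because a rounded-up block can contain more threads than `limit_`, any kernel launched this way would need to compare its index against `limit_` before calling `func`.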