Call cost floor + FP16 speed boost / old GPUs #245
blefaudeux started this conversation in General
- Hmm, how are you measuring this? I tried a vector addition with tiny vectors and saw that torch and triton had the same overhead.
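  Something along these lines (a minimal sketch, adapted from the vector-add tutorial and written with the current `tl.constexpr` style; `do_bench` reports timings in ms and the exact numbers will depend on the GPU and Triton version):

  ```python
  import torch
  import triton
  import triton.language as tl


  @triton.jit
  def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
      # Each program handles one BLOCK_SIZE-wide slice of the vectors.
      pid = tl.program_id(axis=0)
      offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
      mask = offsets < n_elements
      x = tl.load(x_ptr + offsets, mask=mask)
      y = tl.load(y_ptr + offsets, mask=mask)
      tl.store(out_ptr + offsets, x + y, mask=mask)


  def triton_add(x, y):
      out = torch.empty_like(x)
      n = x.numel()
      grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
      add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
      return out


  # Tiny vectors: the timing is dominated by launch overhead, not bandwidth.
  x = torch.rand(128, device="cuda")
  y = torch.rand(128, device="cuda")
  print("triton:", triton.testing.do_bench(lambda: triton_add(x, y)))
  print("torch :", triton.testing.do_bench(lambda: x + y))
  ```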
- Changing the benchmark method (moving outside of torch.utils, and confirmed by real-life measurements) resolves both observations: fp16 gets a significant boost even on old hardware, and there is no extra call cost.
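  For reference, a raw CUDA-event timer is one way to measure outside of torch.utils.benchmark, amortizing the launch over many calls (a minimal sketch, not necessarily the exact setup used here):

  ```python
  import torch


  def cuda_time_us(fn, iters=1000, warmup=50):
      """Average wall time per call in microseconds, measured with CUDA events."""
      for _ in range(warmup):
          fn()
      torch.cuda.synchronize()
      start = torch.cuda.Event(enable_timing=True)
      end = torch.cuda.Event(enable_timing=True)
      start.record()
      for _ in range(iters):
          fn()
      end.record()
      torch.cuda.synchronize()
      return start.elapsed_time(end) * 1e3 / iters  # elapsed_time() is in ms


  # Example: compare fp32 vs fp16 softmax on the same data.
  x32 = torch.randn(4096, 1024, device="cuda")
  x16 = x32.half()
  print("fp32:", cuda_time_us(lambda: torch.softmax(x32, dim=-1)), "us")
  print("fp16:", cuda_time_us(lambda: torch.softmax(x16, dim=-1)), "us")
  ```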
- Poking around the fused softmax tutorial, and extending it a little to make it comparable to torch.softmax() (mostly by adding autograd; rough sketch below), I'm trying to understand how the performance compares to the heavily optimized PyTorch CUDA kernels across a couple of axes. I'm curious to get some context on two observations, which come from micro-benchmarks on a P100 (arguably a little old):
  - there seems to be a time floor on any call to a Triton kernel: I'm seeing around 70us, even if the kernel was called just before and there is no autotuning. Is that expected? The call time for PyTorch kernels can go much lower, towards a couple of us.
  - using the same softmax kernel on fp16 data requires very little change (see tl.exp() and torch.float16 crash unceremoniously #241), but it does not bring any speed benefit. This is somewhat expected compute-wise on a P100, but I would expect some bandwidth benefit, and testing against PyTorch does show one (i.e. torch.softmax() on fp16 data is a little faster than on fp32 on a P100). Is that expected, or could it be that some memory allocations are hardcoded to fp32 within the IR, for instance?
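  The autograd extension is roughly the following (a minimal sketch; `softmax_fwd` is a placeholder name for the tutorial's Triton forward kernel, assumed to softmax over the last dimension, and the backward is computed analytically from the saved output):

  ```python
  import torch


  class TritonSoftmax(torch.autograd.Function):
      """Wraps a Triton forward kernel so it can be compared with torch.softmax
      in a training-like setting."""

      @staticmethod
      def forward(ctx, x):
          y = softmax_fwd(x)  # Triton kernel launch (fused-softmax tutorial)
          ctx.save_for_backward(y)
          return y

      @staticmethod
      def backward(ctx, grad_out):
          (y,) = ctx.saved_tensors
          # dL/dx = y * (dL/dy - sum(dL/dy * y) over the softmax dimension)
          return y * (grad_out - (grad_out * y).sum(dim=-1, keepdim=True))


  def triton_softmax(x):
      return TritonSoftmax.apply(x)
  ```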
  Thanks! cc @ptillet