An ASIC will be 4 times more efficient with these two operations because a, b are 32-bit integers:
case 1: return a * b;
case 2: return mul_hi(a, b);
32-bit integer multiplications are inefficient on GPUs because GPUs only have a 24-bit wide data path for multiplication, so a 32-bit MUL is 4 times slower than a 24-bit MUL. It's better to use mul24 here.
Side note: it's a shame that OpenCL still doesn't have mul24_hi, but CUDA has it.
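To make the trade-off concrete, here is a minimal host-side C sketch of what the three operations compute for unsigned operands. The names emu_mul32, emu_mul24 and emu_mul_hi are hypothetical helpers for illustration, not OpenCL API. One assumption to flag loudly: OpenCL leaves mul24 implementation-defined when an operand does not fit in 24 bits, and this sketch models just one possible behavior (masking to the low 24 bits).

```c
#include <stdint.h>

/* Plain 32-bit multiply, as in "return a * b;": low 32 bits of the product. */
static uint32_t emu_mul32(uint32_t a, uint32_t b) {
    return a * b;
}

/* Emulation of mul24 for unsigned operands.
   ASSUMPTION: out-of-range inputs are masked to their low 24 bits. The
   OpenCL spec only defines the result when both inputs fit in 24 bits. */
static uint32_t emu_mul24(uint32_t a, uint32_t b) {
    uint64_t p = (uint64_t)(a & 0xFFFFFFu) * (uint64_t)(b & 0xFFFFFFu);
    return (uint32_t)p; /* low 32 bits of the 48-bit product */
}

/* Emulation of mul_hi for unsigned operands: high 32 bits of the
   full 64-bit product. */
static uint32_t emu_mul_hi(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * (uint64_t)b) >> 32);
}
```

When both operands fit in 24 bits, emu_mul24 and emu_mul32 agree, so the swap is safe. Once an operand uses more than 24 bits the two can differ, which means replacing a * b with mul24 changes the function being computed unless the inputs are restricted first.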