Add fp16 mul op support and bind paddle fp16 to numpy fp16 #9017
Conversation
paddle/fluid/operators/mul_op.cc
Outdated
both input tensors to float16 data types if needed and use the float16
compute kernel to generate the output tensor also in float16 data type.
This attribute is by default false and normally would only be set to
true in inference stage for performance optimization.
The Volta generation of GPUs introduces Tensor Cores, which provide 8x more throughput than single precision math pipelines. Each Tensor Core performs D = A x B + C, where A, B, C and D are matrices. A and B are half-precision 4x4 matrices, whereas D and C can be either half or single precision 4x4 matrices. In other words, Tensor Core math can accumulate half precision products into either single or half-precision outputs.
Read more at: http://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#ixzz59ge7vMHO
It seems that the output of Tensor Cores can be either fp16 or fp32.
For now, we only take fp16 inference into consideration, right?
If all the operators support an fp16 kernel and the input data is fp16, our framework will choose the fp16 kernel automatically.
I am not sure whether we need this use_float16 attribute. For now, we have this attribute for MulOp; do we need to add it to other operators, like SumOperator, as well?
Or we can add a cast operator first if the input data is fp32. After the input data is cast to fp16, the rest of the operators will choose the fp16 kernel automatically.
For now, we check the data types of an operator's inputs and enforce that they are the same. I am not sure what will happen if one input is fp16 and the other input is fp32.
It seems that the output of Tensor Cores can be either fp16 or fp32.
For now, we only take fp16 inference into consideration, right?
To use Tensor Cores to calculate C = A * B, where A and B are both fp16, we have three ways to get the output C:
- If C is fp16, then we can use cublasHgemm, and the compute type is fp16.
- If C is fp16, we can also use cublasGemmEx() with the compute type set to fp32, meaning that internally the computation is done in fp32.
- If C is fp32, we can only use cublasGemmEx(), with the compute type set to fp32.
Not sure which mode is the most computationally efficient one.
Right now, we only consider generating fp16 output, but introducing cublasGemmEx() and providing the option of fp16 gemm with fp32 output can be a future to-do item. The sketch below shows why the compute type matters numerically.
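As a rough illustration of why the compute type matters, here is a small NumPy sketch. It is not cuBLAS or Tensor Core code, just a scalar simulation of one output element, and the vector length and random values are made up:

```python
import numpy as np

# One output element C[i][j] = sum_k A[i][k] * B[k][j] with fp16 inputs,
# accumulated either in fp32 (cublasGemmEx-style compute type) or in fp16
# (cublasHgemm-style compute type). This is only a conceptual simulation;
# real Tensor Cores keep the individual products in higher precision.
rng = np.random.RandomState(0)
a = rng.rand(1024).astype(np.float16)
b = rng.rand(1024).astype(np.float16)

# Reference result in float64.
ref = np.dot(a.astype(np.float64), b.astype(np.float64))

# fp16 products accumulated in fp32, rounded to fp16 once at the end.
acc32 = np.float32(0.0)
for x, y in zip(a, b):
    acc32 += np.float32(x) * np.float32(y)
out_fp32_acc = np.float16(acc32)

# Everything in fp16: every partial sum is rounded back to fp16.
acc16 = np.float16(0.0)
for x, y in zip(a, b):
    acc16 = np.float16(acc16 + x * y)
out_fp16_acc = acc16

print("fp32-accumulation error:", abs(float(out_fp32_acc) - ref))
print("fp16-accumulation error:", abs(float(out_fp16_acc) - ref))
```

In general, the fp16 accumulation drifts noticeably further from the reference, which is why keeping the compute type at fp32 is attractive even when the output is fp16.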
For now, we check the data types of an operator's inputs and enforce that they are the same. I am not sure what will happen if one input is fp16 and the other input is fp32.
Before the operator calls the compute kernel, it will compare the expected data type (via GetExpectedKernelType()) with the actual data type (via GetKernelTypeForVar()) for each input tensor and do data_type_transform (similar to cast op) if necessary.
The default GetExpectedKernelType() would indeed enforce the input data types to be the same. That is why I override GetExpectedKernelType(), so that we can deal with the situation where one input is fp16 and the other is fp32 (data_type_transform is applied where necessary by comparing each tensor's data_type to the expected data_type).
I am not sure whether we need this use_float16 attribute. For now, we have this attribute for MulOp; do we need to add it to other operators, like SumOperator, as well?
Or we can add a cast operator first if the input data is fp32. After the input data is cast to fp16, the rest of the operators will choose the fp16 kernel automatically.
Good point! After some thought, I agree that we'd prefer not to add the use_float16 attribute, so that we don't accept input tensors with different data types. We can use the cast op to bridge different operators if needed. Adding the use_float16 attribute would complicate the code and make it more error-prone.
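For example, a minimal sketch of the cast-op approach, assuming the fluid Python API of that time (the exact layer names, fluid.layers.cast and fluid.layers.mul, are assumptions here, not something this PR adds):

```python
import paddle.fluid as fluid

# fp32 inputs from the outside world.
x = fluid.layers.data(name='x', shape=[32], dtype='float32')
y = fluid.layers.data(name='y', shape=[32, 64], dtype='float32',
                      append_batch_size=False)

# Bridge with the cast op instead of a per-operator use_float16 attribute.
x_fp16 = fluid.layers.cast(x=x, dtype='float16')
y_fp16 = fluid.layers.cast(x=y, dtype='float16')

# Both inputs are now fp16, so the framework can pick the fp16 mul kernel
# automatically without any extra attribute on the operator.
out = fluid.layers.mul(x=x_fp16, y=y_fp16)
```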
actual = outs[idx]
actual_t = np.array(actual)
# paddle float16 is exposed to python as uint16 type
# reinterpret the memory as numpy.float16
Why not just expose it to Python as float16 directly?
Done.
QiJune left a comment
LGTM!
fix #8816
fix #9021
numpy float16 is internally represented as numpy.uint16. Hence, we create the binding with the help of the uint16_t type.
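As a small illustration of that bit-level equivalence, here is a NumPy sketch (the hex constants are just example IEEE-754 half-precision bit patterns):

```python
import numpy as np

# float16 and uint16 are both 16-bit types, so the same buffer can be
# viewed as either one without copying or converting any data.
bits = np.array([0x3C00, 0x4000, 0xC000], dtype=np.uint16)  # raw half bits
vals = bits.view(np.float16)   # reinterpret the memory -> [1.0, 2.0, -2.0]
back = vals.view(np.uint16)    # round-trips to the original bit patterns

print(vals)                                                # [ 1.  2. -2.]
print(np.array_equal(back, bits))                          # True
print(np.dtype(np.float16).itemsize == np.dtype(np.uint16).itemsize)  # True
```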