[SIMD] auto-vectorization using instruction usdot #63971
Comments
@llvm/issue-subscribers-backend-aarch64
We should add udot/sdot first, although usdot does look like it could hopefully be handled very similarly. I believe LLVM will probably need some sort of "partial reduction" intrinsic in order to teach the vectorizer to do the <16 x i8> to <4 x i32> reduction in the loop, with the <4 x i32> reduction in the remainder as usual.
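To make the "partial reduction" idea concrete, here is a minimal scalar sketch in plain C, not LLVM's actual intrinsic. The names `partial_reduce_16` and `dot_partial` are hypothetical, and the lane grouping (four adjacent products per i32 lane) is an assumption modelled on how the dot-product instructions accumulate.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: models one usdot-like step on a 16-byte block.
   Lane j accumulates the four products of elements 4j .. 4j+3,
   i.e. a <16 x i8> input is reduced to a <4 x i32> accumulator. */
static void partial_reduce_16(int32_t acc[4], const uint8_t *a, const int8_t *b)
{
    for (int lane = 0; lane < 4; ++lane)
        for (int k = 0; k < 4; ++k)
            acc[lane] += (int32_t)a[4 * lane + k] * (int32_t)b[4 * lane + k];
}

/* Hypothetical driver: partial reduction inside the loop, full reduction after it. */
int32_t dot_partial(const uint8_t *a, const int8_t *b, size_t n)
{
    int32_t acc[4] = {0, 0, 0, 0};
    size_t i = 0;

    /* Main loop: 16 elements per iteration, accumulator stays 4 lanes wide. */
    for (; i + 16 <= n; i += 16)
        partial_reduce_16(acc, a + i, b + i);

    /* Final <4 x i32> horizontal reduction, done once after the loop. */
    int32_t sum = acc[0] + acc[1] + acc[2] + acc[3];

    /* Scalar tail for the leftover elements. */
    for (; i < n; ++i)
        sum += (int32_t)a[i] * (int32_t)b[i];

    return sum;
}
```

The point of the split is that the loop body never needs a full horizontal add; only the final step collapses the four lanes into one value.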
@davemgreen, with SVE enabled, the generated code doesn't look that bad, although I'm not sure about the cost model of a |
The SVE implementation does indeed look better, with the extending loads. It is performing 4 elements per iteration (times the SVE runtime vector size), whereas the usdot can do 16 i8s per iteration. There is an SVE usdot instruction too, which would be preferred to mla if it can pick the larger vector factor, so that it can do more per iteration. The LoopVectorizer (probably) needs a target-independent way of representing the intrinsic. There is a NEON/SVE-specific intrinsic for udot/sdot/usdot, but that will only handle certain sizes and will not be generic to other targets.
This issue still exists with #69587
This and this patch laid the groundwork for this; both patches have been merged. They introduce the partial reduction intrinsic that Dave talked about, which is required for this. The loop vectoriser is now being taught to emit it in this patch. However, that is not going to help this case: the loop gets fully unrolled, so I don't think the loop vectoriser ever gets to see this loop. Fully unrolling small loops is considered a canonicalisation step, so I guess we will have to teach the SLP vectoriser the same trick.
We might be able to do it in the backend if this is already unrolled. The backend will generate a udot/sdot from a vecreduce.add(mul(ext, ext)). It cannot generate usdot yet, but that should hopefully be a relatively simple addition to what is already present. From looking at https://gcc.godbolt.org/z/cGnjE9ddG, we might need to reassociate the add out of the way of |
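As a rough illustration of the pattern mentioned above, the following C sketch shows two fixed-size kernels whose source shape corresponds to an add-reduction over widened products. The function names `udot16` and `usdot16` are hypothetical and are not taken from the linked test case. The first extends both operands the same way (the same-sign form), the second zero-extends one operand and sign-extends the other (the mixed-sign form that would map to usdot).

```c
#include <stdint.h>

/* Hypothetical same-sign kernel: add-reduce of zext(a[i]) * zext(b[i]). */
uint32_t udot16(const uint8_t a[16], const uint8_t b[16])
{
    uint32_t sum = 0;
    for (int i = 0; i < 16; ++i)
        sum += (uint32_t)a[i] * (uint32_t)b[i];
    return sum;
}

/* Hypothetical mixed-sign kernel: add-reduce of zext(a[i]) * sext(b[i]). */
int32_t usdot16(const uint8_t a[16], const int8_t b[16])
{
    int32_t sum = 0;
    for (int i = 0; i < 16; ++i)
        sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
}
```

With a constant trip count of 16, loops like these are candidates for full unrolling, which is the situation the comment describes.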
Vectorizes now with usdot when using |
test: https://gcc.godbolt.org/z/f86hxd8cT
According to gcc-12, Armv8.6-A introduced a new dot-product instruction, usdot, for when the signs of the operands differ. This instruction is enabled by the +i8mm compiler flag.
Starting with GCC 12, the auto-vectorizer can automatically recognize and use this instruction, while LLVM can't.
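For readers without the linked test at hand, a loop of the following shape is the kind of code the description refers to. This is a minimal sketch, not the contents of the godbolt link; the function name `usdot_loop` is hypothetical. One operand is unsigned 8-bit, the other is signed 8-bit, and the products accumulate into a 32-bit sum.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mixed-sign dot product: unsigned bytes times signed bytes,
   accumulated in 32 bits. restrict tells the compiler the arrays do not alias. */
int32_t usdot_loop(const uint8_t *restrict a, const int8_t *restrict b, size_t n)
{
    int32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += (int32_t)a[i] * (int32_t)b[i];
    return sum;
}
```

Building it with something like `-O3 -march=armv8.6-a+i8mm` and inspecting the assembly is one way to check whether a given compiler version emits usdot; the exact flags in the linked test may differ.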