
[SIMD] auto-vectorization using instruction usdot #63971

Closed
vfdff opened this issue Jul 20, 2023 · 8 comments

@vfdff
Contributor

vfdff commented Jul 20, 2023

test: https://gcc.godbolt.org/z/f86hxd8cT

#define N 480

unsigned int
f (unsigned int res, signed char *restrict a,
   unsigned char *restrict b)
{
  for (__INTPTR_TYPE__ i = 0; i < N; ++i)
    {
      int av = a[i];
      int bv = b[i];
      signed short mult = av * bv;
      res += mult;
    }
  return res;
}

According to gcc-12, Armv8.6-A introduced a new dot-product instruction, usdot, for operands whose signedness differs. The instruction is enabled by the +i8mm compiler flag.

Starting with GCC 12, the auto-vectorizer recognizes and uses this instruction automatically, while LLVM does not.
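
For readers unfamiliar with the instruction, here is a minimal scalar sketch in C of the arithmetic one 128-bit usdot performs: each of the four 32-bit lanes accumulates four unsigned-byte by signed-byte products. The name `usdot_model` and its array-based signature are hypothetical, chosen for illustration only; this models the arithmetic and is not the intrinsic or any compiler output.

```c
#include <stdint.h>

/* Hypothetical scalar model of one 128-bit usdot:
   acc[lane] += sum over k in 0..3 of u[4*lane + k] * s[4*lane + k],
   where u holds unsigned bytes and s holds signed bytes.
   The accumulators are uint32_t so that overflow wraps. */
void usdot_model(uint32_t acc[4], const uint8_t u[16], const int8_t s[16])
{
    for (int lane = 0; lane < 4; ++lane) {
        int32_t sum = 0;
        for (int k = 0; k < 4; ++k) {
            /* Widen both bytes to 32 bits before multiplying. */
            sum += (int32_t)u[4 * lane + k] * (int32_t)s[4 * lane + k];
        }
        acc[lane] += (uint32_t)sum;
    }
}
```

In the test case above, a[] is the signed operand and b[] the unsigned one; since multiplication commutes, the operands only need to be passed in the matching positions.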

@llvmbot
Member

llvmbot commented Jul 20, 2023

@llvm/issue-subscribers-backend-aarch64

@davemgreen
Collaborator

We should add udot/sdot first, although usdot does look like it could hopefully be handled very similarly.

I believe llvm will probably need some sort of "partial reduction" intrinsic in order to teach the vectorizer to do the <16 x i8> to <4 x i32> reduction in the loop, with the <4 x i32> reduction in the remainder as usual.
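
The loop shape described here can be sketched in plain C, under the assumption that the trip count is a multiple of 16 (as N = 480 is). The function name `f_partial` is hypothetical. The inner loop stands in for what a single dot-product instruction would do, and the four-element array plays the role of the <4 x i32> accumulator; this is a scalar illustration, not what the vectorizer emits.

```c
#include <stdint.h>

#define N 480  /* same trip count as the test case; a multiple of 16 */

/* Sketch of the vectorized loop shape: four 32-bit partial sums live
   across iterations, and each iteration folds 16 byte products into
   them (the <16 x i8> to <4 x i32> partial reduction). */
unsigned int f_partial(unsigned int res, const signed char *a,
                       const unsigned char *b)
{
    uint32_t acc[4] = {0, 0, 0, 0};

    for (int i = 0; i < N; i += 16) {
        for (int j = 0; j < 16; ++j) {
            /* Both bytes promote to int, so the product is exact. */
            int prod = a[i + j] * b[i + j];
            acc[j / 4] += (uint32_t)prod;
        }
    }

    /* Final <4 x i32> reduction, done once after the loop. */
    return res + acc[0] + acc[1] + acc[2] + acc[3];
}
```

Because the accumulation is done in unsigned 32-bit arithmetic, regrouping the additions this way gives the same result as the original scalar loop.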

@antoniofrighetto
Contributor

@davemgreen, with SVE enabled, the generated code doesn't look that bad, although I'm not sure about the cost model of a mla over usdot. Also, isn't usdot already part of the target description? Should the intrinsic likely be handled in LV?

@davemgreen
Collaborator

The SVE implementation does indeed look better, with the extending loads. It is performing 4 elements per iteration (times the SVE runtime vector size), whereas the usdot can do 16 i8's per iteration. There is an SVE usdot instruction too, which would be preferred to mla if it can pick the larger vectorization factor, so that it can do more per iteration.

The LoopVectorizer needs a target-independent way of representing the intrinsic (probably). There is a neon/sve specific intrinsic for udot/sdot/usdot, but that will only handle certain sizes and not be generic to other targets.

@vfdff
Contributor Author

vfdff commented Dec 13, 2023

This issue still exists with #69587

@sjoerdmeijer
Collaborator

This and this patch laid the groundwork for this; both patches have been merged. They introduce the partial reduction intrinsic that Dave talked about, which is required for this. The loop vectoriser is now being taught to emit it in this patch.

However, that is not going to help this case: the loop gets fully unrolled, so I don't think the loop vectorizer ever gets to see this loop. Fully unrolling small loops is considered a canonicalisation step, so I guess we will have to teach the SLP vectoriser the same trick.

@davemgreen
Collaborator

We might be able to do it in the backend if this is already unrolled. The backend will generate a udot/sdot from a vecreduce.add(mul(ext, ext)). It cannot generate usdot yet, but that should hopefully be a relatively simple addition to what is already present.

From looking at https://gcc.godbolt.org/z/cGnjE9ddG, we might need to reassociate the add out of the way of vecreduce.add(add(mul(ext, ext), ...)) in order to match a dot operation.
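
As a rough illustration of why that reassociation is legal, the C sketch below compares the two shapes on a 16-element vector. The names `reduce_fused` and `reduce_split` and the extra operand `x` are hypothetical; what the inner add actually combines with in the linked output is not stated in this thread, so `x` is just an arbitrary vector. The point is only that, with wrapping 32-bit arithmetic, pulling the add outside the reduction leaves a bare reduce(mul(ext, ext)), which is the shape the backend already matches.

```c
#include <stdint.h>

/* Shape that blocks the match: reduce(add(mul(ext, ext), x)). */
uint32_t reduce_fused(const int8_t a[16], const uint8_t b[16],
                      const uint32_t x[16])
{
    uint32_t r = 0;
    for (int i = 0; i < 16; ++i)
        r += (uint32_t)((int32_t)a[i] * (int32_t)b[i]) + x[i];
    return r;
}

/* Reassociated shape: reduce(mul(ext, ext)) + reduce(x).
   The first loop is now a plain dot product. */
uint32_t reduce_split(const int8_t a[16], const uint8_t b[16],
                      const uint32_t x[16])
{
    uint32_t dot = 0, rest = 0;
    for (int i = 0; i < 16; ++i)
        dot += (uint32_t)((int32_t)a[i] * (int32_t)b[i]);
    for (int i = 0; i < 16; ++i)
        rest += x[i];
    return dot + rest;
}
```

Both functions return the same value for all inputs because unsigned addition is associative and commutative modulo 2^32.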

@fhahn
Contributor

fhahn commented May 7, 2025

Vectorizes now with usdot when using -march=armv8.6-a https://gcc.godbolt.org/z/K9ejEbeP5

@fhahn fhahn closed this as completed May 7, 2025

8 participants