Thoughts about long vectors #2
See the distilled version of this discussion here: https://github.com/WebAssembly/design/blob/fc886f061e825d154aca910f9ab3b511b7fdf4ea/FutureFeatures.md#long-simd |
I fail to see where a "dynamic" vector width has been discussed. So let me answer (very quickly) the questions from the FutureFeatures document:
Pretty well for most operations.
GPUs should be pretty similar, but with very long vectors.
This is a hard bit. Auto-vectorization would be mostly fine (SVE compilers are able to handle this). But hand-written C (or C++) code with "dynamically" sized vectors would be difficult: the standard is not suited for that yet.
In that case, none. But interfaces would/should be similar.
None. At least no more than short SIMD. Sources of non-determinism: fast reciprocals, optionally fused add-mul, unspecified reduction orders...
Every single architecture will be able to support such a long SIMD interface, as the size could be one. This is my gut feeling; I am answering quickly, without deep and long reasoning on the questions. |
No, they were not. |
Going through all the links you gave me (recursively), I only see discussions from 2015, which is before SVE was announced in August 2016. Dynamic size is, IMO, the key feature a long SIMD API should have. |
Yes, that is correct. It doesn't make my statement incorrect. |
I don't get you, then... |
(drive-by mention) Before ARM SVE, there were Cray vectors, and RISC-V will probably get something in the vein of Cray/SVE vectors before (fixed-size) "packed SIMD". Not sure this repo is the appropriate place to discuss non-fixed-size vectors, as the interesting implementations are quite different from "packed SIMD", and require some concept of "machine register of unknown/dynamic size" at compile-time (and in WASM's case, also in the bytecode). |
@lemaitre I suspect @jfbastien is implying that, while the discussions were before the public announcement of SVE, there may have been parties to the conversation that were aware of the SVE work, and those could have/did find ways of raising the option without violating any agreements they might be bound by. |
Just to confirm, in discussions where I've been a participant, "long vectors" mean dynamic vector lengths. The classic example of this is the Cray family of vector processors. These ideas came to mind as we started thinking about how best to expose AVX2, AVX-512, and possibly GPUs, in a portable way, if possible. SVE is now another architecture that might benefit, but it isn't what's driving these conversations, in part because so few chips actually include it. I see long vectors as being complementary to short vectors. That is, they both have things they can do better than the other (there is also overlap). As such, adding short vectors to wasm won't necessarily preclude adding long vectors in the future. |
I prefer "packed SIMD" vs "flexible/scalable/Cray vectors", in terms of nomenclature, because it makes the distinction clearer, IMO. @gnzlbg and I also agreed that they could coexist in the Rust ecosystem, especially given that packed SIMD exists today in many architectures, and even if you could support a nicer programming model by implementing dynamic length vectors on top of it, LLVM is barely beginning to get some support for it, so we'd have to either write our own backend support for them or wait. I was looking for a discussion on this for WASM because it "feels" like WASM could benefit from dynamic vector lengths in the bytecode, but if the tooling isn't ready, to generate appropriate (e.g. SSE) native SIMD instructions, then it'd be all for naught. |
There is not much that I can add here beyond that I agree with what @sunfishcode stated here (https://github.com/WebAssembly/simd/issues/29#issuecomment-414205285): variable-length vectors and packed vectors are orthogonal. @rkruppe is implementing variable-length vector support in LLVM and is aware of the similarities and differences between ARM SVE and RISC-V vectors. I don't know whether he has given some thought to WASM yet - the LLVM WASM backend is just starting to gain SIMD support. |
I re-read some of the old topics about long vectors; while it is not clear that they are about "non-fixed" size vectors, it is true that the idea is mentioned (at least here for SIMD.js). I never studied how Cray machines work, so I might be incorrect, but I have the impression they do not scale well with the complexity of the computation. Regarding GPUs, it's true they are some kind of SIMD machine, but they definitely don't have a SIMD ISA: the parallelism is not exposed by the instructions. I think that the best way to have "long vectors" in WASM is to use the same idea as SVE: the generation from such an ISA would be mostly trivial, as it would fit directly into the target SIMD ISA. On the naming, I agree "long vectors" is not good. I'm not sure "scalable vectors" would be permitted because of SVE. I'm not a big fan of "Cray vectors". I also agree that short and long vectors are two different beasts and are actually orthogonal. Now, let me introduce some straw-man syntax to continue the discussion further. There is a new vector type whose size is not known when the WASM is generated. We can require its size to be either a power of 2, or a multiple of 128 bits (the latter is probably more flexible, but the former is probably simpler to generate WASM for). Interpretation of the content:
Regular operations are the same as in packed SIMD: applied lane-wise. Extra operations:
Here is a correspondence table (only for F32, for simplicity):
You can see with this table how immediate the translation is for simple instructions. But we need masking, and we need to be clever about it. Let's introduce some mask types:
N represents the bit-width of the lanes being predicated.
Every single vector instruction would be masked, but the masks live in a second stack. Some extra instructions would be needed to take care of the mask stack. All in all, I really think that with this design, the WASM->ASM compilation for "long vectors" would be as simple as for "short vectors". |
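For reference, here is roughly what this fully-predicated, size-agnostic style looks like at the source level with the SVE C intrinsics (ACLE). This is only an illustrative sketch of the model the straw-man mirrors, not part of the proposal itself:

```c
#include <arm_sve.h>

/* a = b + c, without knowing the vector width at compile time:
   every operation is governed by the predicate (mask) pg. */
void add(float *a, const float *b, const float *c, unsigned n) {
    for (unsigned i = 0; i < n; i += svcntw()) { /* svcntw() = f32 lane count */
        svbool_t pg = svwhilelt_b32(i, n);       /* mask off out-of-range lanes */
        svfloat32_t vb = svld1_f32(pg, b + i);
        svfloat32_t vc = svld1_f32(pg, c + i);
        svst1_f32(pg, a + i, svadd_f32_m(pg, vb, vc));
    }
}
```

The same binary runs unmodified at any hardware vector width, which is exactly the property the proposed WASM type would need.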
The hard part of vectorizing code (i.e. finding work to do using vectors) would be done at compile-time, when generating loops that use the WASM dynamic-length vector instructions. The WASM runtime would "only" need to duplicate each loop body that uses such an instruction, and specialize for the "one element at a time" case when the number of elements remaining is less than what the native packed SIMD support allows. The rest of your comment goes into more detail that overlaps with what I've described above, but AFAIK it shouldn't matter to the WASM spec/generators whether the machine fixes a vector length or each vector-using loop has its own, and implementations can do either, or hardcode the vector length to 1.
|
That would be a trivial task only for trivial loops. Consider:

```c
int *A, *C;
#pragma simd
for (int i = 0; i < n; i++) {
  if (A[i] > 0) {
    C[A[i]] += 1;
  }
}
```

or this:

```c
char *s;
int n;
#pragma simd
while (*s != 0) {
  n++;
  s++;
}
```

These require a vectorizing compiler. It's not just converting scalar operations into vector ones.
The problem is: I don't see how the loop approach can work on non-trivial code.
I see another problem: |
Yes, but that would only need to exist in the compilation to WASM, e.g. in LLVM. For your second example, I imagine it'd look something like this, unvectorized:

```
start(n: i32, s: i32):
  v = load.i8 s
  c = icmp eq v, 0
  n1 = add n, 1
  s1 = add s, 1
  br c, exit(n), start(n1, s1)
exit(n: i32):
  ret n
```

At compile-time, before WASM, a vectorizing compiler could turn it into:

```
start(n: i32, s: i32):
  setvl MAXVL ; start as wide as possible
  v = vec.speculative.load.i8 s
  ; VL is now equal to the number of successful loads
  ; (e.g. if a fault happened at s + 23, VL = 23)
  exits = vec.icmp.i8 eq v, 0
  ; `exits` is a boolean/mask vector where the first non-0
  ; element is where the original loop would've exited.
  exit_count = vec.count_ones exits
  loop_count = vec.count_leading_zeros exits
  ; perform the effects of several loop iterations at once
  n_after = add n, loop_count
  s_after = add s, loop_count
  ; leave the loop if the original would've done so by now
  any_exit = icmp ne exit_count, 0
  br any_exit, exit(n_after), start(n_after, s_after)
exit(n: i32):
  ret n
```

And WASM would contain some similar operation set and equivalent semantics. A WASM implementation can choose to:

That last option could theoretically end up lowering the WASM to this (given a fixed 32-lane width):

```
start(n: i32, s: i32):
  (v, mask) = simd.speculative.load.i8x32 s
  ; `mask` contains non-0 for successful loads
  c = simd.icmp.i8x32 eq v, 0
  exits = simd.and.boolx32 c, mask
  ; `exits` is a boolean/mask vector where the first non-0
  ; element is where the original loop would've exited.
  exit_count = simd.count_ones.boolx32 exits
  loop_count = simd.count_leading_zeros.boolx32 exits
  ; perform the effects of several loop iterations at once
  n_after = add n, loop_count
  s_after = add s, loop_count
  ; leave the loop if the original would've done so by now
  any_exit = icmp ne exit_count, 0
  br any_exit, exit(n_after), start(n_after, s_after)
exit(n: i32):
  ret n
```

Note that this didn't even need the "one element at a time" copy of the loop body that I mentioned in my comment, because there's no "elements left to process" count to pass to `setvl`.

If we take a slightly more contrived example, one bounded by a `remaining` count:

```
start(n: i32, remaining: i32, s: i32):
  c = icmp eq remaining, 0
  br c, exit(n), rest
rest:
  v = load.i8 s
  c1 = icmp eq v, 0
  n1 = add n, 1
  s1 = add s, 1
  remaining1 = sub remaining, 1
  br c1, exit(n), start(n1, remaining1, s1)
exit(n: i32):
  ret n
```

And some compile-time (somewhat mediocre) vectorization:

```
start(n: i32, remaining: i32, s: i32):
  setvl remaining ; don't go past the provided limit
  v = vec.speculative.load.i8 s
  ; VL is now equal to the number of successful loads
  ; (e.g. if 23 < remaining, and a fault happened at s + 23, VL = 23)
  exits = vec.icmp.i8 eq v, 0
  ; `exits` is a boolean/mask vector where the first non-0
  ; element is where the original loop would've exited.
  exit_count = vec.count_ones exits
  loop_count = vec.count_leading_zeros exits
  ; perform the effects of several loop iterations at once
  n_after = add n, loop_count
  s_after = add s, loop_count
  remaining_after = sub remaining, loop_count
  ; leave the loop if the original would've done so by now
  exit_count_nonzero = icmp ne exit_count, 0
  remaining_after_zero = icmp eq remaining_after, 0
  any_exit = or exit_count_nonzero, remaining_after_zero
  br any_exit, exit(n_after), start(n_after, remaining_after, s_after)
exit(n: i32):
  ret n
```

Now if we have that in some WASM encoding and want to replicate that last packed-SIMD-with-speculated-load experiment, but let's say we have no masked loads, we'd end up with:

```
start(n: i32, remaining: i32, s: i32):
  c = icmp ult remaining, 32
  br c, start_scalarized(n, remaining, s), rest
rest:
  (v, mask) = simd.speculative.load.i8x32 s
  ; `mask` contains non-0 for successful loads
  c1 = simd.icmp.i8x32 eq v, 0
  exits = simd.and.boolx32 c1, mask
  ; `exits` is a boolean/mask vector where the first non-0
  ; element is where the original loop would've exited.
  exit_count = simd.count_ones.boolx32 exits
  loop_count = simd.count_leading_zeros.boolx32 exits
  ; perform the effects of several loop iterations at once
  n_after = add n, loop_count
  s_after = add s, loop_count
  remaining_after = sub remaining, loop_count
  ; leave the loop if the original would've done so by now
  exit_count_nonzero = icmp ne exit_count, 0
  remaining_after_zero = icmp eq remaining_after, 0
  any_exit = or exit_count_nonzero, remaining_after_zero
  br any_exit, exit(n_after), start(n_after, remaining_after, s_after)
; duplicated subgraph of the original CFG, with MAXVL = 1
start_scalarized(n: i32, remaining: i32, s: i32):
  vl_0 = icmp eq remaining, 0
  br vl_0, start_scalarized_vl_0, start_scalarized_vl_1
; further duplication, for VL = 1 vs VL = 0
start_scalarized_vl_1:
  v = load.i8 s
  exits = icmp.i8 eq v, 0
  exit_count = cast exits as i32
  loop_count = sub 1, exit_count
  n_after = add n, loop_count
  s_after = add s, loop_count
  remaining_after = sub remaining, loop_count
  remaining_after_zero = icmp eq remaining_after, 0
  any_exit = or exits, remaining_after_zero
  br any_exit, exit(n_after), start_scalarized(n_after, remaining_after, s_after)
; this is weird only because of re-scalarizing previously-vectorized WASM
; (the original would go to `exit` directly, without `start_scalarized_vl_0`)
; while this can optimize to `goto exit`, I left some of it around for clarity
start_scalarized_vl_0:
  exit_count = 0
  loop_count = 0
  n_after = n
  s_after = s
  remaining_after = remaining ; 0
  exit_count_nonzero = false
  remaining_after_zero = icmp eq remaining_after, 0 ; true
  any_exit = or exit_count_nonzero, remaining_after_zero ; true
  br any_exit, exit(n_after), start_scalarized(n_after, remaining_after, s_after)
exit(n: i32):
  ret n
```

As you can see, it's certainly plausible, but not fun at all, and could generate inefficient code. |
You didn't understand me, but that's fine as I realized I didn't understand you.
I had the impression you had something similar to this in mind. But apparently you didn't.
Unless…, there is no fixed-width SIMD architecture that needs speculative loads: as long as the accesses are aligned, you will always load at most one cache line at a time. Also, the translation to a fixed-width SIMD ISA would not really be different from the translation to SVE. I think the…
I don't see where you are trying to go. The code you tried to show would definitely not be the one generated from the previous code, for any architecture. |
More semantics-oriented vector types, which allow a loop to be specialized either to short or to wide vectors (depending on what the hardware provides), or to flexible ones (i.e. hardware-dependent width) like ARM SVE, would really make this nice to use, and make the code even more portable and future-proof, I think. Even if the hardware doesn't support scalable vectors, the compiler would know the widest available vector that can support all the used operations, and use it in the generated code. The big piece of it is to abstract the width of a vector, i.e. provide reduce operations (see the sketch below). I think the design should really focus on AVX2/AVX-512 and SVE in principle (and provide efficient fallbacks for "legacy" SIMD), even if these are not widely available yet. That will change dramatically in a few years, and it is important for the Web. So for example, it should use masking extensively, conflict detection, etc. I understand wide vectors and non-fixed vectors are two different things, but again, I think @lemaitre makes a lot of good points. |
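As a minimal sketch of what such a width-abstracted reduction could look like, here is element summation written against hypothetical `vec_*` intrinsics (none of these names exist in any proposal; `set_vec_length` is assumed to clamp the active width to the remaining element count, and `vec_fredsum` to fold all lanes into a scalar):

```c
/* Sum n floats without knowing the vector width at compile time.
   All vec_* names are hypothetical, for illustration only. */
float sum(const float *x, unsigned n) {
    vec_f32 acc = vec_fzero();             /* all lanes = 0.0f */
    while (n > 0) {
        unsigned count = n;
        set_vec_length(&count);            /* count = min(n, hardware width) */
        acc = vec_fadd(acc, vec_fload(x)); /* inactive lanes left unchanged */
        x += count;
        n -= count;
    }
    return vec_fredsum(acc);               /* horizontal add of all lanes */
}
```

Note that the reduction order differs from the scalar loop's, which is one of the sources of non-determinism mentioned earlier in the thread.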
Are vector types / APIs that work for both packed vector types and Cray vectors available in any programming language, hardware, etc.? AFAIK designing a proper "API" / "instruction set" for this that can be re-targeted to hardware supporting either packed or Cray (or both) vector types is an open research problem.
For all we know right now, ARM SVE might die next year, and RISC-V vector extensions might never become mainstream, or worse, they might release a slightly different future revision of those extensions that does become mainstream but is incompatible with anything that we can imagine right now. The… |
A slightly different way to look at it: flexible-width vector operations can be thought of as a way to provide compatibility between platforms that have different SIMD widths, as long as the operations stay within the "common denominator" set. The supported vector length would vary based on the hardware you are running on, while the operations would stay the same. I think it would also reduce the WASM instruction count for the same kernel, and probably improve performance, as memory bookkeeping code would move to the runtime. There is a relatively short post by Patterson and Waterman covering this for native code, but it would apply here as well. |
Disclaimer: I tried to write up a longer version, but it ended up being maybe a bit too long, so my apologies to those who try to read this. In short, I think virtual vector operations can map to packed SIMD with some extensions, so hardware vector support is not needed to support this idea in WASM. I'm going to try to illustrate this with element-wise addition of two vectors. There are more interesting examples, but this one is still representative of the challenges of producing efficient data-parallel code. One can implement SIMD vector addition like this:

```c
// a = b + c
// LEN is the number of SIMD lanes
void add(float *a, float *b, float *c, unsigned sz) {
  unsigned i;
  for (i = 0; i < sz / LEN; ++i) {
    simdstore(a, simdfadd(simdload(b), simdload(c)));
    a += LEN;
    b += LEN;
    c += LEN;
  }
  for (i = 0; i < sz % LEN; ++i) {
    *a = *b + *c;
    ++a;
    ++b;
    ++c;
  }
}
```

This lowers to hardware SIMD instructions that, in WASM's case, represent a portable subset between major architectures. Note the second, scalar, loop - it seems like not much, but it can have detrimental effects on performance if the function is called repeatedly. Flexible-width vectors without partial-width operations would work similarly to packed SIMD, but with the vector size set at runtime. Let's say `enable_vec_ops()` returns the runtime vector length:

```c
// a = b + c
// Use runtime-defined vector size, without partial-width vector operations
void add(float *a, float *b, float *c, unsigned sz) {
  unsigned i;
  unsigned len = enable_vec_ops();
  for (i = 0; i < sz / len; ++i) {
    vecstore(a, vecfadd(vecload(b), vecload(c)));
    a += len;
    b += len;
    c += len;
  }
  for (i = 0; i < sz % len; ++i) {
    *a = *b + *c;
    ++a;
    ++b;
    ++c;
  }
}
```

This would map to hardware operations within an "extended" portable subset - the same lane-wise operations, but the width can be different on different platforms. Virtual instructions would lower to packed SIMD hardware instructions. The benefit of this approach is limited to enabling longer operations on platforms that support them without degrading performance on platforms that don't. The scalar loop is still there, for example. What is meant by 'flexible-width' vector instructions would usually include a mechanism to eliminate the second loop, by tucking it into a computation with fewer lanes. Suppose `set_vec_length` clamps the requested element count to the supported width, writing the result back:

```c
// a = b + c
// Maximum vector size is a runtime constant;
// set the vector size dynamically between 1 element and that value
void add(float *a, float *b, float *c, unsigned sz) {
  while (sz > 0) {
    unsigned count = sz;
    // Set width to min of count and supported width;
    // the result is written back to `count`
    set_vec_length(&count);
    vecstore(a, vecfadd(vecload(b), vecload(c)));
    sz -= count;
    a += count;
    b += count;
    c += count;
  }
}
```

Vector instructions here should lower to packed SIMD or vector hardware instructions (if the latter are available). This approach should reduce both the hardware instruction count and the WASM instruction count, and reduce the amount of branching in WASM. I think it is the more beneficial approach, as it would allow simplifying code generation and reduce instruction count even if it failed to implement partial-width operations in one pass. The gotcha is that some platforms support SIMD operation masking and some don't - in the worst case those would be emulated using a scalar loop, which is equivalent to the current situation. |
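For what it's worth, on x86 the masked tail need not be a scalar loop for 32-bit lanes: AVX already has masked loads and stores. A minimal sketch of how an engine could lower the final partial-width step of `a = b + c` on AVX2 (the function name and shape are mine, not from any proposal):

```c
#include <immintrin.h>

/* Handle the last `count` (< 8) floats of a = b + c with masked
   loads/stores instead of a scalar loop. */
static void add_tail_avx2(float *a, const float *b, const float *c, int count) {
    /* Lane i is active iff i < count: compare count against 0..7. */
    const __m256i iota = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    const __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(count), iota);
    const __m256 vb = _mm256_maskload_ps(b, mask);   /* inactive lanes read 0 */
    const __m256 vc = _mm256_maskload_ps(c, mask);
    _mm256_maskstore_ps(a, mask, _mm256_add_ps(vb, vc)); /* inactive lanes untouched */
}
```

On SSE-only hardware, though, the scalar fallback (or a one-lane `set_vec_length`) really is the likely lowering.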
@penzn Yes, essentially something like this. However, in reality the last loop will probably be done using masks, and the first iteration of the first loop will also be done using masks, to account for possible misalignment. The count or length will usually be known to the compiler only at runtime, and will affect the machine code specialization. For SVE / SVE2, it might even end up as a non-constant at runtime, fully known only by the hardware (the compiler can find it out, but doesn't need to). Worth reading: https://arxiv.org/pdf/1803.06185.pdf http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4008.pdf GCC supports it reasonably well. LLVM support is out of tree at the moment, pending merging. |
To see what adding this to Wasm might look like, I've put a slide deck together (source). There are a few projects providing flexible SIMD to native applications, for example Highway. It shows how runtime resolution of SIMD width can look. |
@jan-wassenberg, what do you think about this? |
Thanks for reaching out! I like the direction, IMO it's important to have unknown-length types for two reasons:
This blurs the line between "packed" and "flexible" - in both cases, the app doesn't know the width. If I understand a previous comment correctly ("flexible-width vector operations can be thought of as a way to provide compatibility between platforms that have different SIMD width"), that's also advocating 1).
We're close to this now with Highway (thanks for linking it above) - the API supports packed vectors of… For 1), Highway shows that most operations can be defined so that they are efficient on both AVX2/512 and SVE. I'm a bit concerned about the use of set_vec_length/masks for handling the last lanes. This is great for SVE, but less so for packed SIMD. I agree scalar loops aren't awesome, but perhaps there is an alternative. However, I agree it can make sense to use smaller vectors than the maximum supported, e.g. no more than 8 lanes for an 8x8 DCT. If set_vec_length is limited to powers of two and the runtime can also use smaller lengths, perhaps we're saying the same thing? |
Thanks @jan-wassenberg. This is a pretty good description of the challenges going forward; closing this issue in favor of #7. |
Hello everyone,
Maybe it's a bit soon to start talking about long vectors when short vectors aren't specified yet, but I wanted to share my point of view on the subject.
Fixed-width SIMD would be a nightmare to handle portably, as not all architectures have the same SIMD width.
For instance, AVX provides 256-bit registers, and AVX-512 provides 512-bit registers.
Actually, ARM presented its new SIMD ISA a while ago: the Scalable Vector Extension (SVE).
The idea behind SVE is that the size of the registers is not known at compilation time and is hardware dependent: the same binary is able to run efficiently on different hardware with different SIMD widths.
I think that is exactly the abstraction we need for WASM to handle long vectors.
The SIMD width would be a constant that is not known at compile time (when the WASM code is generated), but known when the VM runs the code.
What do you think?