Description
Background and motivation
When implementing vectorized algorithms using the new Vector128 (and related) APIs, Vector128.Shuffle is a case where one sometimes has to drop back down to architecture-specific instructions.
Vector128.Shuffle guarantees that results for out-of-range indices (values above 15) are normalized to 0.
If the JIT can't prove that indices are already in the [0, 15] range, it is forced to emit extra logic to do the normalization.
For some algorithms, you may also want faster logic that relies on the behavior of the underlying intrinsics native to the platform (e.g. X86's PSHUFB treats indices MOD 16, and sets the result to 0 if the high bit of the index is set).
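To make the difference concrete, the following is a minimal scalar sketch of the two behaviors for byte elements. The class and method names (ShuffleSemantics, PortableShuffle, Pshufb) are hypothetical and exist only for illustration; they model the documented semantics and are not how the JIT or the hardware implements them.

```csharp
using System;

public static class ShuffleSemantics
{
    // Models Vector128.Shuffle for byte elements:
    // any index above 15 produces 0.
    public static byte[] PortableShuffle(byte[] vector, byte[] indices)
    {
        var result = new byte[16];
        for (int i = 0; i < 16; i++)
        {
            byte index = indices[i];
            result[i] = index < 16 ? vector[index] : (byte)0;
        }
        return result;
    }

    // Models x86 PSHUFB: if the high bit of the index is set the result is 0,
    // otherwise only the low four bits of the index are used (index MOD 16).
    public static byte[] Pshufb(byte[] vector, byte[] indices)
    {
        var result = new byte[16];
        for (int i = 0; i < 16; i++)
        {
            byte index = indices[i];
            result[i] = (index & 0x80) != 0 ? (byte)0 : vector[index & 0x0F];
        }
        return result;
    }
}
```

For an index of 0x12, PortableShuffle yields 0 while Pshufb yields vector[2]; for indices in the [0, 15] range, or with the high bit set, the two models agree.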
The solution in both cases is to use the architecture-specific instruction directly, often writing a helper such as
private static Vector128<byte> Shuffle(Vector128<byte> vector, Vector128<byte> indices)
{
return Ssse3.IsSupported
? Ssse3.Shuffle(vector, indices)
: AdvSimd.Arm64.VectorTableLookup(vector, indices);
}
(We already have 5 such helpers in runtime)
I propose adding a Vector128.ShuffleNative helper that does not guarantee identical behavior for out-of-range indices across platforms. Instead, it would be defined as using one of the underlying intrinsics directly when available, to produce faster code. Code that strictly needs a particular platform's behavior would still need to use the platform-specific intrinsics directly.
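As a rough illustration of the intended shape, here is a sketch of what such a helper could look like for byte elements when written in user code today. The name ShuffleNativeSketch is hypothetical, and the portable fallback to Vector128.Shuffle is an assumption of this sketch rather than part of the proposal; an actual implementation would be a JIT intrinsic.

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

public static class ShuffleHelpers
{
    // Hypothetical user-code approximation of the proposed API.
    // Results for out-of-range indices differ between the branches.
    public static Vector128<byte> ShuffleNativeSketch(Vector128<byte> vector, Vector128<byte> indices)
    {
        if (Ssse3.IsSupported)
        {
            // PSHUFB: index MOD 16, or 0 if the high bit is set.
            return Ssse3.Shuffle(vector, indices);
        }

        if (AdvSimd.Arm64.IsSupported)
        {
            // TBL: indices above 15 produce 0.
            return AdvSimd.Arm64.VectorTableLookup(vector, indices);
        }

        // Assumed fallback: the portable API, which normalizes
        // out-of-range indices to 0.
        return Vector128.Shuffle(vector, indices);
    }
}
```

Callers that only ever pass indices in the [0, 15] range get the same result from every branch, which is the case this kind of helper targets.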
API Proposal
namespace System.Runtime.Intrinsics;
public static class Vector128
{
public static Vector128<byte> ShuffleNative(Vector128<byte> vector, Vector128<byte> indices);
public static Vector128<sbyte> ShuffleNative(Vector128<sbyte> vector, Vector128<sbyte> indices);
public static Vector128<short> ShuffleNative(Vector128<short> vector, Vector128<short> indices);
public static Vector128<ushort> ShuffleNative(Vector128<ushort> vector, Vector128<ushort> indices);
public static Vector128<int> ShuffleNative(Vector128<int> vector, Vector128<int> indices);
public static Vector128<uint> ShuffleNative(Vector128<uint> vector, Vector128<uint> indices);
public static Vector128<long> ShuffleNative(Vector128<long> vector, Vector128<long> indices);
public static Vector128<ulong> ShuffleNative(Vector128<ulong> vector, Vector128<ulong> indices);
public static Vector128<float> ShuffleNative(Vector128<float> vector, Vector128<int> indices);
public static Vector128<double> ShuffleNative(Vector128<double> vector, Vector128<long> indices);
}
public static class Vector256
{
public static Vector256<byte> ShuffleNative(Vector256<byte> vector, Vector256<byte> indices);
public static Vector256<sbyte> ShuffleNative(Vector256<sbyte> vector, Vector256<sbyte> indices);
public static Vector256<short> ShuffleNative(Vector256<short> vector, Vector256<short> indices);
public static Vector256<ushort> ShuffleNative(Vector256<ushort> vector, Vector256<ushort> indices);
public static Vector256<int> ShuffleNative(Vector256<int> vector, Vector256<int> indices);
public static Vector256<uint> ShuffleNative(Vector256<uint> vector, Vector256<uint> indices);
public static Vector256<long> ShuffleNative(Vector256<long> vector, Vector256<long> indices);
public static Vector256<ulong> ShuffleNative(Vector256<ulong> vector, Vector256<ulong> indices);
public static Vector256<float> ShuffleNative(Vector256<float> vector, Vector256<int> indices);
public static Vector256<double> ShuffleNative(Vector256<double> vector, Vector256<long> indices);
}
public static class Vector512
{
public static Vector512<byte> ShuffleNative(Vector512<byte> vector, Vector512<byte> indices);
public static Vector512<sbyte> ShuffleNative(Vector512<sbyte> vector, Vector512<sbyte> indices);
public static Vector512<short> ShuffleNative(Vector512<short> vector, Vector512<short> indices);
public static Vector512<ushort> ShuffleNative(Vector512<ushort> vector, Vector512<ushort> indices);
public static Vector512<int> ShuffleNative(Vector512<int> vector, Vector512<int> indices);
public static Vector512<uint> ShuffleNative(Vector512<uint> vector, Vector512<uint> indices);
public static Vector512<long> ShuffleNative(Vector512<long> vector, Vector512<long> indices);
public static Vector512<ulong> ShuffleNative(Vector512<ulong> vector, Vector512<ulong> indices);
public static Vector512<float> ShuffleNative(Vector512<float> vector, Vector512<int> indices);
public static Vector512<double> ShuffleNative(Vector512<double> vector, Vector512<long> indices);
}
public static class Vector64
{
public static Vector64<byte> ShuffleNative(Vector64<byte> vector, Vector64<byte> indices);
public static Vector64<sbyte> ShuffleNative(Vector64<sbyte> vector, Vector64<sbyte> indices);
public static Vector64<short> ShuffleNative(Vector64<short> vector, Vector64<short> indices);
public static Vector64<ushort> ShuffleNative(Vector64<ushort> vector, Vector64<ushort> indices);
public static Vector64<int> ShuffleNative(Vector64<int> vector, Vector64<int> indices);
public static Vector64<uint> ShuffleNative(Vector64<uint> vector, Vector64<uint> indices);
public static Vector64<long> ShuffleNative(Vector64<long> vector, Vector64<long> indices);
public static Vector64<ulong> ShuffleNative(Vector64<ulong> vector, Vector64<ulong> indices);
public static Vector64<float> ShuffleNative(Vector64<float> vector, Vector64<int> indices);
public static Vector64<double> ShuffleNative(Vector64<double> vector, Vector64<long> indices);
}
API Usage
-Vector128<byte> bitMask = Shuffle(bitmapLookup, lowNibbles);
-Vector128<byte> bitPositions = Shuffle(Vector128.Create(0x8040201008040201).AsByte(), highNibbles);
+Vector128<byte> bitMask = Vector128.ShuffleNative(bitmapLookup, lowNibbles);
+Vector128<byte> bitPositions = Vector128.ShuffleNative(Vector128.Create(0x8040201008040201).AsByte(), highNibbles);
-private static Vector128<byte> Shuffle(Vector128<byte> vector, Vector128<byte> indices)
-{
- return Ssse3.IsSupported
- ? Ssse3.Shuffle(vector, indices)
- : AdvSimd.Arm64.VectorTableLookup(vector, indices);
-}
Alternative Designs
No response
Risks
No response