Introduce Subgroup Operations Extension #954
Conversation
This looks really good. Some things that I think need to be stated for the record:
Some additional questions:
Added to the PR explainer.
It is now exposed on both host and device; the earlier arrangement was a mistake stemming from a misreading of the DirectX 12 spec. It turns out that WaveLaneCountMax is just a reserved name.
Explained in the previous question.
That's how the Metal API exposes the subgroup size. One could assume that the subgroup size should be obtainable by creating a compute pipeline, but if a device (maybe in the near future) can execute different kernels in different regions with different subgroup sizes, depending on the kernel's requirements or power preference (which would give a rationale for Metal's choice), then that could be a wrong assumption.
Essentially, HLSL lacks the shuffle and relative-shuffle operations, and MSL lacks the all-equal operations. Removing them leaves us with this subset.
Uniformity analysis could be delegated to another investigation that will also clarify uniformity for derivative functions. The related spec language from DirectX 12: "A set of lanes (threads) executed simultaneously in the processor. No explicit barriers are required to guarantee that they execute in parallel. Similar concepts include 'warp' and 'wavefront.'"
I also think that can be better.
I think that can also be delegated to another investigation, but using these functions without the extension should be banned and reported to the user at pipeline-creation time. Allowing the presence of such functions in places that won't be executed can be an implementation detail, but giving guarantees about dynamic behavior when rejecting on these predicates might be impossible.
It seems like, in the context of compute kernels, the concept of helper threads might be irrelevant. Related specification from Metal:
Subgroup size can be a built-in decoration but I think the other two should be preserved as functions.
That's correct, fixed.
Undefined behavior.
That function doesn't exist in HLSL but can be easily emulated. Should we include it?
Grouping these two together into one extension reduces the potential of both operation groups. First, where quad operations exist, the extent of support for subgroup operations isn't always large enough to accommodate the reduction operations proposed here. And where subgroup operations exist, there are cases where support is compute-only but can accommodate the set of operations proposed here. Quad operations, where supported, definitely have shuffle operations. If we were to split the subgroup operations such that one part contains quad operations, we would end up with three different extensions, and the discussion of undefined functions would get more nuanced.
wgsl/index.bs
<tr><td>Subgroup built-in functions<td>SPIR-V
</thead>
<tr><td>subgroup_size() -> u32<td>SubgroupSize
<tr><td>subgroup_local_index() -> u32<td>SubgroupLocalInvocationId
These two appear to be builtins in SPIR-V, not function calls.
Moved, but I am not sure whether it could be organized more tidily.
wgsl/index.bs
<tr><td>subgroup_or(*T*) -> *T*<td>OpGroupNonUniformBitwiseOr
<tr><td>subgroup_xor(*T*) -> *T*<td>OpGroupNonUniformBitwiseXor
<tr><td>subgroup_prefix_add(*T*) -> *T*<td>OpGroupNonUniformIAdd or OpGroupNonUniformFAdd with ExclusiveScan
<tr><td>subgroup_prefix_mul(*T*) -> *T*<td>OpGroupNonUniformIMul or OpGroupNonUniformFMul with ExclusiveScan
For all these where the param is T, what are the possible values of T?
I have tried to explicitly state the possible types; does it look OK now?
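For illustration, a minimal hedged WGSL sketch of how the proposed functions might be called. The concrete types chosen (u32, f32) are just examples of possible values of T; the function names follow this PR's tables, and everything else is assumed:

```wgsl
// Sketch only: assumes the functions proposed in this PR are available
// under the subgroup operations extension.
fn sketch(bits: u32, v: f32) -> f32 {
  let lane: u32 = subgroup_local_index();    // index of this invocation in its subgroup
  let size: u32 = subgroup_size();           // number of invocations in the subgroup
  let all_bits: u32 = subgroup_or(bits);     // bitwise-OR reduction, T = u32
  let prefix: f32 = subgroup_prefix_add(v);  // exclusive prefix sum, T = f32
  return prefix + f32(all_bits) + f32(lane) / f32(size);
}
```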
In general I think the subgroup extensions should be done in more piecemeal chunks, because not all hardware supports all subgroup functionality. We don't need to match the segmentation of other APIs perfectly (because some amount of emulation is possible), but AFAIK things like quad ops aren't supported everywhere and aren't possible to emulate. Also, shouldn't the subgroup size be a constant for a given device?
Imagination has announced hardware with a SIMD width of 128 (yeah, that's huge, I agree).
That's exactly why I excluded quad operations, but I think the proposed subset of subgroup operations here makes the most out of existing API surfaces while considering the mobile hardware coming to market soon. My original idea was to break this PR into two extensions:
In the WWDC20 slides, Apple mentions that applications should rely on limits where they are exposed for a successful transition to Apple Silicon. But again, I am not sure why the subgroup size was exposed on the compute pipeline in Metal, though there may be valid reasons.
Reverted to vec4
Vulkan extension mentioned today:
Following up on the call today: I asked whether, on Vulkan/Intel, we can ask the driver what subgroup size a particular compute pipeline has. If we can do this, a Metal-like threadExecutionWidth query becomes possible. Apparently, there is an extension in Vulkan that exposes this information: VK_KHR_pipeline_executable_properties. It's relatively fresh, and support on Windows Vulkan/Intel only shows up in a few reports, but timeline-wise it looks like it can be there by the time we ship.
Regarding:
This is a link to an old slide deck that predates the implementation of Wave intrinsic support in SM 6.0+. Quad operations are in fact supported in compute shaders on DX in SM 6.0+; see Quad-Wide Shuffle Operations. I see further reasoning based on the assumption that quad operations are not supported everywhere subgroup operations are supported (under compute). I don't know if that's entirely based on the assumption about DX, but if it is, further rethinking may be necessary.
I also took Adreno and PowerVR reports into account; they don't have quad support. But the DirectX assumption was a misinterpretation on the investigation's side.
Ok, here's a small investigation to help with the discussion of fixed vs. variable size subgroups.

Investigation on subgroup sizes for WebGPU

Getting adapter/device information about the size of subgroups

D3D12 has the WaveLaneCountMin query (WaveLaneCountMax is only a reserved placeholder). Metal doesn't have an adapter/device query for the subgroup sizes. Vulkan only has subgroupSize in VkPhysicalDeviceSubgroupProperties.

Getting pipeline information about the size of subgroups

D3D12 doesn't have anything. Metal exposes the actual subgroup size via threadExecutionWidth on the compute pipeline. Vulkan doesn't have anything in core, but VK_KHR_pipeline_executable_properties can expose it.

Devices with variable subgroup size and conclusion

Extracting data from vulkan.gpuinfo.org, most devices have a fixed subgroup size, except Intel GPUs (8, 16, or 32) and AMD RDNA GPUs (32 or 64). I think it is safe to assume Apple GPUs have a fixed subgroup size like Imagination GPUs, but it would be nice to have confirmation.

The investigation above shows that we cannot get a per-device/adapter fixed subgroup size because it's not available on Metal. It also shows we cannot get a per-pipeline subgroup size because it's not available on D3D12 (although it could be on Vulkan, depending on extensions). So maybe in a first step we could have subgroups without any API-side query as to what the subgroup size is.

Future possibility

See the section below for why fixed subgroup sizes would be useful; I think it is crucial for WebGPU to have them. What's nice is that almost all GPUs have a fixed subgroup size, except the Intel and AMD RDNA configurations noted above.
On D3D12 and Vulkan the API / extensions allow us to know if the subgroup size is fixed by comparing the min and max size, and on Metal we can detect with the vendor/device ID (or GPU family) whether the subgroup size is fixed (and get the size by compiling a dummy pipeline). So I think we could have an extension for fixed subgroup size available everywhere except the configurations listed above. Then if D3D12 allowed controlling the subgroup size (feature request, wink wink), only Metal Intel and AMD RDNA wouldn't have the extension. That's probably ok.

Why we want a fixed subgroup size

Variable-size subgroups allow for some tricks like scalarization to trade SGPR vs. VGPR, but advanced compute algorithms need more control to reach top performance, and to be correct at all. I was sitting next to the Spinel team, which is doing path rasterization using GPU compute in a way that's different from, but not 100% unlike, this Raph Levien blog post. The pipeline is 100% compute based and uses a ton of subgroup operations for performance. I asked one of their engineers why they need fixed subgroup sizes and paraphrased below:
Note that Spinel is just one heavy user of subgroups, but there are many more. For example, the state-of-the-art prefix-sum algorithm "Single-pass Parallel Prefix Scan with Decoupled Look-back" uses subgroups and needs the subgroup size to size arrays in shared memory (see the sketch below). There are many more examples.

Other things to figure out.
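Returning to the shared-memory sizing point above: a minimal, hypothetical WGSL sketch of a workgroup-wide exclusive prefix sum built on this PR's subgroup_prefix_add. The fixed subgroup size of 32, the workgroup size of 64, and the helper name are illustrative assumptions, not part of the PR:

```wgsl
// Assumes a fixed subgroup size of 32. The scratch array holds one partial
// sum per subgroup, which is why a compile-time-known size is required.
var<workgroup> partials: array<f32, 2>; // = workgroup size 64 / subgroup size 32

fn workgroup_prefix_add(local_index: u32, x: f32) -> f32 {
  let scan: f32 = subgroup_prefix_add(x);  // exclusive scan within the subgroup
  let sub_id: u32 = local_index / 32u;     // which subgroup this invocation is in
  if (subgroup_local_index() == 31u) {
    partials[sub_id] = scan + x;           // last lane stores the subgroup total
  }
  workgroupBarrier();
  var offset: f32 = 0.0;
  for (var i: u32 = 0u; i < sub_id; i = i + 1u) {
    offset = offset + partials[i];         // add totals of preceding subgroups
  }
  return offset + scan;
}
```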
I want to comment on a few points:
It is important to note that the max variant (WaveLaneCountMax) is just a placeholder according to the specification.
I think emulation can kill the benefits on hardware that doesn't have reduction operations (and shuffle and relative-shuffle don't exist in DirectX's HLSL). Taking emulation far enough to make single-code-path variants possible can go all the way to exposing a subgroup size of one on hardware not fit enough to present the extension. I think that's not a fit for WebGPU, and it could set a malformed precedent that makes things extra complicated for implementers. In such an environment, the user can fall back to workgroup-shared variants of their algorithms instead of relying on WebGPU's emulation. (Though I think Metal's
Also, I'd like to ask the group's opinion on using
Subgroups are an awesome feature. But they have very big caveats, particularly in how graphics APIs support them. There's a distinction between a uniform subgroup model and a non-uniform subgroup model.

Uniform subgroup model

The cl_khr_subgroups extension to OpenCL C 2.0 (and later core OpenCL 2.1) uses a uniform subgroup model: each subgroup operation must be collectively executed by all the invocations in the subgroup. If not all invocations participate, then it's undefined behaviour.

Nonuniform subgroup model

Vulkan, D3D, and Metal use a non-uniform model: not all invocations in a subgroup have to execute a subgroup operation.

Ambiguities about divergence

When invocations in a subgroup are executing "together", how long is that guaranteed? E.g. a subgroup barrier causes invocations in a subgroup to wait for each other before any can continue executing. They are subgroup-uniform when they leave the barrier. But how long does that last? Certainly it's broken by control flow where different invocations take different paths. But can it be broken sooner than that?

Ambiguities about reconvergence

Results of subgroup operations are affected by which invocations are executing together (or by whether you even have undefined behaviour). But once invocations diverge, what are the guarantees about where you reconverge? Vulkan/SPIR-V has extremely weak guarantees: you either have full workgroup uniformity, or you don't. Getting back to full workgroup uniformity requires all invocations in the workgroup to exit a structured control flow construct (to its merge block) that the whole workgroup had collectively entered. There is no rule for finer-grained reconvergence. I don't think D3D has anything stronger than that. I don't know enough about MSL here.

Ambiguities about forward progress

Subgroups introduce a question of forward progress:

D3D, Metal, and Vulkan are silent on both of these. This leads to non-portability. This interacts very strongly with atomics (and loops).

Ambiguities about helper invocations

In a fragment shader, some invocations could be helper invocations, or could have been converted to one by a (D3D-style) discard. Do those helper invocations participate in subgroup operations? I believe Vulkan/SPIR-V is silent on this, and there may be different behaviours. MSL says that helper invocations are not "active". I don't know enough about D3D to say.

Summary

Subgroups have many sharp corners for introducing ambiguity, non-portability, and undefined behaviour. Do the target APIs have sufficiently tight rules to allow a good, portable, and reliable subgroup feature?
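To make the divergence concern concrete, here is a small hedged WGSL sketch using this PR's function names; under a nonuniform model, only the invocations inside the branch participate in the reduction, and the guarantees after the branch are exactly what is ambiguous:

```wgsl
// Sketch only: behaviour at and after the branch is the ambiguity at issue.
fn divergence_example(bits: u32) -> u32 {
  var result: u32 = 0u;
  if (subgroup_local_index() < 4u) {
    // Under the nonuniform model, only lanes 0..3 are active here, so the
    // OR reduction is over those lanes only. Under the uniform model this
    // is undefined behaviour, since not all lanes participate.
    result = subgroup_or(bits);
  }
  // Where and when the lanes reconverge after the if, and whether later
  // subgroup operations see the full subgroup again, is what the APIs
  // leave underspecified.
  return result;
}
```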
WebGPU telecon today approved the API side of this PR. The shading-language side of this PR still needs to be approved.
Based on the sample I built for the W3C Machine Learning Workshop, I have compared execution speed against atomic alternatives (since atomics do not support floating-point operations, numerical loss happens with them) and shared-memory alternatives. It turns out the SIMD version, which uses only the set provided in this PR, beats the others by at least 2x on both Intel and Nvidia hardware. Such a difference in execution time can make a real impact on exploratory data analysis applications, and potentially on any application that hopes to run on the GPU in a portable setting (to avoid battery drain and heat). This is consistent with the findings of state-of-the-art particle methods which, as mentioned in this slide deck, gain ~10x speed increases with SIMD operations.
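For reference, a hedged WGSL sketch of the SIMD-style pattern being compared. Here subgroup_add is a hypothetical floating-point add reduction analogous to the subgroup_or/subgroup_xor rows above; it is not in the table excerpts in this thread, and the surrounding structure is assumed, not taken from the sample:

```wgsl
// Sketch only: one subgroup_add replaces a shared-memory tree or a loop of
// atomic adds, producing the subgroup's sum in a single operation.
fn partial_sum(value: f32) -> f32 {
  let sum: f32 = subgroup_add(value);  // hypothetical full add reduction
  // Only the first lane of each subgroup would write the partial sum out,
  // e.g. to one element per subgroup in a storage buffer.
  if (subgroup_local_index() == 0u) {
    return sum;
  }
  return 0.0;
}
```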
I transcribed @dneto0's above example to Metal: Convergence.zip
And here it is transcribed to D3D12: Convergence.zip
@litherum did you run them on Metal and D3D12? Did you get results similar to Vulkan, where it diverges?
I did some data gathering:
"✅" means the output was "255 240 240 240". |
@litherum I didn't hit the hang issue on Intel(R) HD Graphics 520 with driver 27.20.100.8587 on Windows. The result is 255 240 240 240.
Which card is it? We'd like to investigate further.
@litherum you probably have the latest OEM Intel graphics drivers for your system. To get newer drivers you can try the Intel driver assistant, but if it says something like 'you have OEM drivers that can't be updated by the assistant', I think you can download a specific driver package, probably this one. It might disable some hardware integration (like certain display-panel-specific features).
Like Kai said, you need to download a specific driver package from here, but it may refuse to upgrade the driver. The solution is to uninstall the driver via Device Manager, then install the downloaded Intel driver package.
You may find the detailed instructions to install Intel Graphics Driver on OEM devices at https://docs.google.com/document/d/1Fr5hi6BqlLVaJJoZEN7sGjukF4kM2qOAFb8mtbYx1Fo/edit#heading=h.4rbfm5zbtbyd
Discussed at the 2020-09-29 meeting. |
FYI. Nicolai Hähnle will be presenting at the LLVM Dev Meeting on Thursday (October 8). "Evolving “convergent”: Lessons from Control Flow in AMDGPU"
I've updated #954 (comment) to include the Apple M1 GPU. |
My example was run on an NVIDIA Quadro P1000 |
Since this PR's branch became orphaned, the discussion is moving to PR #1459.
* Plan api,operation,memory_sync,texture,*
* Address review feedback
* formatting
Moved to #1459
Preview WebGPU Changes: https://mehmetoguzderin.github.io/webgpu/webgpu.html
Preview WGSL Changes: https://mehmetoguzderin.github.io/webgpu/wgsl.html
Preview Argdown: https://kvark.github.io/webgpu-debate/SubgroupOps.component.html
This pull request works towards #667 for the standard library. To that end, it introduces the first form of a subgroup operations extension to the host and device specifications. Host exposure is directly deducible for all host APIs since it is compute-only, and the set of device instructions is the greatest common factor, minus operations that take a mask or invocation index.
Motivation
Subgroup operations provide speed-ups proportional to the subgroup size. They offer a great opportunity to optimize both global and local reduction operations, especially for algorithms that need to specialize general graphs. And their presence is becoming more common than ever.
Trade-offs
Lack of Exposed Hardware Banding
Although it would be possible to significantly increase the market penetration of the subgroup operations extension by banding it into permutation and reduction tiers similar to Metal, that direction increases the API surface, possibly adding cruft for a very narrow use case. Moreover, indicators of next-generation mobile hardware show that it will almost ubiquitously support reduction operations.
Exclusion of Quad Operations
This proposal excludes quad operations from the definition of subgroup operations. New hardware reports on Adreno and PowerVR show a lack of quad support. Also, excluding quad operations makes it easier to avoid the more ambiguous operations, delegating their presence to a proper quad operations extension.
Exclusion of Indexed or Masked Operations
This proposal excludes indexed or masked operations to avoid undefined behavior on divergence, reconvergence, and possibly out-of-bounds indexing. The current set of exposed operations is implicitly active on all APIs.
Presence of Extension for APIs
* D3D12: D3D12_FEATURE_DATA_D3D12_OPTIONS1.WaveOps
* Metal: MTLDevice.supportsFamily(MTLGPUFamilyMac2) (needs clarification: MTLDevice.supportsFamily(MTLGPUFamilyApple6))
* Vulkan: (VkPhysicalDeviceSubgroupProperties.supportedOperations & (VK_SUBGROUP_FEATURE_BASIC_BIT | VK_SUBGROUP_FEATURE_VOTE_BIT | VK_SUBGROUP_FEATURE_ARITHMETIC_BIT | VK_SUBGROUP_FEATURE_BALLOT_BIT)) == (VK_SUBGROUP_FEATURE_BASIC_BIT | VK_SUBGROUP_FEATURE_VOTE_BIT | VK_SUBGROUP_FEATURE_ARITHMETIC_BIT | VK_SUBGROUP_FEATURE_BALLOT_BIT) && (VkPhysicalDeviceSubgroupProperties.supportedStages & VK_SHADER_STAGE_COMPUTE_BIT)
Related Issues
Preview | Diff