Introduce Subgroup Operations Extension #954
Conversation
This looks really good. Some things that I think need to be stated for the record:
Some additional questions:
Added to the PR explainer.
It is now exposed on both host and device; the earlier arrangement was a mistake stemming from a misreading of the DirectX 12 spec. It turns out that WaveLaneCountMax is just a reserved name.
Explained in the previous question.
That's how the Metal API exposes the subgroup size. One could assume that the subgroup size should be obtainable by creating a compute pipeline, but if a device (maybe in the near future) can execute different kernels in different regions with different subgroup sizes, depending on the kernel's requirements or power preference (which would give a rationale for Metal's choice), then that could be a wrong assumption.
Essentially, HLSL lacks the shuffle and relative-shuffle operations, and MSL lacks the all-equal operations. Removing them leaves us with this subset.
Uniformity analysis could be delegated to another investigation that will also clarify uniformity for derivative functions. The related spec language from DirectX 12: "A set of lanes (threads) executed simultaneously in the processor. No explicit barriers are required to guarantee that they execute in parallel. Similar concepts include 'warp' and 'wavefront.'"
I also think that can be better.
I think that can also be delegated to another investigation, but using these functions without the extension should be banned and reported to the user at pipeline-creation time. Allowing the presence of such functions in places that won't be executed can be an implementation detail, but giving guarantees about dynamic behavior when rejecting on these predicates might be impossible.
It seems like, in the context of compute kernels, the concept of helper threads might be irrelevant. Related specification from Metal:
Subgroup size can be a built-in decoration but I think the other two should be preserved as functions.
That's correct, fixed.
Undefined behavior.
That function doesn't exist in HLSL but can be easily emulated. Should we include it?
Grouping these two together into one extension reduces the potential of both operation groups. First, where quad operations exist, the extent of support for subgroup operations isn't always large enough to accommodate the reduction operations proposed here. And where subgroup operations exist, there are cases where support is compute-only but can accommodate the set of operations proposed here. Quad operations, where supported, definitely have shuffle operations. If we were to split the subgroup operations such that one part contains quad operations, we would end up with three different extensions, and the discussion of undefined functions would get more nuanced.
wgsl/index.bs
<tr><td>Subgroup built-in functions<td>SPIR-V
</thead>
<tr><td>subgroup_size() -> u32<td>SubgroupSize
<tr><td>subgroup_local_index() -> u32<td>SubgroupLocalInvocationId
These two appear to be builtins in SPIR-V, not function calls.
Moved, but I am not sure whether it could be organized more tidily.
wgsl/index.bs
<tr><td>subgroup_or(*T*) -> *T*<td>OpGroupNonUniformBitwiseOr
<tr><td>subgroup_xor(*T*) -> *T*<td>OpGroupNonUniformBitwiseXor
<tr><td>subgroup_prefix_add(*T*) -> *T*<td>OpGroupNonUniformIAdd or OpGroupNonUniformFAdd with ExclusiveScan
<tr><td>subgroup_prefix_mul(*T*) -> *T*<td>OpGroupNonUniformIMul or OpGroupNonUniformFMul with ExclusiveScan
For all these where the param is T, what are the possible values of T?
I have tried to explicitly state the possible types; does it look OK now?
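For illustration, a minimal hedged WGSL sketch of how the proposed functions might be called. The concrete types chosen (u32, f32) are just examples of possible values of T; the function names follow this PR's tables, and everything else is assumed:

```wgsl
// Sketch only: assumes the functions proposed in this PR are available
// under the subgroup operations extension.
fn sketch(bits: u32, v: f32) -> f32 {
  let lane: u32 = subgroup_local_index();    // index of this invocation in its subgroup
  let size: u32 = subgroup_size();           // number of invocations in the subgroup
  let all_bits: u32 = subgroup_or(bits);     // bitwise-OR reduction, T = u32
  let prefix: f32 = subgroup_prefix_add(v);  // exclusive prefix sum, T = f32
  return prefix + f32(all_bits) + f32(lane) / f32(size);
}
```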
In general I think the subgroup extensions should be done in more piecemeal chunks, because not all hardware supports all subgroup functionality. We don't need to match the segmentation of other APIs perfectly (because some amount of emulation is possible), but AFAIK things like quad ops aren't supported everywhere and aren't possible to emulate. Also, shouldn't the subgroup size be a constant for a given device?
Imagination has announced hardware with a SIMD width of 128 (yeah, that's huge, I agree).
That's exactly why I excluded quad operations, but I think the proposed subset of subgroup operations here makes the most out of existing API surfaces while considering the mobile hardware coming to market soon. My original idea was to break this PR into two extensions:
In the WWDC20 slides, Apple mentions that applications should rely on limits where they are exposed for a successful transition to Apple Silicon. But again, I am not sure why the subgroup size was exposed on the compute pipeline in Metal, though there may be valid reasons.
Reverted to vec4
Vulkan extension mentioned today:
Following up on the call today: I asked whether, on Vulkan/Intel, we can ask the driver what subgroup size a particular compute pipeline has. If we can do this, a Metal-like threadExecutionWidth query becomes possible. Apparently, there is an extension in Vulkan that exposes this information: VK_KHR_pipeline_executable_properties. It's relatively fresh, and support on Windows Vulkan/Intel only shows up in a few reports, but timeline-wise it looks like it can be there by the time we ship.
Regarding:
This is a link to an old slide deck that predates the implementation of Wave intrinsic support in SM 6.0+. Quad operations are in fact supported in compute shaders on DX in SM 6.0+; see Quad-Wide Shuffle Operations. I see further reasoning based on the assumption that quad operations are not supported everywhere subgroup operations are supported (under compute). I don't know if that's entirely based on the assumption about DX, but if it is, further rethinking may be necessary.
I also took Adreno and PowerVR reports into account; they don't have quad support. But the DirectX assumption was a misinterpretation on the investigation's side.
Ok, here's a small investigation to help with the discussion of fixed vs. variable size subgroups.

Investigation on subgroup sizes for WebGPU

Getting adapter/device information about the size of subgroups

D3D12 has the WaveLaneCountMin query (WaveLaneCountMax is only a reserved placeholder). Metal doesn't have an adapter/device query for the subgroup sizes. Vulkan only has subgroupSize in VkPhysicalDeviceSubgroupProperties.

Getting pipeline information about the size of subgroups

D3D12 doesn't have anything. Metal exposes the actual subgroup size via threadExecutionWidth on the compute pipeline. Vulkan doesn't have anything in core, but VK_KHR_pipeline_executable_properties can expose it.

Devices with variable subgroup size and conclusion

Extracting data from vulkan.gpuinfo.org, most devices have a fixed subgroup size, except Intel GPUs (8, 16, or 32) and AMD RDNA GPUs (32 or 64). I think it is safe to assume Apple GPUs have a fixed subgroup size like Imagination GPUs, but it would be nice to have confirmation.

The investigation above shows that we cannot get a per-device/adapter fixed subgroup size because it's not available on Metal. It also shows we cannot get a per-pipeline subgroup size because it's not available on D3D12 (although it could be on Vulkan, depending on extensions). So maybe in a first step we could have subgroups without any API-side query as to what the subgroup size is.

Future possibility

See the section below for why fixed subgroup sizes would be useful; I think it is crucial for WebGPU to have them. What's nice is that almost all GPUs have a fixed subgroup size, except the Intel and AMD RDNA configurations noted above.
On D3D12 and Vulkan the API / extensions allow us to know if the subgroup size is fixed by comparing the min and max size, and on Metal we can detect with the vendor/device ID (or GPU family) whether the subgroup size is fixed (and get the size by compiling a dummy pipeline). So I think we could have an extension for fixed subgroup size available everywhere except the configurations listed above. Then if D3D12 allowed controlling the subgroup size (feature request, wink wink), only Metal Intel and AMD RDNA wouldn't have the extension. That's probably ok.

Why we want a fixed subgroup size

Variable-size subgroups allow for some tricks like scalarization to trade SGPR vs. VGPR, but advanced compute algorithms need more control to reach top performance, and to be correct at all. I was sitting next to the Spinel team, which is doing path rasterization using GPU compute in a way that's different from, but not 100% unlike, this Raph Levien blog post. The pipeline is 100% compute based and uses a ton of subgroup operations for performance. I asked one of their engineers why they need fixed subgroup sizes and paraphrased below:
Note that Spinel is just one heavy user of subgroups, but there are many more. For example, the state-of-the-art prefix-sum algorithm "Single-pass Parallel Prefix Scan with Decoupled Look-back" uses subgroups and needs the subgroup size to size arrays in shared memory (see the sketch below). There are many more examples.

Other things to figure out.
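Returning to the shared-memory sizing point above: a minimal, hypothetical WGSL sketch of a workgroup-wide exclusive prefix sum built on this PR's subgroup_prefix_add. The fixed subgroup size of 32, the workgroup size of 64, and the helper name are illustrative assumptions, not part of the PR:

```wgsl
// Assumes a fixed subgroup size of 32. The scratch array holds one partial
// sum per subgroup, which is why a compile-time-known size is required.
var<workgroup> partials: array<f32, 2>; // = workgroup size 64 / subgroup size 32

fn workgroup_prefix_add(local_index: u32, x: f32) -> f32 {
  let scan: f32 = subgroup_prefix_add(x);  // exclusive scan within the subgroup
  let sub_id: u32 = local_index / 32u;     // which subgroup this invocation is in
  if (subgroup_local_index() == 31u) {
    partials[sub_id] = scan + x;           // last lane stores the subgroup total
  }
  workgroupBarrier();
  var offset: f32 = 0.0;
  for (var i: u32 = 0u; i < sub_id; i = i + 1u) {
    offset = offset + partials[i];         // add totals of preceding subgroups
  }
  return offset + scan;
}
```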
I want to comment on a few points:
It is important to note that the max variant (WaveLaneCountMax) is just a placeholder according to the specification.
I think emulation can kill the benefits on hardware that doesn't have reduction operations (and shuffle and relative-shuffle don't exist in DirectX's HLSL). Taking emulation far enough to make single-code-path variants possible can go all the way to exposing a subgroup size of one on hardware not fit enough to present the extension. I think that's not a fit for WebGPU, and it could set a malformed precedent that makes things extra complicated for implementers. In such an environment, the user can fall back to workgroup-shared variants of their algorithms instead of relying on WebGPU's emulation. (Though I think Metal's
Also, I'd like to ask the group's opinion on using
Subgroups are an awesome feature. But they have very big caveats, particularly in how graphics APIs support them. There's a distinction between a uniform subgroup model and a non-uniform subgroup model.

Uniform subgroup model

The cl_khr_subgroups extension to OpenCL C 2.0 (and later core OpenCL 2.1) uses a uniform subgroup model: each subgroup operation must be collectively executed by all the invocations in the subgroup. If not all invocations participate, then it's undefined behaviour.

Nonuniform subgroup model

Vulkan, D3D, and Metal use a non-uniform model: not all invocations in a subgroup have to execute a subgroup operation.

Ambiguities about divergence

When invocations in a subgroup are executing "together", how long is that guaranteed? E.g. a subgroup barrier causes invocations in a subgroup to wait for each other before any can continue executing. They are subgroup-uniform when they leave the barrier. But how long does that last? Certainly it's broken by control flow where different invocations take different paths. But can it be broken sooner than that?

Ambiguities about reconvergence

Results of subgroup operations are affected by which invocations are executing together (or by whether you even have undefined behaviour). But once invocations diverge, what are the guarantees about where you reconverge? Vulkan/SPIR-V has extremely weak guarantees: you either have full workgroup uniformity, or you don't. Getting back to full workgroup uniformity requires all invocations in the workgroup to exit a structured control flow construct (to its merge block) that the whole workgroup had collectively entered. There is no rule for finer-grained reconvergence. I don't think D3D has anything stronger than that. I don't know enough about MSL here.

Ambiguities about forward progress

Subgroups introduce a question of forward progress:

D3D, Metal, and Vulkan are silent on both of these. This leads to non-portability. This interacts very strongly with atomics (and loops).

Ambiguities about helper invocations

In a fragment shader, some invocations could be helper invocations, or could have been converted to one by a (D3D-style) discard. Do those helper invocations participate in subgroup operations? I believe Vulkan/SPIR-V is silent on this, and there may be different behaviours. MSL says that helper invocations are not "active". I don't know enough about D3D to say.

Summary

Subgroups have many sharp corners for introducing ambiguity, non-portability, and undefined behaviour. Do the target APIs have sufficiently tight rules to allow a good, portable, and reliable subgroup feature?
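To make the divergence concern concrete, here is a small hedged WGSL sketch using this PR's function names; under a nonuniform model, only the invocations inside the branch participate in the reduction, and the guarantees after the branch are exactly what is ambiguous:

```wgsl
// Sketch only: behaviour at and after the branch is the ambiguity at issue.
fn divergence_example(bits: u32) -> u32 {
  var result: u32 = 0u;
  if (subgroup_local_index() < 4u) {
    // Under the nonuniform model, only lanes 0..3 are active here, so the
    // OR reduction is over those lanes only. Under the uniform model this
    // is undefined behaviour, since not all lanes participate.
    result = subgroup_or(bits);
  }
  // Where and when the lanes reconverge after the if, and whether later
  // subgroup operations see the full subgroup again, is what the APIs
  // leave underspecified.
  return result;
}
```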
WebGPU telecon today approved the API side of this PR. The shading-language side of this PR still needs to be approved.
Based on the sample I built for the W3C Machine Learning Workshop, I have compared execution speed against atomic alternatives (since atomics do not support floating-point operations, numerical loss happens with them) and shared-memory alternatives. It turns out the SIMD version, which uses only the set provided in this PR, beats the others by at least 2x on both Intel and Nvidia hardware. Such a difference in execution time can make a real impact on exploratory data analysis applications, and potentially on any application that hopes to run on the GPU in a portable setting (to avoid battery drain and heat). This is consistent with the findings of state-of-the-art particle methods which, as mentioned in this slide deck, gain ~10x speed increases with SIMD operations.
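For reference, a hedged WGSL sketch of the SIMD-style pattern being compared. Here subgroup_add is a hypothetical floating-point add reduction analogous to the subgroup_or/subgroup_xor rows above; it is not in the table excerpts in this thread, and the surrounding structure is assumed, not taken from the sample:

```wgsl
// Sketch only: one subgroup_add replaces a shared-memory tree or a loop of
// atomic adds, producing the subgroup's sum in a single operation.
fn partial_sum(value: f32) -> f32 {
  let sum: f32 = subgroup_add(value);  // hypothetical full add reduction
  // Only the first lane of each subgroup would write the partial sum out,
  // e.g. to one element per subgroup in a storage buffer.
  if (subgroup_local_index() == 0u) {
    return sum;
  }
  return 0.0;
}
```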
I transcribed @dneto0's above example to Metal: Convergence.zip
And here it is transcribed to D3D12: Convergence.zip
@litherum did you run them on Metal and D3D12? Did you get results similar to Vulkan, where it diverges?
I did some data gathering:
"✅" means the output was "255 240 240 240". |
@litherum I didn't hit the hang issue on Intel(R) HD Graphics 520 with driver 27.20.100.8587 on Windows. The result is 255 240 240 240.
Which card is it? We'd like to investigate further.
@litherum you probably have the latest OEM Intel graphics drivers for your system. To get newer drivers you can try the Intel driver assistant, but if it says something like 'you have OEM drivers that can't be updated by the assistant', I think you can download a specific driver package, probably this one. It might disable some hardware integration (like certain display-panel-specific features).
Like Kai said, you need to download a specific driver package from here, but it may refuse to upgrade the driver. The solution is to uninstall the driver via Device Manager, then install the downloaded Intel driver package.
You may find the detailed instructions to install Intel Graphics Driver on OEM devices at https://docs.google.com/document/d/1Fr5hi6BqlLVaJJoZEN7sGjukF4kM2qOAFb8mtbYx1Fo/edit#heading=h.4rbfm5zbtbyd
Discussed at the 2020-09-29 meeting. |
FYI. Nicolai Hähnle will be presenting at the LLVM Dev Meeting on Thursday (October 8). "Evolving “convergent”: Lessons from Control Flow in AMDGPU"
I've updated #954 (comment) to include the Apple M1 GPU. |
My example was run on an NVIDIA Quadro P1000 |
Since this PR's branch became orphaned, the discussion is moving to PR #1459.
* Plan api,operation,memory_sync,texture,*
* Address review feedback
* formatting
Moved to #1459
Preview WebGPU Changes: https://mehmetoguzderin.github.io/webgpu/webgpu.html
Preview WGSL Changes: https://mehmetoguzderin.github.io/webgpu/wgsl.html
Preview Argdown: https://kvark.github.io/webgpu-debate/SubgroupOps.component.html
This pull request works towards #667 for the standard library. To that end, it introduces the first form of a subgroup operations extension to the host and device specifications. Host exposure is directly deducible for all host APIs since it is compute-only, and the set of device instructions is the greatest common factor, minus operations that take a mask or invocation index.
Motivation
Subgroup operations provide speed-ups proportional to the subgroup size. They offer a great opportunity to optimize both global and local reduction operations, especially for algorithms that need to specialize general graphs. And their presence is becoming more common than ever.
Trade-offs
Lack of Exposed Hardware Banding
Although it would be possible to significantly increase the market penetration of the subgroup operations extension by banding it into permutation and reduction tiers similar to Metal, that direction increases the API surface, possibly adding cruft for a very narrow use case. Moreover, indicators of next-generation mobile hardware show that it will almost ubiquitously support reduction operations.
Exclusion of Quad Operations
This proposal excludes quad operations from the definition of subgroup operations. New hardware reports on Adreno and PowerVR show a lack of quad support. Also, excluding quad operations makes it easier to avoid the more ambiguous operations, delegating their presence to a proper quad operations extension.
Exclusion of Indexed or Masked Operations
This proposal excludes indexed or masked operations to avoid undefined behavior on divergence, reconvergence, and possibly out-of-bounds indexing. The current set of exposed operations is implicitly active on all APIs.
Presence of Extension for APIs
* D3D12: D3D12_FEATURE_DATA_D3D12_OPTIONS1.WaveOps
* Metal: MTLDevice.supportsFamily(MTLGPUFamilyMac2) (needs clarification: MTLDevice.supportsFamily(MTLGPUFamilyApple6))
* Vulkan: (VkPhysicalDeviceSubgroupProperties.supportedOperations & (VK_SUBGROUP_FEATURE_BASIC_BIT | VK_SUBGROUP_FEATURE_VOTE_BIT | VK_SUBGROUP_FEATURE_ARITHMETIC_BIT | VK_SUBGROUP_FEATURE_BALLOT_BIT)) == (VK_SUBGROUP_FEATURE_BASIC_BIT | VK_SUBGROUP_FEATURE_VOTE_BIT | VK_SUBGROUP_FEATURE_ARITHMETIC_BIT | VK_SUBGROUP_FEATURE_BALLOT_BIT) && (VkPhysicalDeviceSubgroupProperties.supportedStages & VK_SHADER_STAGE_COMPUTE_BIT)
Related Issues
Preview | Diff