
Introduce Subgroup Operations Extension #954


Closed
wants to merge 26 commits into from

Conversation

mehmetoguzderin
Member

@mehmetoguzderin mehmetoguzderin commented Jul 24, 2020

Moved to #1459

Preview WebGPU Changes: https://mehmetoguzderin.github.io/webgpu/webgpu.html
Preview WGSL Changes: https://mehmetoguzderin.github.io/webgpu/wgsl.html
Preview Argdown: https://kvark.github.io/webgpu-debate/SubgroupOps.component.html

This pull request works toward #667 for the standard library. To that end, it introduces a first form of a subgroup operations extension to the host and device specifications. Host exposure is directly deducible for all host APIs since the extension is compute-only, and the set of device instructions is the greatest common factor of the target APIs, minus operations that take a mask or an invocation index.

Motivation

Subgroup operations provide a speed-up proportional to the subgroup size. They offer a great opportunity to optimize both global and local reduction operations, especially for algorithms that need to specialize general graphs. And their presence in hardware is more common than ever.

Trade-offs

Lack of Exposed Hardware Banding

Although it would be possible to increase the market penetration of a subgroup operations extension significantly by banding it into permutation and reduction tiers, similar to Metal, such a direction increases the API surface, possibly accumulating cruft for a very narrow use case. Moreover, indicators for next-generation mobile hardware show that it will almost ubiquitously support reduction operations.

Exclusion of Quad Operations

This proposal excludes quad operations from the definition of subgroup operations. New hardware reports for Adreno and PowerVR show a lack of quad support. Also, excluding quad operations makes it easier to avoid the more ambiguous operations, delegating their presence to a proper quad operations extension.

Exclusion of Indexed or Masked Operations

This proposal excludes indexed or masked operations to avoid undefined behavior around divergence, reconvergence, and possibly out-of-bounds indexing. The current set of exposed operations is implicitly active on all APIs.

Presence of Extension for APIs

DirectX 12: D3D12_FEATURE_DATA_D3D12_OPTIONS1.WaveOps

Metal: MTLDevice.supportsFamily(MTLGPUFamilyMac2) (needs clarification: MTLDevice.supportsFamily(MTLGPUFamilyApple6))

Vulkan: VkPhysicalDeviceSubgroupProperties.supportedOperations must contain all of VK_SUBGROUP_FEATURE_BASIC_BIT, VK_SUBGROUP_FEATURE_VOTE_BIT, VK_SUBGROUP_FEATURE_ARITHMETIC_BIT, and VK_SUBGROUP_FEATURE_BALLOT_BIT, and VkPhysicalDeviceSubgroupProperties.supportedStages must contain VK_SHADER_STAGE_COMPUTE_BIT
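The Vulkan check is easy to get wrong: ANDing distinct feature bits together produces zero, so the presence test must mask supportedOperations with the OR of the required bits and compare against that same mask. A minimal CPU-side sketch of the logic, with bit values taken from vulkan_core.h (treat the constants as assumptions here, not an import of the real headers):

```python
# Bit values as defined in vulkan_core.h (restated here as assumptions).
VK_SUBGROUP_FEATURE_BASIC_BIT      = 0x00000001
VK_SUBGROUP_FEATURE_VOTE_BIT       = 0x00000002
VK_SUBGROUP_FEATURE_ARITHMETIC_BIT = 0x00000004
VK_SUBGROUP_FEATURE_BALLOT_BIT     = 0x00000008
VK_SHADER_STAGE_COMPUTE_BIT        = 0x00000020

REQUIRED_OPS = (VK_SUBGROUP_FEATURE_BASIC_BIT
                | VK_SUBGROUP_FEATURE_VOTE_BIT
                | VK_SUBGROUP_FEATURE_ARITHMETIC_BIT
                | VK_SUBGROUP_FEATURE_BALLOT_BIT)

def supports_subgroup_extension(supported_operations: int,
                                supported_stages: int) -> bool:
    """True only if all four required feature bits and the compute stage are set."""
    return ((supported_operations & REQUIRED_OPS) == REQUIRED_OPS
            and (supported_stages & VK_SHADER_STAGE_COMPUTE_BIT) != 0)
```

Note that `VK_SUBGROUP_FEATURE_BASIC_BIT & VK_SUBGROUP_FEATURE_VOTE_BIT` is 0, which is why the mask-and-compare form is needed rather than chaining the bits with `&`.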

Related Issues



@litherum
Contributor

litherum commented Jul 25, 2020

This looks really good.

Some things that I think need to be stated for the record:

  1. Investigation: Querying Subgroup Support #78 doesn't list a specific use case. I think we should explicitly list at least one motivating example.
  2. It should be stated why the specific subgroup size is only exposed in the shading language and not anywhere else.
  3. It should be stated why subgroup size is exposed as a minimum and maximum value in the API instead of a single value.
  4. It should be stated why the subgroup size is exposed on the compute pipeline object instead of somewhere else in the API.
  5. It should be stated why you chose these specific functions to be present.

Some additional questions:

  1. Once we introduce the concept of subgroups, what else will we need? Will we need SIMD-group barriers? Will we need to incorporate the concept of SIMD into uniformity analysis?
  2. Can we call them "SIMD groups" instead of "subgroups"? I find "SIMD" to be more clear.
  3. I think this is the first extension that affects the shading language. What happens when you use these functions without enabling the extension? What happens when a call to these functions exists, but only in dead code? What happens when a call to these functions exists inside a function that is not transitively reachable from the entry point being compiled?
  4. What are the uniformity requirements? How detailed are we going to have to be regarding specifying "helper threads?"
  5. Shouldn't subgroup_size(), subgroup_local_index(), and subgroup_is_first() just be built-in decorations? Standard library functions that take 0 arguments don't make much sense.
  6. Can subgroup_ballot(bool) return vec2<u32> instead? I'm not aware of any hardware with SIMD width > 64.
  7. What happens if subgroup_broadcast()'s second argument is out-of-bounds?
  8. Do we want subgroup_active_threads_mask() too?
  9. Why shouldn't we specify quad operations in fragment shaders while we're at it? (And what about subgroup operations in fragment shaders?) I didn't quite understand the explanation above in the proposal.

@mehmetoguzderin
Member Author

mehmetoguzderin commented Jul 25, 2020

This looks really good.

Some things that I think need to be stated for the record:

  1. Investigation: Querying Subgroup Support #78 doesn't list a specific use case. I think we should explicitly list at least one motivating example.

Added to the PR explainer.

  2. It should be stated why the specific subgroup size is only exposed in the shading language and not anywhere else.

It is now exposed on both host and device; the earlier choice was a mistake stemming from a misreading of the DirectX 12 spec. It turns out that WaveLaneCountMax is just a reserved name.

  3. It should be stated why subgroup size is exposed as a minimum and maximum value in the API instead of a single value.

Explained in the previous question.

  4. It should be stated why the subgroup size is exposed on the compute pipeline object instead of somewhere else in the API.

That's how the Metal API exposes the subgroup size. One might assume that the subgroup size should be obtainable by creating a compute pipeline, but if a device (maybe in the near future) can execute different kernels in different regions with different subgroup sizes (which would give a rationale for Metal's choice) depending on the requirements of the kernel or on power preference, then that would be a wrong assumption.

  5. It should be stated why you chose these specific functions to be present.

Essentially, HLSL lacks shuffle and relative-shuffle operations, and MSL lacks the all-equal operations. Removing those leaves us with this subset.

Some additional questions:

  1. Once we introduce the concept of subgroups, what else will we need? Will we need SIMD-group barriers? Will we need to incorporate the concept of SIMD into uniformity analysis?

Uniformity analysis could be delegated to another investigation that will also clarify uniformity for derivative functions, but here is the related wording from the DirectX 12 spec: "A set of lanes (threads) executed simultaneously in the processor. No explicit barriers are required to guarantee that they execute in parallel. Similar concepts include 'warp' and 'wavefront.'"

  2. Can we call them "SIMD groups" instead of "subgroups"? I find "SIMD" to be more clear.

I also think that can be better.

  3. I think this is the first extension that affects the shading language. What happens when you use these functions without enabling the extension? What happens when a call to these functions exists, but only in dead code? What happens when a call to these functions exists inside a function that is not transitively reachable from the entry point being compiled?

I think that can also be delegated to another investigation, but using these functions without the extension should be banned and reported to the user at pipeline-creation time. Allowing the presence of such a function in places that won't be executed could rather be an implementation detail, but giving guarantees about dynamic behavior when rejecting on these predicates might be impossible.

  4. What are the uniformity requirements? How detailed are we going to have to be regarding specifying "helper threads?"

It seems like, in the context of compute kernels, the concept of helper threads might be irrelevant. Related specification from Metal: "simd_is_helper_thread(): If this is neither called inside a fragment function nor called inside a function called from a fragment function, the behavior is undefined and the call may cause a compile-time error."

  5. Shouldn't subgroup_size(), subgroup_local_index(), and subgroup_is_first() just be built-in decorations? Standard library functions that take 0 arguments don't make much sense.

Subgroup size can be a built-in decoration but I think the other two should be preserved as functions.

  6. Can subgroup_ballot(bool) return vec2<u32> instead? I'm not aware of any hardware with SIMD width > 64.

That's correct, fixed.
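Whether the return type ends up as vec2<u32> or vec4<u32>, the packing is the same: one bit per lane, 32 lanes per word. Here is a hypothetical CPU-side model of subgroup_ballot using four words (enough for a 128-wide subgroup); the function name and representation are illustrative, not from the spec:

```python
def subgroup_ballot(predicates):
    """Pack one vote bit per lane into four 32-bit words (up to 128 lanes)."""
    words = [0, 0, 0, 0]
    for lane, voted in enumerate(predicates):
        if voted:
            words[lane // 32] |= 1 << (lane % 32)
    return words
```

For example, lanes 0 and 64 voting true set bit 0 of words 0 and 2 respectively, which shows why a width-128 subgroup needs four 32-bit components.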

  7. What happens if subgroup_broadcast()'s second argument is out-of-bounds?

Undefined behavior.

  8. Do we want subgroup_active_threads_mask() too?

That function doesn't exist in HLSL but can be easily emulated. Should we include it?

  9. Why shouldn't we specify quad operations in fragment shaders while we're at it? (And what about subgroup operations in fragment shaders?) I didn't quite understand the explanation above in the proposal.

Grouping these two together into one extension reduces the potential of both operation groups. Firstly, where quad operations exist, the extent of subgroup support isn't always large enough to accommodate the reduction operations proposed here. And where subgroup operations exist, there are cases where support is compute-only yet does accommodate the set of operations proposed here. Quad operations, where supported, definitely have shuffle operations. If we were to split the subgroup extension so that it contained quad operations, we would end up with three different extensions, and the discussion of undefined functions would get more nuanced.

wgsl/index.bs Outdated
<tr><td>Subgroup built-in functions<td>SPIR-V
</thead>
<tr><td>subgroup_size() -&gt; u32<td>SubgroupSize
<tr><td>subgroup_local_index() -&gt; u32<td>SubgroupLocalInvocationId
Member

These two appear to be builtins in SPIR-V, not function calls.

Member Author

Moved, but I am not sure whether things could be tidier.

wgsl/index.bs Outdated
<tr><td>subgroup_or(*T*) -&gt; *T*<td>OpGroupNonUniformBitwiseOr
<tr><td>subgroup_xor(*T*) -&gt; *T*<td>OpGroupNonUniformBitwiseXor
<tr><td>subgroup_prefix_add(*T*) -&gt; *T*<td>OpGroupNonUniformIAdd or OpGroupNonUniformFAdd with ExclusiveScan
<tr><td>subgroup_prefix_mul(*T*) -&gt; *T*<td>OpGroupNonUniformIMul or OpGroupNonUniformFMul with ExclusiveScan
Member

For all these where the param is T, what are the possible values of T?

Member Author

I have tried to explicitly state the possible types, does it look OK now?
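For readers unfamiliar with the scan variants in the table above: subgroup_prefix_add maps to an ExclusiveScan, meaning lane i receives the sum of lanes 0..i-1 (and lane 0 receives the identity), while the plain reductions give every lane the same combined value. A CPU-side model of those semantics (function names mirror the proposal; the sequential implementation is purely illustrative):

```python
def subgroup_add(values):
    """Reduction: every lane receives the sum over the whole subgroup."""
    total = sum(values)
    return [total] * len(values)

def subgroup_prefix_add(values):
    """Exclusive scan: lane i receives the sum of lanes 0..i-1; lane 0 gets 0."""
    out, running = [], 0
    for v in values:
        out.append(running)   # value before adding this lane's contribution
        running += v
    return out
```

This also makes the difference from an inclusive scan concrete: for input [1, 2, 3, 4], the exclusive scan yields [0, 1, 3, 6] rather than [1, 3, 6, 10].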

@Kangz
Contributor

Kangz commented Jul 27, 2020

In general I think the subgroup extensions should be done in more piecemeal chunks because not all hardware supports all subgroup functionality. We don't need to match the segmentation of other APIs perfectly (because some amount of emulation is possible), but AFAIK things like quad ops aren't supported everywhere and aren't possible to emulate.

Also shouldn't the subgroup size be a constant for a given device?

  6. Can subgroup_ballot(bool) return vec2 instead? I'm not aware of any hardware with SIMD width > 64.

Imagination has announced hardware with a SIMD width of 128 (yeah, that's huge, I agree).

@mehmetoguzderin
Member Author

mehmetoguzderin commented Jul 27, 2020

In general I think the subgroup extensions should be done in more piecemeal chunks because not all hardware supports all subgroup functionality. We don't need to match the segmentation of other APIs perfectly (because some amount of emulation is possible), but AFAIK things like quad op aren't supported everywhere and not possible to emulate.

That's exactly why I excluded quad operations, but I think the subset of subgroup operations proposed here makes the most of existing API surfaces while considering the mobile hardware about to come to market. My original idea was to break this PR into two extensions, subgroup-permute and subgroup-reduce, but soon enough all bands of subgroup-permute hardware will migrate to subgroup-reduce. One could further think about subgroup-shuffle and subgroup-relative-shuffle, but they are not supported on all APIs, and there is no indication that the upgrading bands are going to include them either. With these considerations in mind, we can deduce a sensible single extension that provides for the necessary use cases of subgroup operations.

Also shouldn't the subgroup size be a constant for a given device?

In WWDC20 slides, Apple mentions that applications should rely on limits where they are exposed for a successful transition to Apple Silicon. But again, I am not sure about why the subgroup size was exposed under compute pipeline in Metal, though there can be valid reasons.

  6. Can subgroup_ballot(bool) return vec2 instead? I'm not aware of any hardware with SIMD width > 64.

Imagination has announced hardware with a SIMD width of 128 (yeah, that's huge, I agree).

Reverted to vec4

@mehmetoguzderin
Member Author

Vulkan extension mentioned today: VK_EXT_subgroup_size_control

@kvark
Contributor

kvark commented Jul 27, 2020

Following up on the call today. I asked if on Vulkan/Intel we can ask the driver about what subgroup size does a particular compute pipeline have. If we can do this, a Metal-like threadExecutionWidth query becomes possible. Apparently, there is an extension in Vulkan that exposes this information - VK_KHR_pipeline_executable_properties . It's relatively fresh, and the support on Windows Vulkan/Intel only comes in a few reports, but timeline-wise, it looks like it can be there by the time we ship.

@tex3d

tex3d commented Jul 28, 2020

Regarding:

Exclusion of Quad Operations

This proposal excludes quad operations from the definition of subgroup operations. On some APIs, quad operations are strictly fragment-shader only. In contrast, we restrict the use of subgroup operations to compute-shader only to make up for the lack of exposed hardware banding. Thus, any plan to include quad operations should be proposed separately, possibly as an extension of its own.

This is a link to an old slide deck that predates the implementation of Wave intrinsic support for SM 6.0+. Quad operations are in fact supported in compute shaders for DX in SM 6.0+.

See Quad-Wide Shuffle Operations.

I see further reasoning based on the assumption that quad operations are not supported everywhere subgroup operations are supported (under compute). I don't know if that's entirely based on the assumption around DX, but if it is, further rethinking may be necessary.

@mehmetoguzderin
Member Author

mehmetoguzderin commented Jul 28, 2020

Regarding:

Exclusion of Quad Operations

This proposal excludes quad operations from the definition of subgroup operations. On some APIs, quad operations are strictly fragment-shader only. In contrast, we restrict the use of subgroup operations to compute-shader only to make up for the lack of exposed hardware banding. Thus, any plan to include quad operations should be proposed separately, possibly as an extension of its own.

This is a link to an old slide deck that predates the implementation of Wave intrinsic support for SM 6.0+. Quad operations are in fact supported in compute shaders for DX in SM 6.0+.

See Quad-Wide Shuffle Operations.

I see further reasoning based on the assumption that quad operations are not supported everywhere subgroup operations are supported (under compute). I don't know if that's entirely based on the assumption around DX, but if it is, further rethinking may be necessary.

I also took Adreno and PowerVR reports into account; they don't have quad support. But the DirectX assumption was a misinterpretation on the investigation's side.

@tex3d

tex3d commented Jul 28, 2020

I also took Adreno and PowerVR reports into account; they don't have quad support. But the DirectX assumption was a misinterpretation on the investigation's side.

Thanks, I was concerned the only reason for the limitation might be DX.

@Kangz
Contributor

Kangz commented Jul 29, 2020

Ok here's a small investigation to help with the discussion of fixed vs. variable size subgroups.

Investigation on subgroup sizes for WebGPU

Getting adapter/device information about the size of subgroups.

D3D12 has D3D12_FEATURE_DATA_D3D12_OPTIONS1.WaveLaneCountMin and .WaveLaneCountMax, which say between which sizes pipeline subgroup sizes will fall.

Metal doesn't have an adapter/device query for the subgroup sizes.

Vulkan only has VkPhysicalDeviceSubgroupProperties.subgroupSize, which is "the default number of invocations in each subgroup". However, on some hardware that's just a hint and the actual subgroup size can vary. All recent drivers for hardware that has a variable subgroup size expose VK_EXT_subgroup_size_control, which gives VkPhysicalDeviceSubgroupSizeControlPropertiesEXT.minSubgroupSize and .maxSubgroupSize to query the min/max for a device.

Getting pipeline information about the size of subgroups

D3D12 doesn't have anything.

Metal exposes the actual subgroup size via MTLComputePipelineState.threadExecutionWidth.

Vulkan doesn't have anything in core but VK_EXT_subgroup_size_control allows forcing the size of subgroups at pipeline compilation time via VkPipelineShaderStageRequiredSubgroupSizeCreateInfoEXT.requiredSubgroupSize. VK_KHR_pipeline_executable_properties allows querying the used subgroup size directly.

Devices with variable subgroup size and conclusion

Extracting data from vulkan.gpuinfo.org, most devices have a fixed subgroup size except Intel GPUs (8, 16, or 32) and AMD RDNA GPUs (32 or 64). I think it is safe to assume Apple GPUs have a fixed subgroup size like Imagination GPUs, but it would be nice to have confirmation.

The investigation above shows that we cannot get a per-device/adapter fixed subgroup size because it's not available on Metal. It also shows we cannot get a per-pipeline subgroup size because it's not available on D3D12 (although it could be on Vulkan depending on extensions).

So maybe in a first step we could have subgroups without any API-side query as to what the subgroup size is.

Future possibility

See the section below for why fixed subgroup sizes would be useful, I think it is crucial for WebGPU to have them. What's nice is that almost all GPUs have a fixed subgroup size except:

  • Intel on D3D12 and Metal (Vulkan has VK_EXT_subgroup_size_control to make it fixed)
  • AMD RDNA on Metal (the RDNA performance guide says that on D3D12 "RDNA runs shader threads in groups of 32 known as wave32" and Vulkan has VK_EXT_subgroup_size_control to make it fixed)

On D3D12 and Vulkan the API / extensions allow us to know whether the subgroup size is fixed by comparing the min and max sizes, and on Metal we can detect with the vendor/device ID (or GPU family) whether the subgroup size is fixed (and get the size by compiling a dummy pipeline).

So I think we could have an extension for fixed subgroup size available everywhere except the configurations listed above. Then if D3D12 allowed controlling the subgroup size (feature request wink wink), only Metal Intel and AMD RDNA wouldn't have the extension. That's probably ok.
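The detection strategy described above can be summarized as a small decision helper. This is a sketch of the rule as stated in the comment, not an implementation against any real API; the function name, parameters, and the vendor strings are all illustrative assumptions:

```python
def subgroup_size_is_fixed(api, min_size=None, max_size=None, vendor=None):
    """Sketch of the fixed-subgroup-size detection rule (illustrative only)."""
    if api in ("d3d12", "vulkan"):
        # On D3D12 and Vulkan, the size is fixed when the reported
        # min and max subgroup sizes agree.
        return min_size == max_size
    if api == "metal":
        # Metal has no min/max query; fall back to a vendor/GPU-family
        # allowlist. Intel and AMD RDNA are the known variable-size cases.
        return vendor not in ("intel", "amd-rdna")
    raise ValueError(f"unknown API: {api}")
```

On Metal, the actual size for an allowlisted device would then come from compiling a dummy pipeline and reading threadExecutionWidth, as suggested above.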

Why we want a fixed subgroup size

Variable-size subgroups allow for some tricks like scalarization to trade SGPRs vs. VGPRs, but advanced compute algorithms need more control to reach top performance, and even to be correct at all.

I was sitting next to the Spinel team, which does path rasterization using GPU compute in a way that's different from, but not 100% unlike, this Raph Levien blog post. The pipeline is 100% compute based and uses a ton of subgroup operations for performance. I asked one of their engineers why they need fixed subgroup sizes; paraphrased below:

To have an efficient compute pipeline Spinel has specialization for different hardware based on their shared-memory size, subgroup size etc. The specialization is important as it can help get a 2 to 8x performance boost. It is done at (Spinel) compilation time by using template files that produce GLSL for each configuration, but it assumes that the subgroup size is known as a constant at template-instantiation time.

On Intel the fixed subgroup size chosen at template generation time wasn't necessarily the one used by the driver, leading to data corruption and horribly difficult bugs to figure out. That's why Spinel requires VK_EXT_subgroup_size_control on Intel.

There are various strategies for how a project like Spinel could run on WebGPU: generate templates at page-load time based on data from the GPUAdapter, or pre-generate shaders for each subgroup size and load the correct one. Note that all of this requires fixed subgroup sizes to work, and it would be a big mistake if WebGPU didn't enforce fixed subgroup sizes (or control over them).

Finally, it would be nice if, in a "basic profile", WebGPU could give a "minimum" subgroup size so that all algorithms know they can run with, for example, subgroup size 8, and request more if available.

Note that Spinel is just one heavy user of subgroups but there are many more. For example the state of the art prefix-sum algorithm "Single-pass Parallel Prefix Scan with Decoupled Look-back" uses subgroups and needs the subgroup size to size arrays in shared memory. There are many more examples.
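The shared-memory sizing problem mentioned above can be made concrete: a typical workgroup-level scan stores one partial sum per subgroup in shared memory, so the partials array has length workgroup_size / subgroup_size, which must be a compile-time constant. Below is a simplified two-level CPU model of that structure (not the decoupled look-back algorithm itself, just an illustration of why the subgroup size enters the data-layout math):

```python
def workgroup_exclusive_scan(values, subgroup_size):
    """Two-level exclusive scan: scan within each subgroup, then scan
    the per-subgroup sums and add each subgroup's base back in."""
    assert len(values) % subgroup_size == 0
    num_subgroups = len(values) // subgroup_size  # sizes the 'partials' array
    partials, out = [], []
    for s in range(num_subgroups):
        chunk = values[s * subgroup_size:(s + 1) * subgroup_size]
        running = 0
        for v in chunk:            # models subgroup_prefix_add within a subgroup
            out.append(running)
            running += v
        partials.append(running)   # one shared-memory slot per subgroup
    base, bases = 0, []
    for p in partials:             # exclusive scan over the per-subgroup sums
        bases.append(base)
        base += p
    return [out[i] + bases[i // subgroup_size] for i in range(len(values))]
```

If the template assumed subgroup_size 32 but the driver ran the shader at 16, the partials indexing above would be wrong, which matches the data-corruption symptom described for Intel.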

Other things to figure out.

  • @dneto0 mentioned that non-uniform subgroup operations made everything extremely complicated. We need to figure out the uniformity constraints and what happens on non-uniform control flow (especially with NVIDIA Ampere having multiple program counters per subgroup).
  • We need to study the market penetration of each subgroup operation to know how to bucket them in multiple extensions (or whether we need to, maybe we could say everything is available but have a flag on the adapter showing what's emulated).
  • AFAIK in Vulkan the SubgroupSize builtin can be used as a specialization constant to size arrays, which might remove some of the need for a fixed subgroup size. We should see if the same is possible in HLSL and MSL.

@mehmetoguzderin
Member Author

mehmetoguzderin commented Jul 29, 2020

I want to comment on a few points:

D3D12 has D3D12_FEATURE_DATA_D3D12_OPTIONS1.WaveLaneCountMin and .WaveLaneCountMax, which say between which sizes pipeline subgroup sizes will fall.

It is important to note that the max variant is just a placeholder according to the specification.

  • We need to study the market penetration of each subgroup operation to know how to bucket them in multiple extensions (or whether we need to, maybe we could say everything is available but have a flag on the adapter showing what's emulated).

I think emulation can kill the benefits on hardware that doesn't have reduction operations (and shuffle and relative shuffle don't exist in DirectX's HLSL). Taken further, considering emulation to make single-code-path variants possible could go all the way to exposing a subgroup size of one on hardware not fit to present the extension. I think that's not right for WebGPU and could set a malformed precedent that makes things extra complicated for implementers. In such an environment, the user can fall back to workgroup-shared variants of their algorithms instead of relying on WebGPU's emulation. (Though I think Metal's simdgroup-permute and simdgroup-reduce banding is good and could be adopted if the group thinks having two extensions is fine.)

@mehmetoguzderin
Member Author

mehmetoguzderin commented Jul 29, 2020

Also, I'd like to ask the group's opinion on using simdgroup as the name. It creates a good analogy and makes all thread-group names the same length:
workgroup
simdgroup
quadgroup

@dneto0
Contributor

dneto0 commented Jul 31, 2020

Subgroups are an awesome feature. But they have very big caveats, particularly how graphics APIs support them.

There's a distinction between a uniform subgroup model, and a non-uniform subgroup model.

Uniform subgroup model

The cl_khr_subgroups extension to OpenCL C 2.0 (and later in core OpenCL 2.1) uses a uniform subgroup model: each subgroup operation must be collectively executed by all the invocations in the subgroup. If not all invocations participate, then it's undefined behaviour.
For example, from "28.2.4. Additions to section 6.13.15 — Work Group Functions" in the linked version of the OpenCL 2.2 extensions spec:

The OpenCL C programming language implements the following built-in functions that operate on a subgroup level. These built-in functions must be encountered by all work items in a subgroup executing the kernel.

Nonuniform subgroup model

Vulkan, D3D, and Metal use a non-uniform model: Not all invocations in a subgroup have to execute a subgroup operation.

Ambiguities about divergence

When invocations in a subgroup are executing "together", how long is that guaranteed?

E.g. a subgroup barrier causes invocations in a subgroup to wait for each other before any can continue executing. They are subgroup-uniform when they leave the barrier. But how long does that last? Certainly it's broken by control flow where different invocations take different paths. But can it be broken sooner than that?

  subgroup_barrier();
  x = cos(y);   // divergence here?  What if it's a user function? Or some other complex expression?
  if ( subgroup_id % 2 == 0 ) { // pretty certain divergence here
  }
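One way to make the ambiguity concrete is to model a subgroup operation whose result depends on which lanes are active when it executes. If the implementation diverges at the cos(y) call (or anywhere else), the active set shrinks and the result changes. This is a purely conceptual toy model, not any API's defined behavior:

```python
def subgroup_sum_over_active(values, active_mask):
    """Each active lane receives the sum over the currently active lanes only;
    inactive lanes (modeled as None) do not contribute or receive a result."""
    total = sum(v for v, active in zip(values, active_mask) if active)
    return [total if active else None for active in active_mask]
```

With all four lanes of a toy subgroup active, every lane sees the full sum; after diverging on (subgroup_id % 2 == 0), the even lanes see only the sum over even lanes. Whether the implementation is allowed to shrink the active set at the cos(y) line is exactly the question the surrounding text raises.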

Ambiguities about reconvergence

Results of subgroup operations are affected by which invocations are executing together (or by whether you even have undefined behaviour). But once invocations diverge, what are the guarantees about where you reconverge?

Vulkan/SPIR-V has extremely weak guarantees: you either have full workgroup uniformity, or you don't. Getting back to full workgroup uniformity requires all invocations in the workgroup to exit a structured control-flow construct (to its merge block) that the whole workgroup had collectively entered. There is no rule for finer-grained reconvergence.

I don't think D3D has anything stronger than that.

I don't know enough about MSL here.

Ambiguities about forward progress

Subgroups introduce a question of forward progress:

  • Does one subgroup block progress being made by a different subgroup? Under what conditions?
  • When a subgroup is executing non-uniformly, do some invocations block progress by other invocations in the same subgroup?

D3D, Metal, and Vulkan are silent on both of these. This leads to non-portability.

This interacts very strongly with atomics (and loops).

Ambiguities about helper invocations

In a fragment shader, some invocations could be helper invocations, or could have been converted to one by a (D3D-style) discard. Do those helper invocations participate in subgroup operations?

I believe Vulkan/SPIR-V is silent on this, and there may be different behaviours.

MSL says that helper invocations are not "active".

I don't know enough about D3D to say.

Summary

Subgroups have many sharp corners that introduce ambiguity, non-portability, and undefined behaviour. The target APIs don't have sufficiently tight rules to allow good, portable, and reliable subgroup features.
I think subgroups are not a good candidate for inclusion in the WebGPU MVP.

@litherum
Contributor

litherum commented Aug 3, 2020

WebGPU telecon today approved the API-side of this PR.

The shading-language side of this PR still needs to be approved.

@mehmetoguzderin
Member Author

mehmetoguzderin commented Sep 10, 2020

Based on the sample I built for the W3C Machine Learning Workshop, I have compared execution speed against atomic (since atomics do not support floating-point operations, numerical loss happens with them) and shared-memory alternatives. It turns out the SIMD version, which only uses the set provided in this PR, beats the others by at least 2x on both Intel and Nvidia hardware. Such an execution-time difference can make a real impact on exploratory data analysis applications, and potentially on any application that hopes to run on the GPU in a portable setting (to avoid battery drain and heat). And this is consistent with the findings of state-of-the-art particle methods, as mentioned in this slide deck, which gain a ~10x speed increase from SIMD operations.

(Attached chart: webgpu-20200910-simdgroup)

@litherum
Contributor

I transcribed @dneto0's above example to Metal: Convergence.zip

@litherum
Contributor

And here it is transcribed to D3D12: Convergence.zip

@dj2
Member

dj2 commented Sep 14, 2020

@litherum did you run them on Metal and D3D12? Did you get results similar to Vulkan where it diverges?

@litherum
Contributor

litherum commented Sep 16, 2020

I did some data gathering:

OS       Vendor     GPU                             Behavior
Windows  Intel      Intel(R) HD Graphics 520        Hang
Windows  Intel      Intel(R) UHD Graphics 620       ✅
Windows  AMD        Radeon RX 560 Series            ✅
Windows  Nvidia     NVIDIA GeForce GTX 965M         ✅
Windows  Nvidia     NVIDIA GeForce GTX 1060         ✅
Windows  Nvidia     NVIDIA GeForce RTX 2080 Ti      ✅
Windows  Microsoft  Microsoft Basic Render Driver   ✅
Windows  Qualcomm   Qualcomm(R) Adreno(TM) 680 GPU  Doesn't support wave ops
macOS    AMD        AMD Radeon RX 570               ✅
macOS    AMD        AMD Radeon RX 560               ✅
macOS    AMD        AMD Radeon Pro 560              ✅
macOS    AMD        AMD Radeon Pro 570              ✅
macOS    AMD        AMD Radeon Pro Vega 56          ✅
macOS    AMD        AMD Radeon HD - FirePro D500    ✅
macOS    Intel      Intel(R) HD Graphics 630        ✅
macOS    Intel      Intel(R) HD Graphics 5300       ✅
macOS    Intel      Intel(R) HD Graphics 515        ✅
macOS    Apple      Apple M1                        255 255 255 255 (rdar://problem/73006980)

"✅" means the output was "255 240 240 240".
The Windows machines used D3D12, the macOS machines used Metal.

@qjia7

qjia7 commented Sep 18, 2020

@litherum I didn't hit the hang issue on Intel(R) HD Graphics 520 with driver 27.20.100.8587 on Windows. The result is 255 240 240 240.

@litherum
Contributor

litherum commented Sep 21, 2020

@qjia7

@litherum I didn't hit the hang issue on Intel(R) HD Graphics 520 with driver 27.20.100.8587 on Windows. The result is 255 240 240 240.

I'm on driver 24.20.100.6293. This is the one that Windows Update gave me. Is there a utility outside of Windows Update to update an Intel GPU driver?


@litherum
Contributor

@dneto0

my desktop GPU, a mainstream NVIDIA workstation GPU card

Which card is it? We'd like to investigate further.

@kainino0x
Contributor

@litherum you probably have the latest OEM Intel graphics drivers for your system. To get newer drivers you can try the Intel driver assistant, but if it says something like 'you have OEM drivers that can't be updated by the assistant', I think you can download a specific driver package, probably this one. It might disable some hardware integration (like certain display-panel-specific features).

@qjia7

qjia7 commented Sep 22, 2020

Is there a utility outside of Windows Update to update an Intel GPU driver?

Like Kai said, you need to download a specific driver package from here. But the installer may refuse to upgrade the driver. The workaround is to uninstall the driver via Device Manager, then install the downloaded Intel driver package.

@gyagp

gyagp commented Sep 22, 2020

You may find the detailed instructions to install Intel Graphics Driver on OEM devices at https://docs.google.com/document/d/1Fr5hi6BqlLVaJJoZEN7sGjukF4kM2qOAFb8mtbYx1Fo/edit#heading=h.4rbfm5zbtbyd

@grorg
Contributor

grorg commented Sep 29, 2020

Discussed at the 2020-09-29 meeting.

@dneto0
Contributor

dneto0 commented Oct 7, 2020

FYI. Nicolai Hähnle will be presenting at the LLVM Dev Meeting on Thursday (October 8). "Evolving “convergent”: Lessons from Control Flow in AMDGPU"
The abstract is:

GPUs execute many threads of a program in lock-step by mapping them to lanes of a SIMD vector that we call “wave”. Modern GPU programming languages have cross-lane operations such as shuffles, ballots, and barriers that exchange data between the lanes of a wave. When such operations execute in divergent control flow (lanes of a wave following different paths through the CFG), only a subset of lanes participate in this data exchange. A key part of defining the semantics of cross-lane operations is defining how this subset is determined.

In LLVM, the only tool available today to help in this definition is the convergent attribute. We argue that its definition is subtly broken and insufficient for expressing and preserving the desired behavior of cross-lane operations. We propose a new definition of convergent as well as the concept of “convergence tokens” and related intrinsics that allow frontends to describe the desired semantics of cross-lane operations in IR in a way that is easy to maintain by generic transforms. We also briefly touch on how these intrinsics are used by a new “wave transform” (whole program vectorization that lowers from thread-level CFG to wave-level CFG) in the AMDGPU backend.

https://llvm.org/devmtg/2020-09/schedule/

@litherum
Contributor

litherum commented Jan 9, 2021

I've updated #954 (comment) to include the Apple M1 GPU.

@kdashg kdashg added this to the post-MVP milestone Jan 12, 2021
@dneto0
Contributor

dneto0 commented Jan 12, 2021

My example was run on an NVIDIA Quadro P1000

@dj2 dj2 added the wgsl WebGPU Shading Language Issues label Feb 17, 2021
@mehmetoguzderin
Member Author

Since this PR's branch became an orphan, I'm moving the discussion to PR #1459.

ben-clayton pushed a commit to ben-clayton/gpuweb that referenced this pull request Sep 6, 2022
* Plan api,operation,memory_sync,texture,*

* Address review feedback

* formatting
Labels
wgsl WebGPU Shading Language Issues