add subgroups, and make them portable if possible #4306
While I don't want to push too much R&D work into WebGPU, I do think that a good number of people who would want subgroups might be happier with operations that work across the full workgroup. Related: microsoft/hlsl-specs#105 |
I would personally love workgroup-level reduction primitives, but yeah, it would need more development than subgroups, and I would be happy just to get subgroups. Also, I want to note that on some GPU architectures, subgroup size depends on the final compiled shader and is up to the driver. I'm unsure whether polyfilling workgroup-level operations on top of subgroup ops would be possible on these platforms without cooperation from the driver. I lack the experience to say either way; I just want to bring it up as a point. |
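As a purely illustrative sketch of what such a polyfill could look like (the built-ins follow the subgroups proposal discussed in this thread; the buffer names, workgroup size, and the assumption that subgroups are packed linearly by local invocation index are all hypothetical, and the second step only works when the number of subgroups fits within one subgroup, which is exactly the size-dependence problem raised above):

```wgsl
// Hypothetical two-level workgroup sum built from subgroup operations.
enable subgroups;

@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> output: array<f32>;

// Sized for the worst case of one invocation per subgroup.
var<workgroup> partials: array<f32, 256>;

@compute @workgroup_size(256)
fn main(@builtin(local_invocation_index) lid: u32,
        @builtin(subgroup_invocation_id) sid: u32,
        @builtin(subgroup_size) ssize: u32) {
  // Step 1: each subgroup reduces its slice.
  // Assumes subgroups are packed linearly over local_invocation_index.
  let partial = subgroupAdd(input[lid]);
  if (sid == 0u) {
    partials[lid / ssize] = partial;
  }
  workgroupBarrier();

  // Step 2: the first subgroup reduces the per-subgroup partials.
  // Caveat: this only works when the number of subgroups (256 / ssize)
  // is <= ssize, which depends on the runtime subgroup size.
  let num_subgroups = 256u / ssize;
  if (lid < ssize) {
    let v = select(0.0, partials[lid], lid < num_subgroups);
    let total = subgroupAdd(v);
    if (lid == 0u) {
      output[0] = total;
    }
  }
}
```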
Workgroup operations are out of scope for this issue, but feel free to file a new issue with that request so we can track it. I expect the answer will be not until it is exposed by underlying APIs. |
Regarding the comment by @JMS55: yes, having unpredictable subgroup size is one of the major portability challenges, and that is the primary point of #3950. That issue contains a concrete proposal which I hope will be considered carefully. The core of it is that minimum and maximum subgroup size are constants that can be used at compile time, for example to size arrays, and the actual subgroup size is available at runtime. That may sound pretty basic, but underlying APIs (with the exception of Vulkan 1.3) are hostile to reliably providing that information. I also recommend prioritizing subgroup operations in uniform control flow. All of the use cases I care deeply about are effectively in uniform control flow. My intuition is that specifying the semantics of those use cases might be easier than the fully general case, as well as getting shader compilers to emit code consistent with that spec. |
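A minimal sketch of the #3950 idea described above (the name `min_subgroup_size` is illustrative, not shipped WGSL; the proposal envisions the bound as an API-provided constant, modeled here as a pipeline-overridable constant for concreteness):

```wgsl
// Hypothetical: a compile-time lower bound on subgroup size used to size
// a workgroup array, with the actual subgroup size read at runtime.
enable subgroups;

const workgroup_size = 256u;

// Illustrative stand-in for the proposal's minimum-subgroup-size constant.
override min_subgroup_size: u32;

// Worst-case number of subgroups is fixed at pipeline-creation time.
var<workgroup> partials: array<f32, workgroup_size / min_subgroup_size>;
```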
gfx-rs/wgpu#4190 proposes a set of built-in functions and built-in values and implements them for DirectX, Metal, Vulkan and OpenGL. Feedback and ideas are welcome. |
Chrome implemented minimal experimental extensions to investigate how much portability exists with subgroup operations.

Extension

WGSL

The following additions were made to WGSL:
API

The following additions were made to the API:
Requirements
Experiments

Divergence/Reconvergence

gpuweb/cts#2916 implements reconvergence tests for subgroup operations. The tests are based on the Vulkan CTS's experimental reconvergence tests (available here). There are 4 styles of reconvergence tested:
The tests are a combination of predefined and pseudo-randomly generated cases that are swept across the various reconvergence styles. The program is simulated, and those results are compared against the actual GPU results. Ideally, all implementations should pass at least wgsl_v1, but hopefully workgroup and subgroup too. Maximal is more for investigation. There is an additional set of tests (uniform_maximal) that checks the behaviour when all branches are uniform (i.e. no divergence occurs in the workgroup). The expectation is that all implementations should pass these tests. The tests can be run using dawn.node or Chrome Canary.

Results

We collected results from a variety of platforms and devices:

Predefined tests
Random tests
Uniform tests

The PR also contains a set of pseudo-randomly generated tests that always select uniform branches (the uniform_maximal set). These were added later, so we haven't collected as much information from them. All devices should pass these tests.
Subgroup Size

More testing is required to confirm that subgroup sizes are reliable, but early indications are that the requirements placed on Vulkan are sufficient. Metal also appears to be ok in this regard. D3D12 requires more testing. The tests check that a ballot bit count matches the value from the subgroup size built-in value, but the PR does not currently check the newly added limits. I have an experimental patch (that requires IDL and Dawn changes) that verifies the Vulkan behaviour.

Further testing

I haven't been able to test the requires-full-subgroups pipeline flag yet. It has obvious implementations for Metal and Vulkan, but not D3D12. Most implementations seem to do the right thing here anyway, though.

Discussion

Behaviour is not portable. The failures for even wgsl_v1 reconvergence are problematic. This means we cannot produce portable behaviour even by requiring that the built-in functions only be used in uniform control flow. We could specify that behaviour is portable if the workgroup/subgroup does not diverge. That is an important use case in terms of pure acceleration, but it leaves large gaps in the overall portability of the feature. |
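The ballot check described above can be sketched roughly as follows (illustrative only; the output buffer name and workgroup size are assumptions, and the check as written only holds for full subgroups):

```wgsl
// Hypothetical check: every invocation ballots true, so the popcount of
// the ballot should match the reported subgroup size built-in value.
enable subgroups;

@group(0) @binding(0) var<storage, read_write> ok: array<u32>;

@compute @workgroup_size(64)
fn main(@builtin(local_invocation_index) lid: u32,
        @builtin(subgroup_size) ssize: u32) {
  let b = subgroupBallot(true);
  let count = countOneBits(b.x) + countOneBits(b.y)
            + countOneBits(b.z) + countOneBits(b.w);
  // 1 if the ballot is consistent with the built-in value, else 0.
  ok[lid] = select(0u, 1u, count == ssize);
}
```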
Made a proposal in #4368 based on the previous work. |
WGSL 2023-12-05 Minutes
|
I don't understand why we don't have an investigation of the barriers:
subgroupBarrier()
subgroupMemoryBarrier()
subgroupMemoryBarrierBuffer()
subgroupMemoryBarrierShared()
subgroupMemoryBarrierImage()
This topic was raised twice in the minutes. |
FYI, @gfxstrand blogged about what's needed to enforce "maximal reconvergence" on NV hardware. It's quite challenging, and it definitely doesn't happen by accident. edit: Whoops, the link should have been https://www.collabora.com/news-and-blog/blog/2024/04/25/re-converging-control-flow-on-nvidia-gpus/ |
Was the link intended to be https://www.collabora.com/news-and-blog/blog/2024/04/25/re-converging-control-flow-on-nvidia-gpus/ ? |
The proposal talks a bit about non-full subgroups (when the subgroup size does not evenly divide the workgroup size):
The proposal should describe what happens to
Also, is there at most one non-full subgroup? I could conceive of sizes 16 8 8 8 for example. |
While browsing a colleague's WebGPU implementation of prefix sums, I found a reference to
Edit: @raphlinus points out that while this can be polyfilled on top of |
Should there be shader-creation-time (or pipeline-creation-time) checks on certain parameters when they are const (or override, respectively):
|
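For illustration, the kind of check being asked about might look like this (a sketch only; the exact bound on a valid const id is an assumption, not something this thread has settled):

```wgsl
// Hypothetical: because the broadcast id is a const-expression, an
// out-of-range value could be diagnosed at shader-creation time rather
// than producing undefined behaviour at runtime.
enable subgroups;

fn example(x: f32) -> f32 {
  let a = subgroupBroadcast(x, 7u);       // statically in range: ok
  // let b = subgroupBroadcast(x, 200u);  // statically out of range:
                                          // could be a creation-time error
  return a;
}
```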
* subgroups: subgroupBroadcast 'id' parameter is const-expression
  This avoids undefined behaviour if the implementation doesn't know to implement it as a shuffle. Intentional uses of non-const IDs are either:
  - shuffle
  - broadcast-first
  Issue: #4306
  Issue: crbug.com/360181411
* Update proposals/subgroups.md: fix footnotes (co-authored by alan-baker)
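The distinction drawn in that change can be sketched as follows (illustrative use of the proposal's built-ins; the function name `pick` is invented):

```wgsl
// Hypothetical: the three ways to read a value from another invocation,
// matching the intents named in the commit message.
enable subgroups;

fn pick(x: f32, dynamic_id: u32) -> f32 {
  // Const id: the source lane is known at shader-creation time.
  let from_lane0 = subgroupBroadcast(x, 0u);
  // Runtime id: the intentional spelling is a shuffle...
  let shuffled = subgroupShuffle(x, dynamic_id);
  // ...or a broadcast from the first active invocation.
  let from_first = subgroupBroadcastFirst(x);
  return from_lane0 + shuffled + from_first;
}
```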
Add subgroup (a.k.a. simd_group, wave, wavefront) operations. Favour portability.
There have been a number of earlier issues and PRs. None seemed quite right to restart the conversation, so I'm opening this new one.
Previous work:
Implementations
gfx-rs/wgpu#4428: Naga request.
Interactions with uniformity:
Benefits: Subgroup operations offer compelling performance benefits.
Drawbacks: There are theoretical reasons to doubt their portability. Earlier discussion included tiny demonstrations of nonportability.
Subgroups were postponed out of "v1" until we could devote more energy to investigating them in more detail. Now is the time.
@alan-baker has been leading an effort at Google to:
Let's use this issue to show the data, and then discuss how to shape the feature.