
add subgroups, and make them portable if possible #4306

Open
dneto0 opened this issue Sep 26, 2023 · 14 comments
Labels
wgsl WebGPU Shading Language Issues
Comments

@dneto0
Contributor

dneto0 commented Sep 26, 2023

Add subgroup (a.k.a. simd_group, wave, wavefront) operations. Favour portability.

There have been a number of earlier issues and PRs. None seemed quite right to restart the conversation, so I'm opening this new one.

Previous work:

Implementations
gfx-rs/wgpu#4428 — Naga implementation request.

Interactions with uniformity:


Benefits: Subgroup operations offer compelling performance benefits.

Drawbacks: There are theoretical reasons to doubt their portability. Earlier discussion included tiny demonstrations of nonportability.

Subgroups were postponed out of "v1" until we could devote more energy to investigating them in more detail. Now is the time.

@alan-baker has been leading an effort at Google to:

  • Implement an experimental subgroups extension supporting the "ballot" and "broadcast" operations only.
  • Write prototype conformance tests to check the portability behaviours we were concerned about. You only need "ballot" to do this because it tells you the effective active mask.
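To make "ballot tells you the effective active mask" concrete, here is a minimal WGSL sketch against the experimental extension (the enable directive name and the availability of these builtins are assumptions based on the experiment described later in this thread):

```wgsl
enable chromium_experimental_subgroups;  // assumed experimental enable directive

@compute @workgroup_size(64)
fn main(@builtin(subgroup_invocation_id) sid : u32) {
  if (sid % 2u == 0u) {
    // Every invocation the implementation kept active here contributes one
    // bit; the vec4<u32> result is the effective active mask (up to 128 lanes).
    let active : vec4<u32> = subgroupBallot(true);
    // Counting the set bits gives the number of invocations that actually
    // executed this branch together -- the quantity the portability tests check.
    let n = countOneBits(active.x) + countOneBits(active.y)
          + countOneBits(active.z) + countOneBits(active.w);
    _ = n;  // phony assignment; the real tests write this to a buffer
  }
}
```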

Let's use this issue to show the data, and then discuss how to shape the feature.

@dneto0 dneto0 added the wgsl WebGPU Shading Language Issues label Sep 26, 2023
@dneto0 dneto0 added this to the Milestone 2 milestone Sep 26, 2023
@magcius

magcius commented Oct 11, 2023

While I don't want to push too much R&D work into WebGPU, I do think that a good number of people who would want subgroups might be happier with operations that work across the full workgroup.

Related: microsoft/hlsl-specs#105

@JMS55

JMS55 commented Oct 11, 2023

I would personally love workgroup level reduction primitives, but yeah it would need more development than subgroups and I would be happy just to get subgroups.

Also I want to note that on some GPU architectures, the subgroup size depends on the final compiled shader and is up to the driver. I'm unsure whether polyfilling workgroup-level operations on top of subgroup ops would be possible on those platforms without cooperation from the driver. I don't have enough experience to say either way; I just want to raise it as a point.
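The concern above matters because the standard polyfill for a workgroup reduction needs the subgroup count at runtime. A hedged sketch (builtin names follow the subgroup proposals; the bindings and the assumed minimum subgroup size of 4 are illustrative only):

```wgsl
@group(0) @binding(0) var<storage, read> input : array<f32, 256>;
@group(0) @binding(1) var<storage, read_write> output : f32;

// Worst case: subgroup size 4 => 64 subgroups per 256-invocation workgroup.
var<workgroup> partials : array<f32, 64>;

@compute @workgroup_size(256)
fn reduce(@builtin(local_invocation_index) lid : u32,
          @builtin(subgroup_invocation_id) sid : u32,
          @builtin(subgroup_size) ssize : u32) {
  // Level 1: each subgroup reduces its values in registers.
  let partial = subgroupAdd(input[lid]);
  if (sid == 0u) {
    partials[lid / ssize] = partial;  // one slot per subgroup
  }
  workgroupBarrier();
  // Level 2: fold the per-subgroup partials. The loop bound depends on the
  // runtime subgroup size -- exactly the driver-chosen quantity noted above.
  if (lid == 0u) {
    var total = 0.0;
    for (var i = 0u; i < 256u / ssize; i++) {
      total += partials[i];
    }
    output = total;
  }
}
```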

@alan-baker
Contributor

Workgroup operations are out of scope for this issue, but feel free to file a new issue with that request so we can track it. I expect the answer will be not until it is exposed by underlying APIs.

@raphlinus

Regarding the comment by @JMS55: yes, having unpredictable subgroup size is one of the major portability challenges, and that is the primary point of #3950. That issue contains a concrete proposal which I hope will be considered carefully. The core of it is that minimum and maximum subgroup size are constants that can be used at compile time, for example to size arrays, and the actual subgroup size is available at runtime. That may sound pretty basic, but underlying APIs (with the exception of Vulkan 1.3) are hostile to reliably providing that information.

I also recommend prioritizing subgroup operations in uniform control flow. All of the use cases I care deeply about are effectively in uniform control flow. My intuition is that specifying the semantics of those use cases might be easier than the fully general case, as well as getting shader compilers to emit code consistent with that spec.
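To sketch the shape of #3950's proposal (all names here are hypothetical), the compile-time bounds size storage while the runtime value drives indexing:

```wgsl
// Hypothetical: the minimum subgroup size surfaced as a pipeline-overridable
// constant, per the #3950 proposal; the default value is illustrative.
override min_subgroup_size : u32 = 4u;

const WG_SIZE : u32 = 256u;

// Sized at pipeline-creation time: the smallest subgroup size gives the
// largest possible number of subgroups per workgroup.
var<workgroup> partials : array<f32, WG_SIZE / min_subgroup_size>;

@compute @workgroup_size(WG_SIZE)
fn main(@builtin(local_invocation_index) lid : u32,
        @builtin(subgroup_size) ssize : u32) {
  // The actual subgroup size is only known at runtime.
  partials[lid / ssize] = 0.0;
}
```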

@Lichtso

Lichtso commented Oct 20, 2023

gfx-rs/wgpu#4190 proposes a set of built-in functions and built-in values and implements them for DirectX, Metal, Vulkan and OpenGL. Feedback and ideas are welcome.

@alan-baker
Contributor

alan-baker commented Nov 6, 2023

Chrome implemented minimal experimental extensions to investigate how much portability exists for subgroup operations.

Extension

WGSL

The following additions were made to WGSL:

  • subgroupBallot - an unpredicated version of ballot
  • subgroup_size - built-in value for subgroup size access
  • subgroup_invocation_id - built-in value for invocation id within a subgroup

API

The following additions were made to the API:

  • Two new features:
    • chromium-experimental-subgroups: core subgroup functionality
    • chromium-experimental-subgroup-uniform-control-flow: mirrors SPV_KHR_subgroup_uniform_control_flow; this is not implementable on most platforms
  • New limits: minSubgroupSize and maxSubgroupSize
  • A pipeline creation flag to require full subgroups

Requirements

  • Vulkan: key requirement for the experiments is subgroup size control
  • Metal: simd scoped permute operations supported
  • D3D: SM 6.0

Experiments

Divergence/Reconvergence

gpuweb/cts#2916 implements reconvergence tests for subgroup operations. The tests are based on Vulkan CTS's experimental reconvergence tests (available here).

There are 4 styles of reconvergence tested:

  • wgsl_v1 - this is intended to match the WGSL V1 spec
  • workgroup - a slight extension to wgsl_v1 that requires loop iterations to reconverge
  • subgroup - extends workgroup to subgroup scope (a la SPV_KHR_subgroup_uniform_control_flow)
  • maximal - set of rules that are expected to mostly match developer intuition for the HLL. No spec implements these rules

The tests are a combination of predefined and pseudo-randomly generated cases that are swept across the various reconvergence styles. Each program is simulated, and the simulated results are compared against the actual GPU results. Ideally, all implementations should pass at least wgsl_v1, and hopefully workgroup and subgroup too. Maximal is more for investigation.

There is an additional set of tests (uniform_maximal) that check the behaviour when all branches are uniform (i.e. no divergence occurs in the workgroup). The expectation is that all implementations should pass these tests.

The tests can be run using dawn.node or chrome canary.

Results

We collected results from a variety of platforms and devices:

Predefined tests

| GPU | Driver | Platform | Impl | Num Tests | wgsl_v1 | workgroup | subgroup | maximal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Apple M1 Pro | 13.5.1 (22G90) | Metal | Chrome | 15 | 12 | 12 | 12 | 4 |
| Apple M1 Pro | 14.1 (23B74) | Metal | Dawn | 16 | 13 | 13 | 13 | 5 |
| Intel HD 630 | | Metal | Dawn | 15 | 15 | 15 | 15 | 12 |
| Intel HD TGL GT1 | Mesa 22.3.6 | Vulkan | Dawn | 16 | 16 | 16 | 16 | 16 |
| Intel HD TGL GT2 | Mesa 22.3.6 | Vulkan | Dawn | 16 | 16 | 16 | 16 | 16 |
| AMD Radeon Pro 560 | | Metal | Dawn | 15 | 14 | 14 | 14 | 12 |
| AMD Radeon Pro WX 3200 | Windows 10 (19045.3324) | D3D12 | | 15 | 14 | 14 | 14 | 14 |
| Pixel 6 Pro (Mali G78) | UPB5.230623.005 | Vulkan | Chrome | 15 | 15 | 15 | 15 | 12 |
| Pixel 3 (Adreno 630) | SP1A.210812.016.C2 | Vulkan | Chrome | 15 | 10 | 10 | 10 | 7 |
| Nvidia Quadro P1000¹ | Linux 525.125.6.384 | Vulkan | Dawn | 15 | 15 | 9 | 12 | 7 |

  1. The Nvidia device seemed to give non-deterministic results across multiple runs.

Random tests

| GPU | Driver | Platform | Impl | Num Tests | wgsl_v1 | workgroup | subgroup | maximal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Apple M1 Pro | 13.5.1 (22G90) | Metal | Chrome | 100 | 87 | 87 | 87 | 28 |
| Apple M1 Pro | 14.1 (23B74) | Metal | Dawn | 100 | 87 | 87 | 87 | 28 |
| Intel HD 630 | | Metal | Dawn | 100 | 100 | 100 | 100 | 79 |
| Intel HD TGL GT1¹ | Mesa 22.3.6 | Vulkan | Dawn | 100 | 98 | 98 | 98 | 96 |
| Intel HD TGL GT2¹ | Mesa 22.3.6 | Vulkan | Dawn | 100 | 100 | 99 | | |
| AMD Radeon Pro 560² | | Metal | Dawn | 100 | | | | |
| AMD Radeon Pro WX 3200 | Windows 10 (19045.3324) | D3D12 | | 100 | 36 | 36 | 34 | 62 |
| Pixel 6 Pro (Mali G78) | UPB5.230623.005 | Vulkan | Chrome | 100 | 100 | 100 | 100 | 81 |
| Pixel 3 (Adreno 630)³ | SP1A.210812.016.C2 | Vulkan | Chrome | 100 | | | | |
| Nvidia Quadro P1000 | Linux 525.125.6.384 | Vulkan | Dawn | 100 | 100 | 100 | 100 | 86 |

  1. All failures were timeouts.
  2. A compiler bug prevented testing. Many cases would run out of memory at around 80 GB.
  3. Driver crashes prevented testing.

Uniform tests

The PR also contains a set of pseudo-randomly generated tests that always select uniform branches (uniform_maximal set). These were added later so we haven't collected as much information from them. All devices should pass these tests.

| GPU | Driver | Platform | Impl | Num Tests | maximal |
| --- | --- | --- | --- | --- | --- |
| Apple M1 Pro¹ | 13.5.1 (22G90) | Metal | Chrome | 500 | 498 |
| Apple M1 Pro | 14.1 (23B74) | Metal | Dawn | 500 | 500 |
| Intel HD 630 | | Metal | Dawn | 500 | 500 |
| Intel HD TGL GT1 | Mesa 22.3.6 | Vulkan | Dawn | 500 | 500 |
| Intel HD TGL GT2 | Mesa 22.3.6 | Vulkan | Dawn | 500 | 500 |
| Pixel 6 Pro (Mali G78) | UPB5.230623.005 | Vulkan | Chrome | 500 | 500 |
| Nvidia Quadro P1000 | Linux 525.125.6.384 | Vulkan | Dawn | 500 | 500 |

  1. The two failures were functional incorrectness. Bugs were opened with Apple.

Subgroup Size

More testing is required to test that subgroup sizes are reliable, but early indications are that the requirements placed on Vulkan are sufficient. Metal also appears to be ok in this regard. D3D12 requires more testing.

The tests check that a ballot bit count matches the value from the subgroup size built-in value, but currently the PR does not check the newly added limits. I have an experimental patch (that requires IDL and Dawn changes) that verifies the Vulkan behaviour.

Further testing

I haven't been able to test the requires-full-subgroups pipeline flag yet. It has obvious implementations for Metal and Vulkan, but not D3D12. Most implementations seem to do the right thing here anyway, though.

Discussion

Behaviour is not portable. The failures for even wgsl_v1 reconvergence are problematic; they mean we cannot produce portable behaviour even by requiring that the built-in functions only be used in uniform control flow. We could specify that behaviour is portable only when the workgroup/subgroup does not diverge. That covers an important use case in terms of pure acceleration, but leaves large gaps in the overall portability of the feature.

@alan-baker
Contributor

Made a proposal in #4368 based on the previous work.

@kdashg
Contributor

kdashg commented Dec 6, 2023

WGSL 2023-12-05 Minutes
  • AB: The M1 part we want is setting a basic direction.
  • JB: Let’s talk about that in the second half of the meeting.

@munrocket
Contributor

munrocket commented May 8, 2024

I don't understand why we don't have an investigation of the barriers:

subgroupBarrier()
subgroupMemoryBarrier()
subgroupMemoryBarrierBuffer()
subgroupMemoryBarrierShared()
subgroupMemoryBarrierImage()

This topic was raised twice in the minutes:
https://github.com/gpuweb/gpuweb/wiki/WGSL-2020-04-21
https://github.com/gpuweb/gpuweb/wiki/WGSL-2022-06-21-Minutes

@dneto0
Contributor Author

dneto0 commented Jun 5, 2024

FYI: @gfxstrand blogged about what's needed to enforce "maximal reconvergence" on NV hardware. It's quite challenging, and it definitely doesn't happen by accident.

https://www.khronos.org/blog/khronos-releases-maximal-reconvergence-and-quad-control-extensions-for-vulkan-and-spir-v


edit: Whoops should have been https://www.collabora.com/news-and-blog/blog/2024/04/25/re-converging-control-flow-on-nvidia-gpus/

@raphlinus

Was the link intended to be https://www.collabora.com/news-and-blog/blog/2024/04/25/re-converging-control-flow-on-nvidia-gpus/ ?

@dneto0
Contributor Author

dneto0 commented Aug 8, 2024

The proposal talks a bit about non-full subgroups (when the subgroup size does not evenly divide the workgroup size):

TODO: Can we add a pipeline parameter to require full subgroups in compute shaders?

The proposal should describe what happens to subgroup_size in this non-full subgroup case:

  • Do all subgroups have the same subgroup_size, but the last one is non-full?
  • Or is subgroup_size adjusted down for the non-full one? In that case, does the shader author have to compute the actual number of invocations via a ballot, as here:

```wgsl
// Record the actual number of active invocations for this subgroup.
// Note: the subgroup_size builtin is always a power of 2 and might be larger
// than the active count if the subgroup is not full.
let ballot = subgroupBallot(true);
var size = countOneBits(ballot.x);
size += countOneBits(ballot.y);
size += countOneBits(ballot.z);
size += countOneBits(ballot.w);
```

Also, is there at most one non-full subgroup? I could conceive of sizes 16, 8, 8, 8, for example.

dneto0 added a commit to dneto0/gpuweb that referenced this issue Aug 19, 2024
This avoids undefined behaviour if the implementation doesn't know
to implement it as a shuffle.

Intentional uses of non-const IDs are either:
 - shuffle
 - broadcast-first

Issue: gpuweb#4306
Issue: crbug.com/360181411
@kenrussell
Member

kenrussell commented Aug 19, 2024

While browsing a colleague's WebGPU implementation of prefix sums, I found a reference to subgroupInclusiveAdd which isn't in this proposal at present. Could it be added?

Edit: @raphlinus points out that while this can be polyfilled on top of subgroupExclusiveAdd, the hardware's primitives are faster on platforms where they're available. Thus subgroupInclusiveAdd can be universally supported, while being as fast as possible on supported hardware and compute APIs.
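In WGSL terms the polyfill is a one-liner (a sketch; it assumes the proposal's subgroupExclusiveAdd and that subgroup builtins may be called from helper functions):

```wgsl
// Inclusive scan = exclusive scan plus the invocation's own value.
// A native subgroupInclusiveAdd, where hardware exposes one, avoids the
// extra add, which is why exposing both operations is attractive.
fn inclusiveAddPolyfill(x : f32) -> f32 {
  return subgroupExclusiveAdd(x) + x;
}
```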

dneto0 added a commit to dneto0/gpuweb that referenced this issue Aug 21, 2024
@dneto0
Contributor Author

dneto0 commented Aug 27, 2024

Should there be shader-creation-time (or pipeline-creation-time) checks on certain parameters when they are const (or override, respectively):

  • subgroupShuffleXor mask operand: if it's const it must be < 128 (which is the max subgroup size). Or worse IMHO, bounded above by the specific device's max subgroup size.
  • subgroup invocation id, and ID deltas (for shuffle up, shuffle down), are bounded above by spec-supplied max subgroup size 128. (Or worse IMHO max subgroup size on the target device)
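For context on why the mask is usually const: the classic use is a butterfly reduction, where every stage uses a compile-time XOR distance. A sketch (it assumes the proposal's subgroupShuffleXor and, purely for illustration, a subgroup size of 32):

```wgsl
// Each mask is a const power of two below 128, so a shader-creation-time
// bound check like the one proposed above would accept every stage.
fn butterflyAdd(x : f32) -> f32 {
  var v = x;
  v += subgroupShuffleXor(v, 16u);
  v += subgroupShuffleXor(v, 8u);
  v += subgroupShuffleXor(v, 4u);
  v += subgroupShuffleXor(v, 2u);
  v += subgroupShuffleXor(v, 1u);
  return v;
}
```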

dneto0 added a commit to dneto0/gpuweb that referenced this issue Aug 28, 2024
dneto0 added a commit that referenced this issue Aug 28, 2024
* subgroups: subgroupBroadcast 'id' parameter is const-expression

This avoids undefined behaviour if the implementation doesn't know
to implement it as a shuffle.

Intentional uses of non-const IDs are either:
 - shuffle
 - broadcast-first

Issue: #4306
Issue: crbug.com/360181411

* Update proposals/subgroups.md

fix footnotes

Co-authored-by: alan-baker <[email protected]>

* Update proposals/subgroups.md

fix footnotes

Co-authored-by: alan-baker <[email protected]>

---------

Co-authored-by: alan-baker <[email protected]>
9 participants