add subgroups, and make them portable if possible #4306
While I don't want to push too much R&D work into WebGPU, I do think that a good number of people who would want subgroups might be happier with operations that work across the full workgroup. Related: microsoft/hlsl-specs#105 |
I would personally love workgroup-level reduction primitives, but yeah, it would need more development than subgroups, and I would be happy just to get subgroups. Also, I want to note that on some GPU architectures, subgroup size depends on the final compiled shader and is up to the driver. I'm unsure whether polyfilling workgroup-level operations on top of subgroup ops would be possible on these platforms without cooperation from the driver. I lack the experience to say either way; I just want to bring it up as a point. |
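As a purely illustrative sketch of what such a polyfill could look like (the built-ins follow the subgroups proposal discussed in this thread; the buffer names, workgroup size, and the assumption that subgroups are packed linearly by local invocation index are all hypothetical, and the second step only works when the number of subgroups fits within one subgroup, which is exactly the size-dependence problem raised above):

```wgsl
// Hypothetical two-level workgroup sum built from subgroup operations.
enable subgroups;

@group(0) @binding(0) var<storage, read> input: array<f32>;
@group(0) @binding(1) var<storage, read_write> output: array<f32>;

// Sized for the worst case of one invocation per subgroup.
var<workgroup> partials: array<f32, 256>;

@compute @workgroup_size(256)
fn main(@builtin(local_invocation_index) lid: u32,
        @builtin(subgroup_invocation_id) sid: u32,
        @builtin(subgroup_size) ssize: u32) {
  // Step 1: each subgroup reduces its slice.
  // Assumes subgroups are packed linearly over local_invocation_index.
  let partial = subgroupAdd(input[lid]);
  if (sid == 0u) {
    partials[lid / ssize] = partial;
  }
  workgroupBarrier();

  // Step 2: the first subgroup reduces the per-subgroup partials.
  // Caveat: this only works when the number of subgroups (256 / ssize)
  // is <= ssize, which depends on the runtime subgroup size.
  let num_subgroups = 256u / ssize;
  if (lid < ssize) {
    let v = select(0.0, partials[lid], lid < num_subgroups);
    let total = subgroupAdd(v);
    if (lid == 0u) {
      output[0] = total;
    }
  }
}
```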
Workgroup operations are out of scope for this issue, but feel free to file a new issue with that request so we can track it. I expect the answer will be not until it is exposed by underlying APIs. |
Regarding the comment by @JMS55: yes, having unpredictable subgroup size is one of the major portability challenges, and that is the primary point of #3950. That issue contains a concrete proposal which I hope will be considered carefully. The core of it is that minimum and maximum subgroup size are constants that can be used at compile time, for example to size arrays, and the actual subgroup size is available at runtime. That may sound pretty basic, but underlying APIs (with the exception of Vulkan 1.3) are hostile to reliably providing that information. I also recommend prioritizing subgroup operations in uniform control flow. All of the use cases I care deeply about are effectively in uniform control flow. My intuition is that specifying the semantics of those use cases might be easier than the fully general case, as well as getting shader compilers to emit code consistent with that spec. |
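A minimal sketch of the #3950 idea described above (the name `min_subgroup_size` is illustrative, not shipped WGSL; the proposal envisions the bound as an API-provided constant, modeled here as a pipeline-overridable constant for concreteness):

```wgsl
// Hypothetical: a compile-time lower bound on subgroup size used to size
// a workgroup array, with the actual subgroup size read at runtime.
enable subgroups;

const workgroup_size = 256u;

// Illustrative stand-in for the proposal's minimum-subgroup-size constant.
override min_subgroup_size: u32;

// Worst-case number of subgroups is fixed at pipeline-creation time.
var<workgroup> partials: array<f32, workgroup_size / min_subgroup_size>;
```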
gfx-rs/wgpu#4190 proposes a set of built-in functions and built-in values and implements them for DirectX, Metal, Vulkan and OpenGL. Feedback and ideas are welcome. |
Chrome implemented minimal experimental extensions to investigate how much portability exists with subgroup operations.

Extension

WGSL

The following additions were made to WGSL:
API

The following additions were made to the API:
Requirements
Experiments

Divergence/Reconvergence

gpuweb/cts#2916 implements reconvergence tests for subgroup operations. The tests are based on the Vulkan CTS's experimental reconvergence tests (available here). There are 4 styles of reconvergence tested:
The tests are a combination of predefined and pseudo-randomly generated cases that are swept across the various reconvergence styles. The program is simulated, and those results are compared against the actual GPU results. Ideally, all implementations should pass at least wgsl_v1, but hopefully workgroup and subgroup too. Maximal is more for investigation. There is an additional set of tests (uniform_maximal) that checks the behaviour when all branches are uniform (i.e. no divergence occurs in the workgroup). The expectation is that all implementations should pass these tests. The tests can be run using dawn.node or Chrome Canary.

Results

We collected results from a variety of platforms and devices:

Predefined tests
Random tests
Uniform tests

The PR also contains a set of pseudo-randomly generated tests that always select uniform branches (the uniform_maximal set). These were added later, so we haven't collected as much information from them. All devices should pass these tests.
Subgroup Size

More testing is required to confirm that subgroup sizes are reliable, but early indications are that the requirements placed on Vulkan are sufficient. Metal also appears to be ok in this regard. D3D12 requires more testing. The tests check that a ballot bit count matches the value from the subgroup size built-in value, but the PR does not currently check the newly added limits. I have an experimental patch (that requires IDL and Dawn changes) that verifies the Vulkan behaviour.

Further testing

I haven't been able to test the requires-full-subgroups pipeline flag yet. It has obvious implementations for Metal and Vulkan, but not D3D12. Most implementations seem to do the right thing here anyway, though.

Discussion

Behaviour is not portable. The failures for even wgsl_v1 reconvergence are problematic. This means we cannot produce portable behaviour even by requiring that the built-in functions only be used in uniform control flow. We could specify that behaviour is portable if the workgroup/subgroup does not diverge. That is an important use case in terms of pure acceleration, but it leaves large gaps in the overall portability of the feature. |
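The ballot check described above can be sketched roughly as follows (illustrative only; the output buffer name and workgroup size are assumptions, and the check as written only holds for full subgroups):

```wgsl
// Hypothetical check: every invocation ballots true, so the popcount of
// the ballot should match the reported subgroup size built-in value.
enable subgroups;

@group(0) @binding(0) var<storage, read_write> ok: array<u32>;

@compute @workgroup_size(64)
fn main(@builtin(local_invocation_index) lid: u32,
        @builtin(subgroup_size) ssize: u32) {
  let b = subgroupBallot(true);
  let count = countOneBits(b.x) + countOneBits(b.y)
            + countOneBits(b.z) + countOneBits(b.w);
  // 1 if the ballot is consistent with the built-in value, else 0.
  ok[lid] = select(0u, 1u, count == ssize);
}
```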
Made a proposal in #4368 based on the previous work. |
WGSL 2023-12-05 Minutes
|
I don't understand why we don't have an investigation of the barriers:
subgroupBarrier()
subgroupMemoryBarrier()
subgroupMemoryBarrierBuffer()
subgroupMemoryBarrierShared()
subgroupMemoryBarrierImage()
This topic was raised twice in the minutes. |
FYI, @gfxstrand blogged about what's needed to enforce "maximal reconvergence" on NV hardware. It's quite challenging, and it definitely doesn't happen by accident. edit: Whoops, the link should have been https://www.collabora.com/news-and-blog/blog/2024/04/25/re-converging-control-flow-on-nvidia-gpus/ |
Was the link intended to be https://www.collabora.com/news-and-blog/blog/2024/04/25/re-converging-control-flow-on-nvidia-gpus/ ? |
The proposal talks a bit about non-full subgroups (when the subgroup size does not evenly divide the workgroup size):
The proposal should describe what happens to
Also, is there at most one non-full subgroup? I could conceive of sizes 16 8 8 8 for example. |
While browsing a colleague's WebGPU implementation of prefix sums, I found a reference to
Edit: @raphlinus points out that while this can be polyfilled on top of |
Should there be shader-creation-time (or pipeline-creation-time) checks on certain parameters when they are const (or override, respectively):
|
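For illustration, the kind of check being asked about might look like this (a sketch only; the exact bound on a valid const id is an assumption, not something this thread has settled):

```wgsl
// Hypothetical: because the broadcast id is a const-expression, an
// out-of-range value could be diagnosed at shader-creation time rather
// than producing undefined behaviour at runtime.
enable subgroups;

fn example(x: f32) -> f32 {
  let a = subgroupBroadcast(x, 7u);       // statically in range: ok
  // let b = subgroupBroadcast(x, 200u);  // statically out of range:
                                          // could be a creation-time error
  return a;
}
```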
* subgroups: subgroupBroadcast 'id' parameter is const-expression
  This avoids undefined behaviour if the implementation doesn't know to implement it as a shuffle. Intentional uses of non-const IDs are either:
  - shuffle
  - broadcast-first
  Issue: #4306
  Issue: crbug.com/360181411
* Update proposals/subgroups.md: fix footnotes (co-authored by alan-baker)
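The distinction drawn in that change can be sketched as follows (illustrative use of the proposal's built-ins; the function name `pick` is invented):

```wgsl
// Hypothetical: the three ways to read a value from another invocation,
// matching the intents named in the commit message.
enable subgroups;

fn pick(x: f32, dynamic_id: u32) -> f32 {
  // Const id: the source lane is known at shader-creation time.
  let from_lane0 = subgroupBroadcast(x, 0u);
  // Runtime id: the intentional spelling is a shuffle...
  let shuffled = subgroupShuffle(x, dynamic_id);
  // ...or a broadcast from the first active invocation.
  let from_first = subgroupBroadcastFirst(x);
  return from_lane0 + shuffled + from_first;
}
```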
Add subgroup (a.k.a. simd_group, wave, wavefront) operations. Favour portability.
There have been a number of earlier issues and PRs. None seemed quite right to restart the conversation, so I'm opening this new one.
Previous work:
Implementations
gfx-rs/wgpu#4428: Naga request.
Interactions with uniformity:
Benefits: Subgroup operations offer compelling performance benefits.
Drawbacks: There are theoretical reasons to doubt their portability. Earlier discussion included tiny demonstrations of nonportability.
Subgroups were postponed out of "v1" until we could devote more energy to investigating them in more detail. Now is the time.
@alan-baker has been leading an effort at Google to:
Let's use this issue to show the data, and then discuss how to shape the feature.