Clarify in spec that getMappedRange() ranges should be used to optimize map invalidations/copies #4805
Comments
I"ve created a live sample to illustrate the issue with mapping large buffers. The code is public and works on Chrome or Firefox Nightly. You can adjust the amount of data uploaded per frame, the buffer size, and the update mode:
While the third approach gives the best performance, it"s really just a brittle hack forced by current API limitations. Here are frame times on Chrome when uploading 20MB of data:
As you can see, mapping is faster, so long as we keep the GPU process aware of the written range. Unfortunately the range is defined by |
Thank you for the detailed investigation, it is extremely clear!
That"s definitely the intent and Chromium should implement that optimization at some point, we just didn"t get around to it yet. Could you detail why it would be hard to use |
That makes the problem more tractable. Thanks for opening the issue!
Because the caller of […]. Let's imagine a […]. Alternatively, a developer may be tempted to simply call […]
As a tangent, why does […]?
getMappedRange() is synchronous, allowing you to determine which ranges actually need to be mapped at the last moment. If you have some kind of streaming write, so you know the offset but not the size, you can getMappedRange() in blocks until you reach the end of the stream. (getMappedRange() can be called multiple times for ranges within a single mapAsync(), as long as they don't overlap.)
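To make the block-wise idea above concrete, here is a minimal TypeScript sketch; `BLOCK_SIZE`, `pullNextChunk`, and the surrounding function are illustrative assumptions, not code from this thread:

```ts
// Illustrative sketch: map a staging buffer once, then request only the
// blocks that actually get written via repeated, non-overlapping
// getMappedRange() calls. `pullNextChunk` is an assumed app-side source
// that yields Uint8Array chunks of at most BLOCK_SIZE bytes, or null at
// the end of the stream. Bounds checks are omitted for brevity.
const BLOCK_SIZE = 1 << 20; // 1 MiB, arbitrary

async function streamIntoStaging(
  staging: GPUBuffer, // created with GPUBufferUsage.MAP_WRITE | COPY_SRC
  pullNextChunk: () => Uint8Array | null,
): Promise<number> {
  // Map the whole buffer; at this point the written size is unknown.
  await staging.mapAsync(GPUMapMode.WRITE);

  let offset = 0;
  for (let chunk = pullNextChunk(); chunk !== null; chunk = pullNextChunk()) {
    // Only expose the block we are about to fill. Each range must stay
    // disjoint from the previously requested ones for this map operation.
    const block = new Uint8Array(staging.getMappedRange(offset, BLOCK_SIZE));
    block.set(chunk);
    offset += BLOCK_SIZE;
  }

  staging.unmap(); // an implementation only needs to flush the requested ranges
  return offset;   // number of bytes covered by the requested blocks
}
```

The point of the spec clarification discussed in this issue is exactly the final comment above: an implementation may limit map invalidations/copies to the union of the ranges requested via getMappedRange().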
I don"t think this is possible because it would require the browser to trust the webpage about which ranges it wrote. If the webpage writes a range and doesn"t dirty it, it would result in undefined behavior for whether the data got written, or not written, or partially written, etc. |
Thanks for the feedback.
I mentioned this earlier but didn't go into detail. Having to call […]
That's a fair point. If the goal of WebGPU is to have well-defined behavior, even though it's just stale data in a buffer after misusing the API, then […]
Given the mentioned constraints and the expected browser optimizations, I think it's fine to close this issue.
Could we add an implementation note in the spec proper? I think it's of interest to both API users and implementers.
Yes, let's do that. Thanks, Corentin, for marking this […]
subject was: "Allow specifying the written range when unmapping a buffer"
I noticed an issue while benchmarking CPU-to-GPU data streaming that can degrade performance of memory-mapped transfers, making them worse than `queue.writeBuffer()`. While there are partial workarounds, a minor API change seems necessary.

Problem
WebGPU supports buffer mapping to transfer data directly to or from the GPU. Applications can use a ring of staging buffers to continuously feed the GPU with new data each frame, as described in this article: […]. The staging buffers are created with `usage = MAP_WRITE | COPY_SRC` and `mappedAtCreation = true`, and each buffer is later reclaimed by calling `mapAsync()` with a callback (the pattern is sketched at the end of this section).

In this setup the buffer size is fixed, let's say 128MB. Assuming the app stays within that budget, the staging ring stabilizes at size 3, so the memory overhead is tolerable and this approach usually results in faster transfers. This is indeed the case when enough data is streamed to fill the staging buffer. I've measured gains of 10-40% vs. `writeBuffer()` in heavy workloads.

The problem arises when the amount of data streamed varies per frame. A stationary scene in a game may have little data to transfer, in which case `writeBuffer()` becomes faster, while the mapping version still incurs the cost of copying the entire staging buffer. Why?

- At `mapAsync()`, the app doesn't know how much data it will put in the staging buffer, so it requests the entire range.
- At `unmap()`, the GPU process has to copy the data to its destination, and it uses the range from `mapAsync()`.
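For reference, here is a condensed TypeScript sketch of the staging-ring pattern described above; the pool handling, sizes, and helper names (`acquireStaging`, `uploadFrame`) are illustrative, and error/bounds handling is omitted:

```ts
// Illustrative staging-buffer ring: write into a mapped staging buffer,
// copy it into the destination, then reclaim it with mapAsync().
const STAGING_SIZE = 128 * 1024 * 1024; // fixed-size staging buffers
const freeStaging: GPUBuffer[] = [];

function acquireStaging(device: GPUDevice): GPUBuffer {
  // Reuse a reclaimed (already mapped) buffer, or grow the ring.
  return (
    freeStaging.pop() ??
    device.createBuffer({
      size: STAGING_SIZE,
      usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
      mappedAtCreation: true,
    })
  );
}

// `data.byteLength` is assumed to be a multiple of 4 and <= STAGING_SIZE.
function uploadFrame(device: GPUDevice, dst: GPUBuffer, data: Uint8Array) {
  const staging = acquireStaging(device);

  // The entire buffer is mapped even if `data` fills only a small part of it.
  new Uint8Array(staging.getMappedRange()).set(data);
  staging.unmap(); // the full mapped range may be copied back here

  const encoder = device.createCommandEncoder();
  encoder.copyBufferToBuffer(staging, 0, dst, 0, data.byteLength);
  device.queue.submit([encoder.finish()]);

  // Reclaim the staging buffer for a later frame once the GPU is done with it.
  staging.mapAsync(GPUMapMode.WRITE).then(() => freeStaging.push(staging));
}
```

When `data.byteLength` is small and varies per frame, the `unmap()` call above is where the mapping path loses to `writeBuffer()`.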
Proposal

Could we trim this final copy to the actually written range? Since the amount of data isn't known at the time of `mapAsync()` or even `getMappedRange()`, I propose an API tweak like `unmap(optional dirtyOffset, optional dirtySize)`. Alternatively, a method like `addDirtyRange(offset, size)` could support multiple ranges.
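To illustrate the shape of the proposal only: the two-argument `unmap()` below is hypothetical and does not exist in WebGPU, so the sketch uses a stand-in interface rather than `GPUBuffer`.

```ts
// Hypothetical interface mirroring the proposed API tweak; NOT real WebGPU.
interface ProposedMappableBuffer {
  getMappedRange(offset?: number, size?: number): ArrayBuffer;
  unmap(dirtyOffset?: number, dirtySize?: number): void;
}

function flushFrame(staging: ProposedMappableBuffer, data: Uint8Array) {
  // Map and write as today: the whole buffer is mapped up front.
  new Uint8Array(staging.getMappedRange()).set(data);
  // Proposed: declare the written extent at unmap time, so the copy back
  // to the GPU process can be trimmed to just this range.
  staging.unmap(0, data.byteLength);
}
```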
Alternatives considered

- Use `getMappedRange(offset, size)`, as mentioned in the WebGPU Explainer. However, Chrome doesn't seem to perform this optimization, and more importantly, it's inconvenient for the app to precalculate the minimal range.
- Use `writeBuffer()` for small transfers, or use smaller staging buffers.
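As a rough sketch of the second alternative (the threshold is arbitrary, and `uploadFrame` refers to the staging-ring sketch in the Problem section above):

```ts
// Illustrative fallback: small uploads use writeBuffer(), large ones use the
// mapped staging path. SMALL_UPLOAD_LIMIT is an arbitrary tuning value.
const SMALL_UPLOAD_LIMIT = 256 * 1024; // 256 KiB

function upload(device: GPUDevice, dst: GPUBuffer, data: Uint8Array) {
  if (data.byteLength <= SMALL_UPLOAD_LIMIT) {
    // The browser copies exactly data.byteLength bytes.
    device.queue.writeBuffer(dst, 0, data);
  } else {
    uploadFrame(device, dst, data); // staging ring, see the earlier sketch
  }
}
```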