Parallelize batching #12489
base: main
Conversation
crates/bevy_pbr/src/prepass/mod.rs
Outdated
```rust
    batch_and_prepare_render_phase::<Opaque3dPrepass, MeshPipeline>,
    batch_and_prepare_render_phase::<AlphaMask3dPrepass, MeshPipeline>,
)
.after(allocate_batch_buffer::<MeshPipeline>),
```
This starts to look like we maybe want a batching plugin that is generic over render phase item and the base specialisation pipeline or something.
Agreed there. This is getting unwieldy to manage.
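For illustration, a minimal sketch of what such a generic batching plugin could look like. `BatchingPlugin` and its bounds are hypothetical names invented for this sketch, not types from this PR; the real systems (`clear_batch_buffer`, `reserve_batch_buffer`, `allocate_batch_buffer`, `batch_and_prepare_render_phase`) would be registered inside `build` for each (phase item, pipeline) pair.

```rust
// Hypothetical sketch only: `BatchingPlugin` does not exist in this PR.
// It illustrates how the clear/reserve/allocate/batch systems could be
// registered once, generically, instead of being listed per phase by hand.
use std::marker::PhantomData;

use bevy_app::{App, Plugin};

pub struct BatchingPlugin<Item, Pipeline>(PhantomData<(Item, Pipeline)>);

impl<Item, Pipeline> Default for BatchingPlugin<Item, Pipeline> {
    fn default() -> Self {
        Self(PhantomData)
    }
}

impl<Item, Pipeline> Plugin for BatchingPlugin<Item, Pipeline>
where
    // In practice these would be the render phase item type (e.g.
    // `Opaque3dPrepass`) and the pipeline providing the batch data
    // (e.g. `MeshPipeline`), with whatever trait bounds the systems need.
    Item: Send + Sync + 'static,
    Pipeline: Send + Sync + 'static,
{
    fn build(&self, _app: &mut App) {
        // Register clear_batch_buffer, reserve_batch_buffer,
        // allocate_batch_buffer, and batch_and_prepare_render_phase for this
        // (Item, Pipeline) pair here, with the ordering constraints from the
        // diff above, rather than repeating them at every call site.
    }
}
```

Adding the prepass phases would then be something like `app.add_plugins(BatchingPlugin::<Opaque3dPrepass, MeshPipeline>::default())`, rather than wiring each system tuple by hand.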
Today, when we upload data to a `StorageBuffer`, it must be copied twice: once from `StorageBuffer::value` to `StorageBuffer::scratch`, and once from `StorageBuffer::scratch` to the GPU. This patch eliminates the former copy in favor of storing the single canonical copy of the data directly inside the `scratch` field. This patch complements bevyengine#12489, which eliminates both copies for mesh data at the cost of requiring more overhead for GPU I/O, especially for single-threaded situations. Simply eliminating one of the copies seems an unambiguous win (and is lower-hanging fruit), so both this patch and PR bevyengine#12489 are valuable.
Co-Authored By: robert.swain@gmail.com
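For context, a rough sketch of the two-copy path that commit message describes. The struct below only mirrors the `value`/`scratch` fields named above; the bodies and trait bounds are simplified assumptions, not Bevy's actual `StorageBuffer` implementation.

```rust
// Simplified illustration of the double copy described above; not Bevy's code.
use wgpu::{Buffer, Queue};

struct StorageBuffer<T> {
    value: T,         // canonical CPU-side value
    scratch: Vec<u8>, // encoded bytes staged for upload
}

impl<T: AsRef<[u8]>> StorageBuffer<T> {
    fn write_buffer(&mut self, queue: &Queue, buffer: &Buffer) {
        // Copy 1: `value` -> `scratch`. Storing the single canonical copy of
        // the data directly in `scratch` makes this copy unnecessary.
        self.scratch.clear();
        self.scratch.extend_from_slice(self.value.as_ref());
        // Copy 2: `scratch` -> GPU. #12489 goes further and removes this copy
        // too for mesh data, via `Queue::write_buffer_with`.
        queue.write_buffer(buffer, 0, &self.scratch);
    }
}
```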
Objective
Improve CPU-side rendering performance.
Solution
Create `BufferPool` variants of `DynamicBuffer`, `StorageBuffer`, `BatchedUniformBuffer`, and `GpuArrayBuffer`. They do not have a system-RAM-side buffer of any kind, but rely on `Queue::write_buffer_with` to write values directly into a staging buffer. As `Queue::write_buffer_with` operates with a `&Queue`, it's possible to parallelize down to the view level when batching. The downside is that these buffers are not resizable after being mapped, so these types must reserve fixed-size slices from the buffer ahead of time. The data flow runs as follows:

- `clear_batch_buffer` clears the buffer pool.
- `reserve_batch_buffer` mutably grabs the buffer pool, reserves a range for every `RenderPhase<T>`, and saves the reservation in the `RenderPhase`. This is a very fast O(1) operation that requires no allocation or I/O of any kind. These reservations must run sequentially because they need mutable access to both the pool and the render phases.
- `allocate_batch_buffer` allocates the actual GPU-side buffer.
- `batch_and_prepare_render_phase` then runs on each phase in parallel and parallelizes individual views with `Query::par_iter_mut` (see the sketch below).
NOTE: I'm likely getting something wrong with the indices, which causes mesh draw calls to be mismatched and meshes to be rendered randomly; depending on the scene this can be seizure-inducing. Please be aware of this while testing this change.
Performance
Tested against `many_foxes`, this nets a rough 4% improvement in render schedule timings.

TODO: Test this against heavier scenes.
Changelog
TODO
Migration Guide
TODO