Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

GPU Instancing #89

Closed
cart opened this issue Aug 5, 2020 · 56 comments
Closed

GPU Instancing #89

cart opened this issue Aug 5, 2020 · 56 comments
Labels
A-Rendering Drawing game state to the screen C-Feature A new feature, making something new possible

Comments

@cart
Copy link
Member

cart commented Aug 5, 2020

The Bevy renderer should be able to instance entities that share a subset of the same properties.

@karroffel karroffel added C-Feature A new feature, making something new possible A-Rendering Drawing game state to the screen labels Aug 12, 2020
@chrisburnor
Copy link

What is the dependency relationship between this issue and #179 ? It seems like instancing would be a significant part of refactoring the rendering pipeline and ideally we'd want to do the work for that issue in a way that doesn't require a significant amount of refactoring.

Perhaps we could simply merge the two and say that instancing is a requirement of that rendering pipeline.

@cart
Copy link
Member Author

cart commented Dec 30, 2020

I don't see much of a need to couple them together. They will touch some of the same code, but in general I think they are separate features. I'm not convinced we need to block one on the other.

GPU instancing is a matter of grouping instanced entities together, writing values that are different to vertex attributes, and making a single "instanced" draw call.

Most of PBR is shader work, and anything that's more complicated than that (ex: shadow maps) won't have too much bearing on the instancing work.

@MDeiml
Copy link
Contributor

MDeiml commented Feb 5, 2021

I've been thinking about this a bit. Sorry in advance for the content drop.

Entities that could be instanced together would have to have some things in common:

  1. Same mesh
  2. Same Render Pipeline
  3. Same bind groups for slots which are not instanced

For the interface I would propose to add a InstancedRenderResourcesNode<T, Q> similar to RenderResourcesNode. Like RenderResourceNode it has a type parameter T: RenderResource, but it has another type parameter Q: WorldQuery. Only entities for which fetching Q returns the same value get instanced together. Q could for example be (Handle<Mesh>, Handle<Material> in a normal use case. For each value of Q there must be exactly one entity with a RenderPipelines component, into which the generated bindings can be inserted.

Introducing this query parameter is a bit ugly, but I don't think there is a way around it. Which entities can be batched together depends on the render graph / pipeline and could theoretically be detected automatically, but only during the DRAW stage, which is of course to late. The advantage is that the query parameter makes this approach very flexible and would cover almost any use cases.

The only difficulty in implementing this, I think, is managing the buffer memory space. As with the normal RenderResourceNode it would make sense to write all the values into the same buffer. The difference here is that values for the same draw call must lie continuously in the buffer, which gets complicated when new entities are added / removed.

I would like to try to implement this, but wanted to see if there is any feedback first.

@cart
Copy link
Member Author

cart commented Feb 6, 2021

I'm a little confused about the "single entity with RenderPipelines component" part. Where would the entity come from / how would we choose which one to use?

In general I like the idea, but I think the node should probably manage its own RenderResourceBindings and produce its own draw commands instead of storing that in a random entity.

Theres also the matter of using "dynamic uniform buffers" for each instance's data or "vertex attribute buffers". "vertex attribute buffers" might perform better, but they also have more limitations for data layout. If you were to google "gpu instancing example", in general you'd find vertex buffer implementations. We'd probably want to do a quick and dirty performance comparison of the two approaches before investing in one or the other.

@MDeiml
Copy link
Contributor

MDeiml commented Feb 6, 2021

I'm a little confused about the "single entity with RenderPipelines component" part. Where would the entity come from / how would we choose which one to use?

I don't really like this approach either, but I didn't see any other solution. The problem I'm trying to solve with this is that it would be wasteful to give each instanced entity it's own Draw and RenderPipelines component, as is the case with entities that have a MeshBundle. But the draw commands would have to be stored somewhere, and the only way to get the PassNode to render something is to store the draw commands in a Draw component of some entity, if I understand this correctly.

The other idea with this was to allow for really lightweight entities (e.g. particles), that wouldn't even need to have a Handle<Mesh> or Handle<Material>. But this information then would have to be stored somewhere else. It compromises ease of use for flexibility. I'm not really sure what's more important here.

In general I like the idea, but I think the node should probably manage its own RenderResourceBindings and produce its own draw commands instead of storing that in a random entity.

This would generally be nicer, but as I said, would mean changing the PassNode and storing the draw commands somewhere else than in the World.

Also it would probably mean that there could only be one InstancedRenderResourcesNode at once, which I notice now is probably also a limitation of the general approach.

Theres also the matter of using "dynamic uniform buffers" for each instance's data or "vertex attribute buffers". "vertex attribute buffers" might perform better, but they also have more limitations for data layout. If you were to google "gpu instancing example", in general you'd find vertex buffer implementations. We'd probably want to do a quick and dirty performance comparison of the two approaches before investing in one or the other.

I will try and test that. There is also "storage buffers", but I guess they are strictly worse than uniform buffers for this.

@MDeiml
Copy link
Contributor

MDeiml commented Feb 6, 2021

Uniform buffers seem to have a (minimum) size limit of 16384 (https://docs.rs/wgpu/0.7.0/wgpu/struct.Limits.html#structfield.max_uniform_buffer_binding_size) meaning about 256 instances if each instance needs a mat4. After some google searches I would guess that up to that limit they are faster than vertex buffers, but then vertex buffers take over.

@expenses
Copy link

expenses commented Apr 8, 2021

I don't use Bevy so I don't really have much at stake here, but I think this issue is a really important one to solve. Currently the 3D spawner example runs at about 1 frame per 3 seconds for me on my little laptop iGPU, with only about 15% gpu usage at maximum (according to intel_gpu_top). The rest of the time is probably spent in the driver copying buffers and recording draw calls.

@Skareeg
Copy link

Skareeg commented May 7, 2021

I just want to leave this link here, as it is related to the topic at hand, and it states that using the Uniform Buffers like that has performance implementations:
https://sotrh.github.io/learn-wgpu/beginner/tutorial7-instancing/

It is the draw_indexed function of WGPU. Not sure how that helps, as I am not clear on Bevy's rendering system quite yet.

@Pauan
Copy link

Pauan commented May 27, 2021

Instancing is a very complex subject. Babylon supports 3 different types of instancing:

  • Normal instances which are found in basically every game engine.
  • Thin instances are faster than normal instances, but they are not frustrum culled (so either all the instances are drawn or none of them are drawn).
  • Particle instances which renders a mesh on each particle. It supports multiple different meshes for a single particle system.

In addition, there are other useful types of instancing, such as GPU skeleton animation instancing. This allows you to have multiple instanced objects (with a skeleton) and each object is playing a different animation:

http://developer.download.nvidia.com/SDK/10/direct3d/Source/SkinnedInstancing/doc/SkinnedInstancingWhitePaper.pdf

https://forum.babylonjs.com/t/vertex-animation-textures/6325

https://forum.babylonjs.com/t/animations-and-performance-tips-and-tricks/20107/4

https://www.html5gamedevs.com/topic/32313-instancedmesh-with-separate-skeleton/?tab=comments#comment-185468

Of course it will take time to implement all of that, so it should be a long term goal, but it should be kept in mind when designing the system.

@MDeiml
Copy link
Contributor

MDeiml commented Oct 4, 2021

The first thing to settle should probably be how instancing should look from the user / ecs perspective. There are a few components we need to worry about:

  • Draw for storing draw calls (also Visible but I'm going to ignore it for now)
  • Resources that are instanced (e.g. Transform). These should be read by a InstancedRenderResourcesNode of some kind and stored in a uniform or vertex buffer
  • Resources that are not instanced (e.g. Material, Mesh). These have to be the same for every instance in the same draw call
  • Some kind of marker specifying which entities to instance together. Lets call it Instanced

Now as has come up here there are at least two different use cases that could be supported:

  1. Normal "automatic" instancing for situations where we need good performance, but not the best performance. Entities should probably look the same as normal entities with some added marker. So the typical entity would have components like Transform, Material, Mesh, Draw, Instanced, ... (here Instanced would maybe not have to contain any additional information, as entities could be automatically grouped by their material and mesh)
  2. "Manual" instancing, where performance is very critical for example in particle systems or situations where a very large number of instances is needed. For this entities should be kept very slim and not have any Draw component or have separate copies of the not instanced resources (Material, Mesh, ...). Instead this data should be stored in some "parent" entity. Here a typical entity could for example only have the components Instanced and Transform. Instanced here should contain some instance_id or reference to the parent. All the other information would be stored in the "parent" entity which would have the components Material, Mesh, Draw and some reference to the instance_id

I think it should be possible to cover both use cases with one implementation, though use case 1 is definitely more important in the short term. I'd be happy to hear what others think about this.

@jpryne
Copy link

jpryne commented Jan 27, 2022

In the 3D coding examples, we just add default plugins, and add some 3D geometry. Done. The crafty 3D programmer wants to plan draw calls, saving time by using instanced rendering.

So, perhaps we need an "Instanced" Component that can be implemented for a piece of 3D geometry? This Component needs a model, a shader, and one or more vertex buffers describing instance positions, colors or whatever else is referenced in the shader(s).

If the renderer gives us access to frustrum culling, we can easily add thin instances too.

Particle instancing shouldn't be hard. We just need a way to group the 3D models and refer to them as a group.

I will start by reading through the renderer code and seeing how we're setting up draw calls.

@rib
Copy link
Contributor

rib commented Mar 2, 2022

A bit of a fly-by comment, but I just wanted to bring up the possibility of supporting instanced stereo rendering for VR/AR, which is likely to interact with any general abstraction for supporting instances.

The main thing that I think is relevant here is that any abstraction Bevy introduces for rendering instances should internally reserve the ability to basically submit 2x the number of instances as requested so then shaders can use modulo arithmetic on the instanceId to lookup per-eye state, such as the per-eye projection matrix.

I'm not sure if Bevy already has any kind of macro system for shaders but for being able to write portable shaders that can work with or without instanced stereo rendering it would then be good to have a standard bevy macro for accessing the 'instance id' for applications (which might be divided by two when hiding the implementation details of the stereo rendering so as to not break how the shaders handle their per-instance data)

@expenses
Copy link

expenses commented Mar 3, 2022

@rib I just want to say that the ideal way to do stereo rendering for VR would be to use an api extension such as VK_KHR_muktiview for Vulkan: https://www.saschawillems.de/blog/2018/06/08/multiview-rendering-in-vulkan-using-vk_khr_multiview/

Anything else sounds like it would get messy, fast.

@rib
Copy link
Contributor

rib commented Mar 5, 2022

Thanks @expenses yeah good to bring up. As it happens I'm reasonably familiar with these extensions since I used to work on GPU drivers at Intel and even implemented the OpenGL OVR_multiview equivalent of this extension.

I suppose I tend to think the extensions themselves aren't that compelling since what they do can generally be done without the extension (they don't depend on special hardware features), but also I recall conversations within Khronos (years ago at this point, so things might have changed), where at least one other major vendor was intentionally avoiding implementing these extensions due to them being somewhat redundant (basically a utility over what apps can do themselves) so I guess I wouldn't be surprised if the extensions aren't available across all platforms.

I haven't really kept up with what different vendors support now though, so maybe the extensions really are available across the board now. A quick search seems to suggest the vulkan extension is pretty widely supported now (https://vulkan.gpuinfo.org/listdevicescoverage.php?extension=VK_KHR_multiview) but the story is still maybe a bit more ugly for GL.

I think it's perhaps still worth keeping in mind being able to support single pass stereo rendering through instancing without necessarily always having a multiview extension available. Being able to 2x the requested instance count for draw commands could potentially be pretty simple to support so I wouldn't necessarily assume it would be that messy to support if it's considered when thinking about how to support gpu instancing in bevy. Some of the requirements like framebuffer config details or viewid shader modifications are going to be pretty similar with or without any multiview extensions I'd guess - it's mainly the bit about doubling the instance counts that would be unique to a non-multiview path.

For reference, in unity they seem to support this ok, and the main caveat is with indirect drawing commands where the engine can't practically intercept the requested instance counts and so they document for those special cases that you have to 2x the instance count yourself.

@alice-i-cecile
Copy link
Member

@superdump, my impression is that this is now supported. Is that correct?

@cart
Copy link
Member Author

cart commented Apr 26, 2022

It is supported via our low / mid-level renderer apis, by nature of wgpu supporting it (and we have an example illustrating this). I don't think we should close this because what we really need is high level / automatic support for things like meshes drawn with StandardMaterial.

@superdump
Copy link
Contributor

Yeah. My gut is leading me to sorting out data preparation stuff first. So things around compressed textures and ways of managing data with uniform/storage/texture/vertex/instance buffers, including using textures for non-image data and things like texture atlases and memory paging to make blobs of memory suitable for loading stuff in/out without having massive holes. Then I imagine as that enables more things to be done, bindings (bind group layouts and bind groups) would become the focus to be able to batch things together. At least for me, fiddling with these things will increase my understanding and lead me toward figuring out how to do batching in a good and automatic way. This thread is useful for understanding use cases too, like VR stereo and the different ways that Babylon.js does instanced batching.

@MDeiml
Copy link
Contributor

MDeiml commented May 12, 2022

Unity has now moved away from GPU instancing and instead relies more on the "SRP Batcher" (https://docs.unity3d.com/Manual/GPUInstancing.html, https://docs.unity3d.com/Manual/SRPBatcher.html). The SRP Batcher basically reorders draw calls to reduce render state switches. It seems that they came to the conclusion, that (at least for Unity) draw call batching is more performant than instancing. Maybe Bevy should also go that route, seeing that it would also mean that shaders don't have to be set up for instancing.

From what I understand bevy at the moment orders all draw calls by distance. For transparent object that's necessary, but for opaque / alpha-masked objects the performance benefit of reducing overdraw by ordering draw calls should probably be less than optimizing for less state switches.

Now admittedly I'm not an expert in this, so maybe someone with more experience in graphics programming could give their opinion on this?

But I this shouldn't be too hard to implement since we already have code to e.g. collect all uniforms into one buffer.

EDIT: I think I was mistaken. In Unity the SRP Batcher doesn't even reorder anything. It just avoids state switches by remembering the current state, so an implementation would probably only mean minor changes in bevy_core_pipeline. Or is there something I'm missing?

@Pauan
Copy link

Pauan commented May 12, 2022

@MDeiml Note that Bevy uses WebGPU and Vulkan, so the cost of context switching is going to be very different compared to something like OpenGL. So any decisions should be based on benchmarking real Bevy apps, to make sure that we're not optimizing for the wrong thing.

@dyc3
Copy link

dyc3 commented May 18, 2022

Unity has now moved away from GPU instancing... It seems that they came to the conclusion, that (at least for Unity) draw call batching is more performant than instancing.

This doesn't really seem to be the case. https://forum.unity.com/threads/confused-about-performance-of-srp-batching-vs-gpu-instancing.949185/

According to this thread, GPU instancing should perform better than the SRP Batcher, but it is only applicable when all the instances share the same shader and mesh.

@rib
Copy link
Contributor

rib commented May 18, 2022

I was meaning to leave a comment about this too...

GPU instancing is a lower-level capability supported by hardware which makes it efficient to draw the same geometry N times with constrained material / transform changes made per-instance, considering that there is no back and fourth between the CPU and GPU for all of those instances.

It's not really a question of using instancing vs batching, they are both useful tools for different (but sometimes overlapping) problems.

If you happen to need to draw lots of the same mesh and the materials are compatible enough that you can practically describe the differences via per-instance state then then instancing is likely what you want ideally.

On the other hand if you have lots of smallish irregular primitives that are using compatible materials (or possibly larger primitives that you know are static and can be pre-processed) then there's a good chance it's worth manually batching them by essentially software transforming them into a single mesh, and sometimes transparently re-ordering how things are drawn for the sake of avoiding material changes. Batching can be done at varying levels of the stack with more or less knowledge about various constraints that might let it make fast assumptions e.g. for cheap early culling and brazen re-ordering that allows for more aggressive combining of geometry.

Unity's SRP batching is quite general purpose so it's probably somewhat constrained in how aggressive it can be without making a bad trade off in terms of how much energy is wasted trying to batch. On the other hand UI abstractions can often batch extremely aggressively.

Tiny quads, e.g. for a UI could be an example of an overlap where it might not always be immediately obvious whether to instance or batch. Quads are trivial to transform on the CPU and you can easily outstrip the per-drawcall overhead (especially with OpenGL) and it's potentially worth cpu transforming and re-ordering for optimal batching compared to submitting as instances where you'd also have to upload per-quad transforms.

@dyc3
Copy link

dyc3 commented May 20, 2022

Let's get back on track. I'm going to summarize the conversation so far just to make sure we are on the same page. Let me know if I missed anything and I'll update this comment so we can keep this conversation a little less cluttered.

The Conversation So Far

What is GPU Instancing?

GPU Instancing is a rendering optimization that allows users to render the same object lots of times in a single draw call. This avoids wasting time repeatedly sending the mesh and shader to the GPU for each instance. Each instance has parameters that changes how it's rendered (eg. position).

Current Status

We have successfully determined that GPU instancing is a worthwhile effort. We have also established that instancing is different from batching. GPU instancing is technically currently possible in Bevy, as shown in this example, but this is only possible through low level APIs. This example also requires disabling frustum culling, which doesn't seem ideal. This issue is about making GPU instancing more easily accessible to users.

In order to use instancing, the objects in question must share the same shader and mesh. The instances are provided instance data that contains data unique to that instance of the object (eg. position, rotation, scale).

This will take the form of 2 use cases, both of which seem reasonably feasible and should be easy enough to cover in a single implementation:

  1. Automatic instancing, where Bevy just does it automatically.
  2. Instancing with custom user defined parameters. The user has a custom shader that can take custom parameters, and each instanced entity has a component to provide the custom parameters to provide to the shader.

The VK_KHR_multiview vulkan extension and the OVR_multiview OpenGL extension should adequately handle instanced objects for VR applications, but it could be possible for these extensions to not be available. @rib suggested being able to submit 2x the requested instance amount as a workround when multiview is not available.

What we need to decide

  • The user facing API. So far, it's a little unclear what the user facing API will look like.
  • High level implementation details, see this comment and this comment.

How Other People Do It

There are plenty of other engines that implement instancing. These may be useful to reference when we are designing the user facing API.

Meshes that have a low number of vertices can’t be processed efficiently using GPU instancing because the GPU can’t distribute the work in a way that fully uses the GPU’s resources. This processing inefficiency can have a detrimental effect on performance. The threshold at which inefficiencies begin depends on the GPU, but as a general rule, don’t use GPU instancing for meshes that have fewer than 256 vertices.
If you want to render a mesh with a low number of vertices many times, best practice is to create a single buffer that contains all the mesh information and use that to draw the meshes.

@rib
Copy link
Contributor

rib commented May 20, 2022

I think a particular detail that's worth highlighting under 'How Other People Do It', looking at Unity is that they provide a macro for accessing the instance ID in shaders that should be used to ensure the engine has the wiggle room it needs to be able to change the number of instances submitted in a way that's transparent to the application.

Ref: https://github.com/TwoTailsGames/Unity-Built-in-Shaders/blob/master/CGIncludes/UnityInstancing.cginc

I'm not familiar yet with whether Bevy has any kind of similar macro system for shaders, but something comparable could make sense.

@micahscopes
Copy link

micahscopes commented May 20, 2022

Just wanna share a usecase that I'm interested in and thinking a lot about: emulated dynamic tessellation.

I'm working on implementing some special parametric surface patches that can fluctuate between being extremely small and extremely large very quickly. When they're small I want to render them as single cells but when they're large I want to compute lots of interpolated detail. So I made an atlas of pre-computed tesselations of unit tris/quads that blend between arbitrary levels of detail on each side. The tessellation geometry then gets instance drawn for each surface patch using uniforms representing the corners and interpolation parameters of each patch. These uniforms and the LOD levels get pre-computed in a compute pass with transform feedback.

Initial prototypes using WebGL2 with a single level of detail have shown surprisingly good performance for rendering a lot of instances of a single high LOD tessellation mesh. Dynamic LOD rendering is a little trickier and still a work in progress but the idea is similar. The plan is to combine all of the tessellation levels into a single mesh and then to make use of the WEBGL_multi_draw_instanced_base_vertex_base_instance draft extension for WebGL2. This will allow rendering multiple arbitrary sections of the instance array (uniforms for one or more patches) over multiple arbitrary sections of the mesh (various LOD tessellations) all using a single draw call. Coming up with the draw call parameters will be a little tricky. For WebGL2 this needs to happen on the CPU since there's no indirect rendering, but I have a scheme in mind to make it quick by precomputing an index of neighboring patches.

As for WebGPU, there's not multidraw support yet but this will come eventually. In the meantime WebGPU already supports instance drawing starting from an arbitrary firstInstance of instance array.

My wish now is for a glTF/PBR renderer that could draw like this but in a pluggable way. Aside from rendering parametric/generative surfaces (e.g. terrain, Bezier surfaces), this could also be used for displacement mapping.

@superdump
Copy link
Contributor

For the general discussion: I have been thinking about this and playing around with instancing and batching in a separate repo. I would say:

GPU instancing is specifically drawing the same mesh from a vertex buffer, optional index buffer, and instance buffer (vertex buffer that is stepped at instance rate) by passing the range of instances to draw to the instanced draw command.

But, as noted in the Unity documentation, GPU instancing is inefficient for drawing many instances of a mesh with few vertices as GPUs spawn work across 32/64/etc threads at a time and if they can’t due to there only being 4 vertices to process for a quad for example then the rest of the threads in a ‘warp’ or ‘wavefront’ are left idle which is called having low occupancy and leaves performance on the table.

As such, I think it is very important to consider other ways of instancing and also consider batching. So I should define what I understand those terms to mean.

General instancing is using the tools available to draw many instances of a mesh, not necessarily by passing the range of instances to be drawn to a draw command.

Batching is using the tools available to merge multiple draw commands into fewer draw commands. It was noted already that merging draw calls for APIs like OpenGL is much more significant a benefit than doing the same for modern APIs, but there is still benefit to be had.

Also of consideration here is that generally speaking if a data binding that is used in a bind group has to change between two things being drawn, then it requires two separate draw commands to be able to rebind that thing in between. So batching is a lot about finding ways to avoid having to rebind data bindings and instead looking up the data based on the available indices.

I’ve been fiddling and learning and thinking a lot about all of the constraints and flexibilities provided by the tools (as in the wgpu APIs as a proxy to the other graphics APIs) and various ideas have been forming.

bevy_sprites instances quads by writing all instance data as quad vertex attributes. So if you have a flat sprite colour for example, that would be per vertex, not per instance. The downside of this is lots of duplicated data. Also for the vertex positions as each of the four vertices (or maybe six if there is no index buffer? I don’t remember) have to have positions and uvs. The upside is complete flexibility of those positions so that they can be absolutely transformed by a global transform.

In my bevy-vertex-pulling repository I have implemented two commonly-requested things: drawing as many quads as possible and drawing as many cubes as possible. Using vertex pulling and specially-crafted index buffers, the instances of quads or cubes can be drawn without a vertex buffer, using only per-instance data for the position and half extents. The vertex index is used to calculate both the instance index and the vertex index within the quad/cube. The cube approach is also a bit special because it only draws the three visible faces of the cube. They also output uvw coordinates and normals as necessary. At some point I would like to try using this approach for bevy_sprites, but that is already quite fast so it doesn’t feel like the highest priority, plus it depends on what transformations need to be able to be made on sprites. Translation and scale are supported and rotation could be added but also supporting shears would require a matrix per instance I guess and maybe that ends up not being worth it vs explicit vertices for quads, it would depend.

Drawing many arbitrary meshes with non-trivial shapes, so models with more than 256 vertices, are perhaps well-suited to using GPU instancing as they are also likely to use the same materials perhaps.

The bevy-vertex-pulling experiments are not done yet. I want to try out some more things to understand when different things help performance-wise. For example, bevy_sprites doesn’t use a depth buffer, so for opaque sprites it relies on draw order to place things on top of each other. That also means the same screen fragment is shaded multiple times which means that there are multiple sprite texture data fetches per screen fragment. Even if that doesn’t practically matter on high-powered devices, it could well matter on mobile where bandwidth is much more constrained. This is repeated shading of the same fragment generally called overdraw. To avoid overdraw you can do things like occlusion culling to just not draw things that are occluded by other opaque things in front of them, or use a depth buffer which will do this as part of rasterisation, only shading fragments in front of other fragments. And then you sort opaque meshes front to back to capitalise on the early-z testing that is done as part of rasterisation in order to skip shading occluded fragments. This only applies to opaque however. But it does raise the sorting aspect.

Batching involves lots of sorting in order to group things that can be drawn together. And sorting is also needed to capitalise on reducing overdraw to avoid repeated fragment shading costs and texture bandwidth and so on. Sorting many items can be expensive time-wise. I have done experiments with radix sorting and parallel sorting elsewhere for bevy_sprites where we sort twice, both in the queue stage before ‘pre-batching’ and then again in the sort phase after mesh2d and other custom things may have been queued to render phases that would then require splitting batches of quads. As such, bevy_sprites currently queues each sprite as an individual phase item with additional batch metadata and which means the sort phase has to sort every sprite again, and the batch phase merges the phase items into batches as much as it can, recognising that if an incompatible phase item falls within a batch, then those batch items cannot all be merged.

Now, yet another aspect is how instance data is stored. The options available are vertex/instance buffers (supported everywhere but cannot be arbitrarily indexed from within the shader so only works if you actually want to draw many instances of the same mesh and the mesh has a good amount of vertices), uniform buffers (broadly supported but limited size to 16kB minimum and only fixed-size arrays), storage buffers (variable-size arrays but only one per binding, and much larger sizes, not supported on WebGL2), data textures (broad support, large amounts of data, requires custom data packing/unpacking so will be unergonomic to use). For bevy-vertex-pulling I have used storage buffers as they are simple, flexible, and perform well. Long-term, they’re great. But given WebGL2 support is desired, we will have to support using one of the others. Perhaps just using more but smaller batches with uniform buffers would be sufficiently good.

To me, GPU instancing is a pretty small aspect of how to handle reducing draw commands and efficiently drawing lots of things. It’s a bit too constrained. Instead I suspect other, more flexible batching methods are more generally useful.

Ultimately the end goal is to have one draw command to draw everything in view. If we look again at the data bindings used for rasterisation, we have (possibly) vertex buffer, index buffer, uniform buffer, storage buffer, texture view, and/or sampler bindings.

So far I mostly referred to putting per-instance mesh and material data into uniform/storage/data texture buffers, but if you have separate vertex buffers for your meshes, you will still have to rebind per mesh. You can merge all your mesh data into one big vertex buffer by handling generic non-trivial mesh instances as meshlets - break them up into groups of primitives such as (I saw this suggested somewhere) 45 vertices to represent 15 triangles in a triangle list. And if the mesh doesn’t fill that many, then you pad with degenerate triangles. Each meshlet has corresponding meshlet instance data. This way you can pack all vertex data into one big buffer and never have to rebind it.

That leaves texture bindings. Unfortunately bindless texture arrays are still fairly new and not incredibly broadly supported outside of desktop. But with those, we can have arrays of arbitrary textures, bind the array, and store indices into textures in material data. And then we’re almost done. Otherwise, we could enforce that all our textures have the same x,y size and put them into an array texture and store the layer index in material data. Or use a texture atlas either with virtual texturing at runtime which would add a lot of complexity I expect, or offline as part of the asset pipeline. Those options are in increasing order of breadth of support, though 2d array textures are practically supported everywhere it seems, and I guess decreasing in ergonomics / simplicity.

One more stick in the mud is transparency. Currently we have an order-dependent transparency method which requires sorting meshes back to front for correct blending using the over operator. If we had an order-independent method such as weighted-blended order-independent transparency, then we wouldn’t have to sort the transparent phase.

My understanding is that then once we have a fully bindless setup, we can move to GPU-driven draw command generation by using indirect draw commands as we can write all the indices for materials and meshes and such into storage buffers in compute shaders, as well as the draw commands themselves. This provides an enormous performance boost where supported. With WebGPU we should have compute and at least some indirect draw support (single indirect, but no bindless texture arrays yet, I think) but for native desktop it would probably be the practical default basically everywhere?

As the ultimate goal is performance, I think we need to consider the journey that we are on, what the parameters, flexibilities, and constraints are, and then figure out what steps to take. I think this is necessary because I think we can put together a flexible and useful solution that supports different approaches depending on platform support and user needs. I’m getting there in my learnings as you can see from the above but I’m not quite there yet. My primary next steps are to experiment with the impact of using a depth buffer on overdraw for simple millions of quads with low fragment shader cost (pure data and trivial maths) both with and without sorting front to back for opaque, then with texture fetches (so like sprites), and then try out the single vertex buffer approach.

@micahscopes could you share your code?

@gents83
Copy link

gents83 commented Feb 22, 2023

I asked cwfitzgerald what they learned from trying the meshlet index buffer approach and they said:

I decided not to go with meshlets because it is a very intrusive data model. Because each meshlet needs a copy of its vertices, you need to manage duplicating and allocating the vertices. Now imagine a use case where meshes are generated on the gpu by a compute shader or such (think oceans or whatever). It's going to be very difficult molding the output into meshlets at all, let alone efficiently.

@superdump if you are still reflecting on this I've played a bit with gpu indirect rendering, gpu culling, meshes and meshlets buffers, visibility buffer and now I'm playing with compute shaders raytracing (and soon wavefront pathtracing hopefully) in wgpu in my prototype engine:
https://github.com/gents83/INOX

@Shfty
Copy link
Contributor

Shfty commented Feb 22, 2023

I've updated my instancing crate with 0.9 compatibility, as well as clarifying that it's licensed under the same terms as bevy.

The examples still run, so evidently the render machinery hasn't changed enough to break it. Status-wise it's as it was; a working proof of concept, but in need of a WebGL compatibility pass, a refactor to match present bevy_render / bevy_pbr idioms, and some optimization to prevent its systems from wasting perf iterating over non-instancing entities.

@igorhoogerwoord Thanks for the interest, I probably could use a hand if the offer still stands. My attention ended up shifting toward engineering the art that it'll ultimately end up rendering (and an associated rust-gpu rabbit hole 😅), but it's still relevant to my project, so would be good to get into a state that can either be PRed in or released as an extension crate.

@gilescope
Copy link
Contributor

I've stopped using bevy so that I could do instancing (my scenes are mostly cubes), and instancing makes a huge difference - an order of magnitude difference for frame rate in my case as I can use a handful of indexed instanced draw calls. Instancing must be getting high up in the priority list by now given the huge impact it can have?

@superdump
Copy link
Contributor

superdump commented Mar 12, 2023

@gilescope yes. I'm intending to start making incremental PRs toward it now. It will take some time.

@superdump
Copy link
Contributor

superdump commented Jul 21, 2023

  • Reorder render sets ( Reorder render sets #8062 Reorder render sets, refactor bevy_sprite to take advantage #9236 ) from extract, prepare, queue, sort+batch, render, to extract, prepare assets, queue, sort, prepare+batch, render to allow preparation of dynamic per-object, per-frame data to be based on sort order to enable batching, and because of how sprite batching relies on the old render set order, rewrite it to work with the new one.
    • Implement the batching foundation for 2D and 3D meshes ( Automatic batching/instancing of draw commands #9685 ) (split batches when anything used for drawing changes between two items, e.g. pipeline id, draw function id [as it defines how things are bound...], bind group ids, dynamic offsets, index and vertex buffer ids [and offsets?])
      • If the mesh buffers and instance/vertex offset + count are the same, use instancing instead.
  • GPU array buffer: Add GpuArrayBuffer and BatchedUniformBuffer #8204 - as in putting struct data into arrays in buffers and binding them, then using indices to index into the arrays so that bind group rebinding can be avoided between different objects.
  • Move mesh data into single large index/vertex buffers per vertex layout (or vertex attribute set and always lay out in the same way for a given set - same thing?)
    • Bonus points for separating out position data into its own vertex buffer as this should give better performance in prepasses and I think it is also part of some optimisations on mobile where position data is used to split shading into tiles.
  • Use bindless texture arrays for materials
  • Use indirect draws
  • Try vertex pulling with indices encoded as 8-bit object id within a batch + 24-bit vertex index
    • Use a compute shader for rewriting the index buffer - maybe necessary for performance
  • At some point try using mesh 2d for sprites instead of custom sprite batch

github-merge-queue bot pushed a commit that referenced this issue Jul 21, 2023
# Objective

- Add a type for uploading a Rust `Vec<T>` to a GPU `array<T>`.
- Makes progress towards #89.

## Solution

- Port @superdump's `BatchedUniformBuffer` to bevy main, as a fallback
for WebGL2, which doesn't support storage buffers.
- Rather than getting an `array<T>` in a shader, you get an `array<T,
N>`, and have to rebind every N elements via dynamic offsets.
- Add `GpuArrayBuffer` to abstract over
`StorageBuffer<Vec<T>>`/`BatchedUniformBuffer`.

## Future Work
Add a shader macro kinda thing to abstract over the following
automatically:
#8204 (review)

---

## Changelog
* Added `GpuArrayBuffer`, `GpuComponentArrayBufferPlugin`,
`GpuArrayBufferable`, and `GpuArrayBufferIndex` types.
* Added `DynamicUniformBuffer::new_with_alignment()`.

---------

Co-authored-by: Robert Swain <robert.swain@gmail.com>
Co-authored-by: François <mockersf@gmail.com>
Co-authored-by: Teodor Tanasoaia <28601907+teoxoy@users.noreply.github.com>
Co-authored-by: IceSentry <IceSentry@users.noreply.github.com>
Co-authored-by: Vincent <9408210+konsolas@users.noreply.github.com>
Co-authored-by: robtfm <50659922+robtfm@users.noreply.github.com>
github-merge-queue bot pushed a commit that referenced this issue Aug 27, 2023
This is a continuation of this PR: #8062 

# Objective

- Reorder render schedule sets to allow data preparation when phase item
order is known to support improved batching
- Part of the batching/instancing etc plan from here:
#89 (comment)
- The original idea came from @inodentry and proved to be a good one.
Thanks!
- Refactor `bevy_sprite` and `bevy_ui` to take advantage of the new
ordering

## Solution
- Move `Prepare` and `PrepareFlush` after `PhaseSortFlush` 
- Add a `PrepareAssets` set that runs in parallel with other systems and
sets in the render schedule.
  - Put prepare_assets systems in the `PrepareAssets` set
- If explicit dependencies are needed on Mesh or Material RenderAssets
then depend on the appropriate system.
- Add `ManageViews` and `ManageViewsFlush` sets between
`ExtractCommands` and Queue
- Move `queue_mesh*_bind_group` to the Prepare stage
  - Rename them to `prepare_`
- Put systems that prepare resources (buffers, textures, etc.) into a
`PrepareResources` set inside `Prepare`
- Put the `prepare_..._bind_group` systems into a `PrepareBindGroup` set
after `PrepareResources`
- Move `prepare_lights` to the `ManageViews` set
  - `prepare_lights` creates views and this must happen before `Queue`
  - This system needs refactoring to stop handling all responsibilities
- Gather lights, sort, and create shadow map views. Store sorted light
entities in a resource

- Remove `BatchedPhaseItem`
- Replace `batch_range` with `batch_size` representing how many items to
skip after rendering the item or to skip the item entirely if
`batch_size` is 0.
- `queue_sprites` has been split into `queue_sprites` for queueing phase
items and `prepare_sprites` for batching after the `PhaseSort`
  - `PhaseItem`s are still inserted in `queue_sprites`
- After sorting adjacent compatible sprite phase items are accumulated
into `SpriteBatch` components on the first entity of each batch,
containing a range of vertex indices. The associated `PhaseItem`'s
`batch_size` is updated appropriately.
- `SpriteBatch` items are then drawn skipping over the other items in
the batch based on the value in `batch_size`
- A very similar refactor was performed on `bevy_ui`
---

## Changelog

Changed:
- Reordered and reworked render app schedule sets. The main change is
that data is extracted, queued, sorted, and then prepared when the order
of data is known.
- Refactor `bevy_sprite` and `bevy_ui` to take advantage of the
reordering.

## Migration Guide
- Assets such as materials and meshes should now be created in
`PrepareAssets` e.g. `prepare_assets<Mesh>`
- Queueing entities to `RenderPhase`s continues to be done in `Queue`
e.g. `queue_sprites`
- Preparing resources (textures, buffers, etc.) should now be done in
`PrepareResources`, e.g. `prepare_prepass_textures`,
`prepare_mesh_uniforms`
- Prepare bind groups should now be done in `PrepareBindGroups` e.g.
`prepare_mesh_bind_group`
- Any batching or instancing can now be done in `Prepare` where the
order of the phase items is known e.g. `prepare_sprites`

 
## Next Steps
- Introduce some generic mechanism to ensure items that can be batched
are grouped in the phase item order, currently you could easily have
`[sprite at z 0, mesh at z 0, sprite at z 0]` preventing batching.
 - Investigate improved orderings for building the MeshUniform buffer
 - Implementing batching across the rest of bevy

---------

Co-authored-by: Robert Swain <robert.swain@gmail.com>
Co-authored-by: robtfm <50659922+robtfm@users.noreply.github.com>
github-merge-queue bot pushed a commit that referenced this issue Sep 21, 2023
# Objective

- Implement the foundations of automatic batching/instancing of draw
commands as the next step from #89
- NOTE: More performance improvements will come when more data is
managed and bound in ways that do not require rebinding such as mesh,
material, and texture data.

## Solution

- The core idea for batching of draw commands is to check whether any of
the information that has to be passed when encoding a draw command
changes between two things that are being drawn according to the sorted
render phase order. These should be things like the pipeline, bind
groups and their dynamic offsets, index/vertex buffers, and so on.
  - The following assumptions have been made:
- Only entities with prepared assets (pipelines, materials, meshes) are
queued to phases
- View bindings are constant across a phase for a given draw function as
phases are per-view
- `batch_and_prepare_render_phase` is the only system that performs this
batching and has sole responsibility for preparing the per-object data.
As such the mesh binding and dynamic offsets are assumed to only vary as
a result of the `batch_and_prepare_render_phase` system, e.g. due to
having to split data across separate uniform bindings within the same
buffer due to the maximum uniform buffer binding size.
- Implement `GpuArrayBuffer` for `Mesh2dUniform` to store Mesh2dUniform
in arrays in GPU buffers rather than each one being at a dynamic offset
in a uniform buffer. This is the same optimisation that was made for 3D
not long ago.
- Change batch size for a range in `PhaseItem`, adding API for getting
or mutating the range. This is more flexible than a size as the length
of the range can be used in place of the size, but the start and end can
be otherwise whatever is needed.
- Add an optional mesh bind group dynamic offset to `PhaseItem`. This
avoids having to do a massive table move just to insert
`GpuArrayBufferIndex` components.

## Benchmarks

All tests have been run on an M1 Max on AC power. `bevymark` and
`many_cubes` were modified to use 1920x1080 with a scale factor of 1. I
run a script that runs a separate Tracy capture process, and then runs
the bevy example with `--features bevy_ci_testing,trace_tracy` and
`CI_TESTING_CONFIG=../benchmark.ron` with the contents of
`../benchmark.ron`:
```rust
(
    exit_after: Some(1500)
)
```
...in order to run each test for 1500 frames.

The recent changes to `many_cubes` and `bevymark` added reproducible
random number generation so that with the same settings, the same rng
will occur. They also added benchmark modes that use a fixed delta time
for animations. Combined this means that the same frames should be
rendered both on main and on the branch.

The graphs compare main (yellow) to this PR (red).

### 3D Mesh `many_cubes --benchmark`

<img width="1411" alt="Screenshot 2023-09-03 at 23 42 10"
src="https://github.com/bevyengine/bevy/assets/302146/2088716a-c918-486c-8129-090b26fd2bc4">
The mesh and material are the same for all instances. This is basically
the best case for the initial batching implementation as it results in 1
draw for the ~11.7k visible meshes. It gives a ~30% reduction in median
frame time.

The 1000th frame is identical using the flip tool:

![flip many_cubes-main-mesh3d many_cubes-batching-mesh3d 67ppd
ldr](https://github.com/bevyengine/bevy/assets/302146/2511f37a-6df8-481a-932f-706ca4de7643)

```
     Mean: 0.000000
     Weighted median: 0.000000
     1st weighted quartile: 0.000000
     3rd weighted quartile: 0.000000
     Min: 0.000000
     Max: 0.000000
     Evaluation time: 0.4615 seconds
```

### 3D Mesh `many_cubes --benchmark --material-texture-count 10`

<img width="1404" alt="Screenshot 2023-09-03 at 23 45 18"
src="https://github.com/bevyengine/bevy/assets/302146/5ee9c447-5bd2-45c6-9706-ac5ff8916daf">
This run uses 10 different materials by varying their textures. The
materials are randomly selected, and there is no sorting by material
bind group for opaque 3D so any batching is 'random'. The PR produces a
~5% reduction in median frame time. If we were to sort the opaque phase
by the material bind group, then this should be a lot faster. This
produces about 10.5k draws for the 11.7k visible entities. This makes
sense as randomly selecting from 10 materials gives a chance that two
adjacent entities randomly select the same material and can be batched.

The 1000th frame is identical in flip:

![flip many_cubes-main-mesh3d-mtc10 many_cubes-batching-mesh3d-mtc10
67ppd
ldr](https://github.com/bevyengine/bevy/assets/302146/2b3a8614-9466-4ed8-b50c-d4aa71615dbb)

```
     Mean: 0.000000
     Weighted median: 0.000000
     1st weighted quartile: 0.000000
     3rd weighted quartile: 0.000000
     Min: 0.000000
     Max: 0.000000
     Evaluation time: 0.4537 seconds
```

### 3D Mesh `many_cubes --benchmark --vary-per-instance`

<img width="1394" alt="Screenshot 2023-09-03 at 23 48 44"
src="https://github.com/bevyengine/bevy/assets/302146/f02a816b-a444-4c18-a96a-63b5436f3b7f">
This run varies the material data per instance by randomly-generating
its colour. This is the worst case for batching and that it performs
about the same as `main` is a good thing as it demonstrates that the
batching has minimal overhead when dealing with ~11k visible mesh
entities.

The 1000th frame is identical according to flip:

![flip many_cubes-main-mesh3d-vpi many_cubes-batching-mesh3d-vpi 67ppd
ldr](https://github.com/bevyengine/bevy/assets/302146/ac5f5c14-9bda-4d1a-8219-7577d4aac68c)

```
     Mean: 0.000000
     Weighted median: 0.000000
     1st weighted quartile: 0.000000
     3rd weighted quartile: 0.000000
     Min: 0.000000
     Max: 0.000000
     Evaluation time: 0.4568 seconds
```

### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode
mesh2d`

<img width="1412" alt="Screenshot 2023-09-03 at 23 59 56"
src="https://github.com/bevyengine/bevy/assets/302146/cb02ae07-237b-4646-ae9f-fda4dafcbad4">
This spawns 160 waves of 1000 quad meshes that are shaded with
ColorMaterial. Each wave has a different material so 160 waves currently
should result in 160 batches. This results in a 50% reduction in median
frame time.

Capturing a screenshot of the 1000th frame main vs PR gives:

![flip bevymark-main-mesh2d bevymark-batching-mesh2d 67ppd
ldr](https://github.com/bevyengine/bevy/assets/302146/80102728-1217-4059-87af-14d05044df40)

```
     Mean: 0.001222
     Weighted median: 0.750432
     1st weighted quartile: 0.453494
     3rd weighted quartile: 0.969758
     Min: 0.000000
     Max: 0.990296
     Evaluation time: 0.4255 seconds
```

So they seem to produce the same results. I also double-checked the
number of draws. `main` does 160000 draws, and the PR does 160, as
expected.

### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode
mesh2d --material-texture-count 10`

<img width="1392" alt="Screenshot 2023-09-04 at 00 09 22"
src="https://github.com/bevyengine/bevy/assets/302146/4358da2e-ce32-4134-82df-3ab74c40849c">
This generates 10 textures and generates materials for each of those and
then selects one material per wave. The median frame time is reduced by
50%. Similar to the plain run above, this produces 160 draws on the PR
and 160000 on `main` and the 1000th frame is identical (ignoring the fps
counter text overlay).

![flip bevymark-main-mesh2d-mtc10 bevymark-batching-mesh2d-mtc10 67ppd
ldr](https://github.com/bevyengine/bevy/assets/302146/ebed2822-dce7-426a-858b-b77dc45b986f)

```
     Mean: 0.002877
     Weighted median: 0.964980
     1st weighted quartile: 0.668871
     3rd weighted quartile: 0.982749
     Min: 0.000000
     Max: 0.992377
     Evaluation time: 0.4301 seconds
```

### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode
mesh2d --vary-per-instance`

<img width="1396" alt="Screenshot 2023-09-04 at 00 13 53"
src="https://github.com/bevyengine/bevy/assets/302146/b2198b18-3439-47ad-919a-cdabe190facb">
This creates unique materials per instance by randomly-generating the
material's colour. This is the worst case for 2D batching. Somehow, this
PR manages a 7% reduction in median frame time. Both main and this PR
issue 160000 draws.

The 1000th frame is the same:

![flip bevymark-main-mesh2d-vpi bevymark-batching-mesh2d-vpi 67ppd
ldr](https://github.com/bevyengine/bevy/assets/302146/a2ec471c-f576-4a36-a23b-b24b22578b97)

```
     Mean: 0.001214
     Weighted median: 0.937499
     1st weighted quartile: 0.635467
     3rd weighted quartile: 0.979085
     Min: 0.000000
     Max: 0.988971
     Evaluation time: 0.4462 seconds
```

### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode
sprite`

<img width="1396" alt="Screenshot 2023-09-04 at 12 21 12"
src="https://github.com/bevyengine/bevy/assets/302146/8b31e915-d6be-4cac-abf5-c6a4da9c3d43">
This just spawns 160 waves of 1000 sprites. There should be and is no
notable difference between main and the PR.

### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode
sprite --material-texture-count 10`

<img width="1389" alt="Screenshot 2023-09-04 at 12 36 08"
src="https://github.com/bevyengine/bevy/assets/302146/45fe8d6d-c901-4062-a349-3693dd044413">
This spawns the sprites selecting a texture at random per instance from
the 10 generated textures. This has no significant change vs main and
shouldn't.

### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode
sprite --vary-per-instance`

<img width="1401" alt="Screenshot 2023-09-04 at 12 29 52"
src="https://github.com/bevyengine/bevy/assets/302146/762c5c60-352e-471f-8dbe-bbf10e24ebd6">
This sets the sprite colour as being unique per instance. This can still
all be drawn using one batch. There should be no difference but the PR
produces median frame times that are 4% higher. Investigation showed no
clear sources of cost, rather a mix of give and take that should not
happen. It seems like noise in the results.

### Summary

| Benchmark  | % change in median frame time |
| ------------- | ------------- |
| many_cubes  | 🟩 -30%  |
| many_cubes 10 materials  | 🟩 -5%  |
| many_cubes unique materials  | 🟩 ~0%  |
| bevymark mesh2d  | 🟩 -50%  |
| bevymark mesh2d 10 materials  | 🟩 -50%  |
| bevymark mesh2d unique materials  | 🟩 -7%  |
| bevymark sprite  | 🟥 2%  |
| bevymark sprite 10 materials  | 🟥 0.6%  |
| bevymark sprite unique materials  | 🟥 4.1%  |

---

## Changelog

- Added: 2D and 3D mesh entities that share the same mesh and material
(same textures, same data) are now batched into the same draw command
for better performance.

---------

Co-authored-by: robtfm <50659922+robtfm@users.noreply.github.com>
Co-authored-by: Nicola Papale <nico@nicopap.ch>
@superdump
Copy link
Contributor

I learned from Sebastian Aaltonen's tweets that dynamic indexing (such as an instance index) into arrays in uniform or storage buffers can be quite slow on low end Android (e.g. $99). One thing Sebastian noted was that a vec4 read from a dynamically indexed array cost about 0.7ms each.

I think for CPU-driven, perhaps using an instance-rate vertex buffer could be faster. And for materials that only have one mesh instance, a single uniform binding may perform better.

@TirushOne
Copy link

Hay guys. I hope to make my own contribution to the bevy engine some day. But for now I am just a lurker eagerly watching the delelopment of the bevy engine. i was just wondering whether currently (at least in the repo) brevy supports automatic gpu instancing of meshes / has a high level api to support it. Because what it seems from what I can gather here that good progress has been made, but no one has really agreed on a user facing api yet

rdrpenguin04 pushed a commit to rdrpenguin04/bevy that referenced this issue Jan 9, 2024
# Objective

- Implement the foundations of automatic batching/instancing of draw
commands as the next step from bevyengine#89
- NOTE: More performance improvements will come when more data is
managed and bound in ways that do not require rebinding such as mesh,
material, and texture data.

## Solution

- The core idea for batching of draw commands is to check whether any of
the information that has to be passed when encoding a draw command
changes between two things that are being drawn according to the sorted
render phase order. These should be things like the pipeline, bind
groups and their dynamic offsets, index/vertex buffers, and so on.
  - The following assumptions have been made:
- Only entities with prepared assets (pipelines, materials, meshes) are
queued to phases
- View bindings are constant across a phase for a given draw function as
phases are per-view
- `batch_and_prepare_render_phase` is the only system that performs this
batching and has sole responsibility for preparing the per-object data.
As such the mesh binding and dynamic offsets are assumed to only vary as
a result of the `batch_and_prepare_render_phase` system, e.g. due to
having to split data across separate uniform bindings within the same
buffer due to the maximum uniform buffer binding size.
- Implement `GpuArrayBuffer` for `Mesh2dUniform` to store Mesh2dUniform
in arrays in GPU buffers rather than each one being at a dynamic offset
in a uniform buffer. This is the same optimisation that was made for 3D
not long ago.
- Change batch size for a range in `PhaseItem`, adding API for getting
or mutating the range. This is more flexible than a size as the length
of the range can be used in place of the size, but the start and end can
be otherwise whatever is needed.
- Add an optional mesh bind group dynamic offset to `PhaseItem`. This
avoids having to do a massive table move just to insert
`GpuArrayBufferIndex` components.

## Benchmarks

All tests have been run on an M1 Max on AC power. `bevymark` and
`many_cubes` were modified to use 1920x1080 with a scale factor of 1. I
run a script that runs a separate Tracy capture process, and then runs
the bevy example with `--features bevy_ci_testing,trace_tracy` and
`CI_TESTING_CONFIG=../benchmark.ron` with the contents of
`../benchmark.ron`:
```rust
(
    exit_after: Some(1500)
)
```
...in order to run each test for 1500 frames.

The recent changes to `many_cubes` and `bevymark` added reproducible
random number generation so that with the same settings, the same rng
will occur. They also added benchmark modes that use a fixed delta time
for animations. Combined this means that the same frames should be
rendered both on main and on the branch.

The graphs compare main (yellow) to this PR (red).

### 3D Mesh `many_cubes --benchmark`

<img width="1411" alt="Screenshot 2023-09-03 at 23 42 10"
src="https://github.com/bevyengine/bevy/assets/302146/2088716a-c918-486c-8129-090b26fd2bc4">
The mesh and material are the same for all instances. This is basically
the best case for the initial batching implementation as it results in 1
draw for the ~11.7k visible meshes. It gives a ~30% reduction in median
frame time.

The 1000th frame is identical using the flip tool:

![flip many_cubes-main-mesh3d many_cubes-batching-mesh3d 67ppd
ldr](https://github.com/bevyengine/bevy/assets/302146/2511f37a-6df8-481a-932f-706ca4de7643)

```
     Mean: 0.000000
     Weighted median: 0.000000
     1st weighted quartile: 0.000000
     3rd weighted quartile: 0.000000
     Min: 0.000000
     Max: 0.000000
     Evaluation time: 0.4615 seconds
```

### 3D Mesh `many_cubes --benchmark --material-texture-count 10`

<img width="1404" alt="Screenshot 2023-09-03 at 23 45 18"
src="https://github.com/bevyengine/bevy/assets/302146/5ee9c447-5bd2-45c6-9706-ac5ff8916daf">
This run uses 10 different materials by varying their textures. The
materials are randomly selected, and there is no sorting by material
bind group for opaque 3D so any batching is 'random'. The PR produces a
~5% reduction in median frame time. If we were to sort the opaque phase
by the material bind group, then this should be a lot faster. This
produces about 10.5k draws for the 11.7k visible entities. This makes
sense as randomly selecting from 10 materials gives a chance that two
adjacent entities randomly select the same material and can be batched.

The 1000th frame is identical in flip:

![flip many_cubes-main-mesh3d-mtc10 many_cubes-batching-mesh3d-mtc10
67ppd
ldr](https://github.com/bevyengine/bevy/assets/302146/2b3a8614-9466-4ed8-b50c-d4aa71615dbb)

```
     Mean: 0.000000
     Weighted median: 0.000000
     1st weighted quartile: 0.000000
     3rd weighted quartile: 0.000000
     Min: 0.000000
     Max: 0.000000
     Evaluation time: 0.4537 seconds
```

### 3D Mesh `many_cubes --benchmark --vary-per-instance`

<img width="1394" alt="Screenshot 2023-09-03 at 23 48 44"
src="https://github.com/bevyengine/bevy/assets/302146/f02a816b-a444-4c18-a96a-63b5436f3b7f">
This run varies the material data per instance by randomly-generating
its colour. This is the worst case for batching and that it performs
about the same as `main` is a good thing as it demonstrates that the
batching has minimal overhead when dealing with ~11k visible mesh
entities.

The 1000th frame is identical according to flip:

![flip many_cubes-main-mesh3d-vpi many_cubes-batching-mesh3d-vpi 67ppd
ldr](https://github.com/bevyengine/bevy/assets/302146/ac5f5c14-9bda-4d1a-8219-7577d4aac68c)

```
     Mean: 0.000000
     Weighted median: 0.000000
     1st weighted quartile: 0.000000
     3rd weighted quartile: 0.000000
     Min: 0.000000
     Max: 0.000000
     Evaluation time: 0.4568 seconds
```

### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode
mesh2d`

<img width="1412" alt="Screenshot 2023-09-03 at 23 59 56"
src="https://github.com/bevyengine/bevy/assets/302146/cb02ae07-237b-4646-ae9f-fda4dafcbad4">
This spawns 160 waves of 1000 quad meshes that are shaded with
ColorMaterial. Each wave has a different material so 160 waves currently
should result in 160 batches. This results in a 50% reduction in median
frame time.

Capturing a screenshot of the 1000th frame main vs PR gives:

![flip bevymark-main-mesh2d bevymark-batching-mesh2d 67ppd
ldr](https://github.com/bevyengine/bevy/assets/302146/80102728-1217-4059-87af-14d05044df40)

```
     Mean: 0.001222
     Weighted median: 0.750432
     1st weighted quartile: 0.453494
     3rd weighted quartile: 0.969758
     Min: 0.000000
     Max: 0.990296
     Evaluation time: 0.4255 seconds
```

So they seem to produce the same results. I also double-checked the
number of draws. `main` does 160000 draws, and the PR does 160, as
expected.

### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode
mesh2d --material-texture-count 10`

<img width="1392" alt="Screenshot 2023-09-04 at 00 09 22"
src="https://github.com/bevyengine/bevy/assets/302146/4358da2e-ce32-4134-82df-3ab74c40849c">
This generates 10 textures and generates materials for each of those and
then selects one material per wave. The median frame time is reduced by
50%. Similar to the plain run above, this produces 160 draws on the PR
and 160000 on `main` and the 1000th frame is identical (ignoring the fps
counter text overlay).

![flip bevymark-main-mesh2d-mtc10 bevymark-batching-mesh2d-mtc10 67ppd
ldr](https://github.com/bevyengine/bevy/assets/302146/ebed2822-dce7-426a-858b-b77dc45b986f)

```
     Mean: 0.002877
     Weighted median: 0.964980
     1st weighted quartile: 0.668871
     3rd weighted quartile: 0.982749
     Min: 0.000000
     Max: 0.992377
     Evaluation time: 0.4301 seconds
```

### 2D Mesh `bevymark --benchmark --waves 160 --per-wave 1000 --mode
mesh2d --vary-per-instance`

<img width="1396" alt="Screenshot 2023-09-04 at 00 13 53"
src="https://github.com/bevyengine/bevy/assets/302146/b2198b18-3439-47ad-919a-cdabe190facb">
This creates unique materials per instance by randomly-generating the
material's colour. This is the worst case for 2D batching. Somehow, this
PR manages a 7% reduction in median frame time. Both main and this PR
issue 160000 draws.

The 1000th frame is the same:

![flip bevymark-main-mesh2d-vpi bevymark-batching-mesh2d-vpi 67ppd
ldr](https://github.com/bevyengine/bevy/assets/302146/a2ec471c-f576-4a36-a23b-b24b22578b97)

```
     Mean: 0.001214
     Weighted median: 0.937499
     1st weighted quartile: 0.635467
     3rd weighted quartile: 0.979085
     Min: 0.000000
     Max: 0.988971
     Evaluation time: 0.4462 seconds
```

### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode
sprite`

<img width="1396" alt="Screenshot 2023-09-04 at 12 21 12"
src="https://github.com/bevyengine/bevy/assets/302146/8b31e915-d6be-4cac-abf5-c6a4da9c3d43">
This just spawns 160 waves of 1000 sprites. There should be and is no
notable difference between main and the PR.

### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode
sprite --material-texture-count 10`

<img width="1389" alt="Screenshot 2023-09-04 at 12 36 08"
src="https://github.com/bevyengine/bevy/assets/302146/45fe8d6d-c901-4062-a349-3693dd044413">
This spawns the sprites selecting a texture at random per instance from
the 10 generated textures. This has no significant change vs main and
shouldn't.

### 2D Sprite `bevymark --benchmark --waves 160 --per-wave 1000 --mode
sprite --vary-per-instance`

<img width="1401" alt="Screenshot 2023-09-04 at 12 29 52"
src="https://github.com/bevyengine/bevy/assets/302146/762c5c60-352e-471f-8dbe-bbf10e24ebd6">
This sets the sprite colour as being unique per instance. This can still
all be drawn using one batch. There should be no difference but the PR
produces median frame times that are 4% higher. Investigation showed no
clear sources of cost, rather a mix of give and take that should not
happen. It seems like noise in the results.

### Summary

| Benchmark  | % change in median frame time |
| ------------- | ------------- |
| many_cubes  | 🟩 -30%  |
| many_cubes 10 materials  | 🟩 -5%  |
| many_cubes unique materials  | 🟩 ~0%  |
| bevymark mesh2d  | 🟩 -50%  |
| bevymark mesh2d 10 materials  | 🟩 -50%  |
| bevymark mesh2d unique materials  | 🟩 -7%  |
| bevymark sprite  | 🟥 2%  |
| bevymark sprite 10 materials  | 🟥 0.6%  |
| bevymark sprite unique materials  | 🟥 4.1%  |

---

## Changelog

- Added: 2D and 3D mesh entities that share the same mesh and material
(same textures, same data) are now batched into the same draw command
for better performance.

---------

Co-authored-by: robtfm <50659922+robtfm@users.noreply.github.com>
Co-authored-by: Nicola Papale <nico@nicopap.ch>
@starwolfy
Copy link

starwolfy commented Feb 10, 2024

Does this issue address optimizations for shadow calculations with many 3D meshes (one entity per mesh) combined with a few or more shadow-casting light sources such as point lights and direction lights? My game is severely bottlenecked almost by shadow calculations alone, systems such as batch_and_prepare_render_phase check_light_mesh_visibility queue_shadows write_batched_instance_buffer and the ShadowPassNode on render thread quickly take up many milliseconds.

@superdump
Copy link
Contributor

@starwolfy I/we have been working on a bunch of optimisations for the main queue, sort, batch and prepare, write buffer, render pass nodes flow for mesh entities. I have some improvements for the directional light shadow cascade culling.

@superdump
Copy link
Contributor

@TirushOne - since 0.12 we have had automatic instanced draws of entities with the same mesh and material https://bevyengine.org/news/bevy-0-12/#automatic-batching-and-instancing-of-draw-commands and 0.13 improved the sorting of entities with opaque materials a bit. Further improvements will come to enable instancing in more cases.

@insberr
Copy link

insberr commented Mar 18, 2024

I have been using instancing for a project I am working on and it boosted my performance by a lot. My only problem is that I want some cubes to have transparency. I have no idea how to make the alpha value in the material color work. Even setting the alpha in the shader does nothing. I am using the Bevy instancing example, https://github.com/bevyengine/bevy/blob/main/examples/shader/shader_instancing.rs

How would I go about making alpha work? And while I am at it, it would be nice to add shadows back in.

@pcwalton
Copy link
Contributor

pcwalton commented Apr 8, 2024

@starwolfy This should be significantly faster in 0.14, especially when #12773 lands.

@suprohub
Copy link

Any news?

@IceSentry
Copy link
Contributor

Bevy now does automatic instancing for any combination of mesh and material that are cloned. See the automatic_instancing example for more details. 0.16 will also bring a lot more than just instancing.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Rendering Feb 21, 2025
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
A-Rendering Drawing game state to the screen C-Feature A new feature, making something new possible
Projects
Status: Done
Development

No branches or pull requests