CpuBufferPool slower than glium #1434

Closed
KeyboardDanni opened this issue Nov 7, 2020 · 3 comments · Fixed by #2076

Comments

@KeyboardDanni

KeyboardDanni commented Nov 7, 2020

Issue

When using vulkano to draw instanced quads, the per-draw overhead is actually larger than with glium, which defeats the whole point of using vulkano in the first place.

I need low draw call overhead for my 2D sprite-based engine for scenarios where ordered draws involve lots of texture changes, as these can't be batched.

For each draw operation I do the following:

// Allocate a chunk of instance data from the CpuBufferPool for this draw.
let chunk = self.instance_buffer
    .chunk(self.instance_buffer_src.clone())
    .expect("Failed to allocate buffer chunk");
let vertex_slice = self.vertex_buffer.clone();

// Each call wraps the chunk in a new Arc.
builder.draw_indexed(self.pipeline.clone(), &self.dynamic_state,
                     vec![vertex_slice, Arc::new(chunk)],
                     self.index_buffer.clone(), (), ())
    .expect("Failed to draw buffer");

There are two problems:

1. chunk() seems to perform a lot of allocations, or otherwise takes a long time to figure out which chunk to use and whether to allocate.
2. I have to create a new Arc for every call to draw_indexed().

If I remove batching, callgrind reports 23.97% of time spent in vulkano::buffer::cpu_pool::CpuBufferPool<T,A>::try_next_impl and 49.89% of time spent in __memcpy_avx_unaligned_erms, which is called from core::ptr::drop_in_place'2 and seems to come from dropping the Arc. Without this overhead I suspect vulkano would be quite fast, but right now this bottleneck is blocking me from working on the rest of the renderer.

I tried to do buffering myself using a Vec of CpuAccessibleBuffer objects, but I ran into #1429 and #1433 while trying to implement this.

Any help on this is greatly appreciated.

@KentaTheBugMaker

If you want to draw 2D sprites, try using blit image or copy image and a single quad plane.

@KeyboardDanni
Author

If you want to draw 2D sprites, try using blit image or copy image and a single quad plane.

I need to be able to apply transforms, blending, and shaders, so this is a no-go. Additionally, I would have to issue a separate command for every single draw operation, which I don't think would scale well to ~400k sprites. I need to be able to fill an 8k instance buffer so that I can give each sprite a different transform, alpha, etc. within the shader and still have everything be fast, and draw all that at once.
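
To make the batching constraint above concrete, here is a minimal sketch in plain Rust (no vulkano calls) of order-preserving batching: consecutive sprites that share a texture are merged into one instanced draw, and a batch is also split once it reaches the instance buffer's capacity. The names `SpriteInstance`, `Batch`, and `build_batches` are hypothetical, the struct fields are illustrative, and reading "8k" as a capacity of 8192 instances is an assumption; none of this comes from the author's engine.

```rust
use std::ops::Range;

/// Hypothetical per-sprite instance data; the fields are illustrative only.
#[derive(Clone, Copy, Debug, PartialEq)]
struct SpriteInstance {
    texture_id: u32,
    transform: [f32; 6],
    alpha: f32,
}

/// One instanced draw: a texture plus a contiguous range of instance indices.
#[derive(Debug, PartialEq)]
struct Batch {
    texture_id: u32,
    instances: Range<usize>,
}

/// Groups sprites into batches without reordering them.
/// A new batch starts whenever the texture changes or the current
/// batch already holds `capacity` instances.
fn build_batches(sprites: &[SpriteInstance], capacity: usize) -> Vec<Batch> {
    assert!(capacity > 0, "capacity must be non-zero");
    let mut batches: Vec<Batch> = Vec::new();
    for (i, sprite) in sprites.iter().enumerate() {
        // Can this sprite join the most recent batch?
        let extend = match batches.last() {
            Some(last) => {
                last.texture_id == sprite.texture_id && last.instances.len() < capacity
            }
            None => false,
        };
        if extend {
            if let Some(last) = batches.last_mut() {
                last.instances.end = i + 1;
            }
        } else {
            batches.push(Batch {
                texture_id: sprite.texture_id,
                instances: i..i + 1,
            });
        }
    }
    batches
}

fn main() {
    let sprite = |texture_id: u32| SpriteInstance {
        texture_id,
        transform: [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
        alpha: 1.0,
    };
    // Draw order matters, so the last sprite cannot join the first batch.
    let sprites = vec![sprite(0), sprite(0), sprite(1), sprite(0)];
    for batch in build_batches(&sprites, 8192) {
        println!("texture {} -> instances {:?}", batch.texture_id, batch.instances);
    }
}
```

In the worst case, where every sprite uses a different texture from its neighbour, this yields one batch per sprite, which is the scenario where per-draw overhead dominates.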

@Rua
Contributor

Rua commented Jan 23, 2021

I recently noticed when profiling my code that try_next_impl is taking up an awful lot of time. That seems to be connected to this issue.
