Direct copy specialization #65

james7132 · 2024-03-02T21:48:34Z

Mostly resolves the issues found in #53. This PR specializes writes, reads, and creates of types to use `slice::copy_from_slice` directly, but only when compiling for a little-endian target and only if the type has no internal padding.

Checking the generated assembly, this eliminates a huge number of branches and even starts using vectorized copies and calls to memcpy directly for larger types.
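To make the idea concrete, here is a minimal sketch of a direct-copy fast path next to a per-field fallback. It is not the code from this PR: `Vec3`, `write_per_field`, and `write_direct` are hypothetical names, and the "no internal padding" condition is satisfied by construction rather than checked generically.

```rust
/// Hypothetical 3-component vector. With `#[repr(C)]` and three `f32` fields
/// it is 12 bytes with no internal or trailing padding.
#[repr(C)]
#[derive(Clone, Copy)]
struct Vec3 {
    x: f32,
    y: f32,
    z: f32,
}

/// Fallback path: encode each field as little-endian bytes, one at a time.
fn write_per_field(v: &Vec3, dst: &mut [u8]) {
    dst[0..4].copy_from_slice(&v.x.to_le_bytes());
    dst[4..8].copy_from_slice(&v.y.to_le_bytes());
    dst[8..12].copy_from_slice(&v.z.to_le_bytes());
}

/// Fast path: on a little-endian target the in-memory bytes already match the
/// encoded layout, so a single bulk copy is enough.
fn write_direct(v: &Vec3, dst: &mut [u8]) {
    if cfg!(target_endian = "little") {
        // SAFETY: `Vec3` is `repr(C)`, contains only `f32` fields, and has no
        // padding, so all `size_of::<Vec3>()` bytes are initialized.
        let bytes = unsafe {
            std::slice::from_raw_parts(
                (v as *const Vec3).cast::<u8>(),
                std::mem::size_of::<Vec3>(),
            )
        };
        dst[..bytes.len()].copy_from_slice(bytes);
    } else {
        write_per_field(v, dst);
    }
}

fn main() {
    let v = Vec3 { x: 1.0, y: -2.5, z: 3.25 };
    let mut a = [0u8; 12];
    let mut b = [0u8; 12];
    write_per_field(&v, &mut a);
    write_direct(&v, &mut b);
    // Both paths must produce identical bytes.
    assert_eq!(a, b);
}
```

Both functions produce the same bytes; the difference is only in how much work the compiler has to see through to emit a single copy.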

@james7132

Add `EncasedBufferVec`, a higher-performance alternative to `StorageBuffer`, and make `GpuArrayBuffer` use it.

`EncasedBufferVec` is like `BufferVec`, but it doesn't require that the type be `Pod`. Alternately, it's like `StorageBuffer<Vec<T>>`, except it doesn't allow CPU access to the data after it's been pushed. `GpuArrayBuffer` already doesn't allow CPU access to the data, so switching it to use `EncasedBufferVec` doesn't regress any functionality and offers higher performance.

Shutting off CPU access eliminates the need to copy to a scratch buffer, which results in significantly higher performance. *Note that this needs teoxoy/encase#65 from @james7132 to achieve end-to-end performance benefits*, because `encase` is rather slow at encoding data without that patch, swamping the benefits of avoiding the copy. With that patch applied, and `#[inline]` added to `encase`'s `derive` implementation of `write_into` on structs, this results in a *16% overall speedup on `many_cubes --no-frustum-culling`*.

I've verified that the generated code is now close to optimal. The only reasonable potential improvement that I see is to eliminate the zeroing in `push`. This requires unsafe code, however, so I'd prefer to leave that to a followup.
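The "encode straight into the destination, no scratch copy" idea described above can be sketched roughly as follows. This is an illustration only, not Bevy's `EncasedBufferVec` or `encase`'s API: the `Encode` trait and the `AppendOnlyBytes` type are hypothetical stand-ins.

```rust
/// Hypothetical stand-in for an encoder (not the `encase` trait).
trait Encode {
    /// Encoded size in bytes.
    const SIZE: usize;
    /// Writes `Self::SIZE` bytes into the start of `dst`.
    fn encode(&self, dst: &mut [u8]);
}

impl Encode for [f32; 4] {
    const SIZE: usize = 16;
    fn encode(&self, dst: &mut [u8]) {
        for (chunk, value) in dst.chunks_exact_mut(4).zip(self) {
            chunk.copy_from_slice(&value.to_le_bytes());
        }
    }
}

/// Hypothetical write-only byte buffer: values are encoded directly into the
/// backing storage, and no typed copy of them is kept on the CPU side.
struct AppendOnlyBytes {
    data: Vec<u8>,
}

impl AppendOnlyBytes {
    fn new() -> Self {
        Self { data: Vec::new() }
    }

    /// Appends `value` and returns the byte offset it was written at.
    fn push<T: Encode>(&mut self, value: &T) -> usize {
        let start = self.data.len();
        // Grow with zeroes, then encode in place. Skipping this zeroing is
        // the follow-up optimization mentioned above; it would need unsafe.
        self.data.resize(start + T::SIZE, 0);
        value.encode(&mut self.data[start..]);
        start
    }

    /// Raw bytes, e.g. to hand to an upload call.
    fn as_bytes(&self) -> &[u8] {
        &self.data
    }
}

fn main() {
    let mut buf = AppendOnlyBytes::new();
    let first = buf.push(&[1.0f32, 2.0, 3.0, 4.0]);
    let second = buf.push(&[5.0f32, 6.0, 7.0, 8.0]);
    assert_eq!((first, second), (0, 16));
    assert_eq!(buf.as_bytes().len(), 32);
}
```

The design choice illustrated here is that `push` encodes once, in place, so there is no intermediate `Vec<T>` that later has to be re-encoded into a separate byte buffer.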

james7132 · 2024-03-28T16:03:14Z

I think this is about as clean/low-unsafe as I can get this without negatively impacting the performance again. @teoxoy this should be ready for review now.

teoxoy · 2024-04-02T19:50:54Z

@james7132 I pushed a commit to the PR. Could you give it a look?

james7132 · 2024-04-02T20:10:33Z

I actually tried to deduplicate code like this, but apparently the optimizer doesn't like implicit if-returns as replacements for if/elses. I'll check the assembly output of this later.

james7132 · 2024-04-03T02:07:04Z

I think the Pod changes are fine, but the compiler really doesn't like the cleanup in how the branching is handled: james7132/encase_asm_tests@f936005#diff-ce2d3c1fb388fe83071754963a5343392f61ff8b7d7465aa6d2ba68dfe31993b.

This shows a regression in codegen for Vec-like writes. Though I think this comes down to the fact that it cannot inline the reservation, and thus its assertion that there's enough space does not carry over, unlike the slice writes.

james7132 · 2024-04-03T02:10:24Z

> but apparently the optimizer doesn't like implicit if-returns as replacements for if/elses. I'll check the assembly output of this later.

Ah, I think I see why: it views the non-memcpy writes/reads as a potential side effect, and thus cannot treat the return as a replacement for an else to the if.
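For readers following along, the two branching shapes being compared look roughly like this. This is a sketch with hypothetical names (`USE_FAST_PATH`, `fast_write`, `slow_write`), not the code from the PR. The two functions are semantically identical; whether they compile to the same assembly depends on the compiler version and on what gets inlined, which is the point under discussion.

```rust
/// Hypothetical stand-in for the "can this type be bulk-copied" check.
const USE_FAST_PATH: bool = cfg!(target_endian = "little");

/// Per-element path: encode each `u32` as little-endian bytes.
fn slow_write(src: &[u32], dst: &mut [u8]) {
    for (chunk, value) in dst.chunks_exact_mut(4).zip(src) {
        chunk.copy_from_slice(&value.to_le_bytes());
    }
}

/// Bulk path: copy the in-memory bytes of `src` in one go.
fn fast_write(src: &[u32], dst: &mut [u8]) {
    // SAFETY: `u32` has no padding, so every byte of `src` is initialized,
    // and the length is exactly the byte size of the slice.
    let bytes = unsafe {
        std::slice::from_raw_parts(src.as_ptr().cast::<u8>(), std::mem::size_of_val(src))
    };
    dst[..bytes.len()].copy_from_slice(bytes);
}

/// Shape 1: early return instead of an `else`.
fn write_early_return(src: &[u32], dst: &mut [u8]) {
    if USE_FAST_PATH {
        fast_write(src, dst);
        return;
    }
    slow_write(src, dst);
}

/// Shape 2: explicit `if`/`else`.
fn write_if_else(src: &[u32], dst: &mut [u8]) {
    if USE_FAST_PATH {
        fast_write(src, dst);
    } else {
        slow_write(src, dst);
    }
}

fn main() {
    let src = [1u32, 2];
    let mut a = [0u8; 8];
    let mut b = [0u8; 8];
    write_early_return(&src, &mut a);
    write_if_else(&src, &mut b);
    assert_eq!(a, b);
}
```

Comparing the generated assembly of the two shapes (for example with `--emit asm` in release mode) is one way to reproduce the kind of check described in this thread.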

teoxoy · 2024-04-14T10:36:00Z

@james7132 I pushed another commit to address the issue. Let me know what you think!

james7132 · 2024-04-21T04:05:52Z

Doesn't seem to make a tangible difference: james7132/encase_asm_tests@4399891. Might be due to one of the is_pod changes. It might be better to just merge this as is and close the gap in a later PR.

james7132 added 7 commits March 2, 2024 13:06

Specialize array and vector copies to use memcpy

2d5a5b6

Naively assume derived types have internal padding

f0b34c4

Specialize matricies

35c9b7c

Simplify vectors

39f9b9a

Specialize Vector types

1c4b8b7

Fix hygiene

6af8066

More hygeine fixes

5921736

james7132 marked this pull request as ready for review March 23, 2024 05:35

pcwalton mentioned this pull request Mar 23, 2024

Add EncasedBufferVec, a higher-performance alternative to StorageBuffer, and make GpuArrayBuffer use it. bevyengine/bevy#12670

Closed

james7132 added 7 commits March 23, 2024 16:44

Inline struct reads/writes/creates

fe8028f

Specialize ReadFrom

1102451

Specialize CreateFrom

ad18fc5

Try to specialize non-padded columns

06af983

Inline Vec::try_extend_zeroed

71a0cbd

Cleanup unsafe for vectors

b31fe1b

Cleanup unsafe in Matrix impls

4ee8377

james7132 force-pushed the direct-copy-specialization branch from 2901ec1 to 4ee8377 Compare March 28, 2024 04:12

james7132 mentioned this pull request Mar 30, 2024

Renderer optimization tracking issue bevyengine/bevy#12590

Open

19 tasks

use the direct copy fast path only for POD types + other tweaks

f710173

use a macro

ada1631

teoxoy merged commit 83e744a into teoxoy:main Apr 24, 2024
7 checks passed
