small non-fixed-size bytewise copy is transformed to much slower `memcpy`

a bytewise copy of small but non-constant size with non-aliasing src/dest is transformed by LoopIdiomRecognize into an intrinsic memcpy. because the size is non-constant, neither InstCombine nor SelectionDAG transforms the small copy back into an appropriate series of loads and stores, so the intrinsic typically ends up as a call to `memcpy`. for small copies (<8 bytes as a fairly unscientific threshold) the library call is much slower than doing the copy with a short loop or inlined instructions. for size-optimized code, at least for x86 targets, a library call is also just larger.

i noticed this in some Rust ([godbolt](https://rust.godbolt.org/z/8Tf3q1j5G)) but it's pretty apparent with `restrict` arguments in C as well ([clang godbolt](https://clang.godbolt.org/z/eMc36Pfvd)).
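for readers who don't want to open the links, here is a minimal sketch of the shape of code i mean. it is not the exact godbolt source: the names `copy_bytes` and `copy_small` and the `& 7` mask are made up for illustration, as one way to get a length that is non-constant but visibly small.

```c
#include <stddef.h>

/* plain bytewise copy. `restrict` promises that dst and src do not alias,
 * which is what lets the loop be recognized as a memcpy idiom. */
void copy_bytes(unsigned char *restrict dst,
                const unsigned char *restrict src,
                size_t len) {
    for (size_t i = 0; i < len; i++) {
        dst[i] = src[i];
    }
}

/* hypothetical caller: the length is not a compile-time constant,
 * but the mask bounds it to at most 7 bytes. */
void copy_small(unsigned char *restrict dst,
                const unsigned char *restrict src,
                size_t len) {
    copy_bytes(dst, src, len & 7);
}
```

the behavior described in this issue is that the loop becomes a `memcpy` call even though at most 7 bytes are ever copied in `copy_small`.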

it seems like handling dynamic-but-small-sized memcpy is just particularly tricky, so maybe there's not much we can do here. i didn't see an existing issue similar to this, at least...

---

i'm not very familiar with how symbolic information is retained in LLVM. it seems that ideally i could write `if (Size.isNotConstantButSmallerThan(16))` and decide to insert something better than a memcpy library call, but i can't tell if the max trip count of the original loop is retained as a hint on the memcpy size later, or if it's totally lost by virtue of being non-constant.

even then, in some target-specific cases there are specific instruction sequences that are more profitable than a `memcpy` - x86 FSRM (already handled in [x86 SelectionDAG](https://github.com/llvm/llvm-project/blob/82c6eeed08b1c8267f6e92d594c910fe57a9775e/llvm/lib/Target/X86/X86SelectionDAGInfo.cpp#L279-L288)) is the example i know. so i'm not sure that it is _always_ profitable to inline a small-but-dynamic-size memcpy?
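to make "something better than a memcpy library call" a bit more concrete, here is a source-level sketch of one possible expansion for a copy known to be at most 16 bytes. this is only an illustration, not something LLVM does today and not a proposal for the exact lowering: `small_copy` is a hypothetical helper, and it assumes the caller guarantees `n <= 16` and non-overlapping buffers. it relies on fixed-size `memcpy` calls, which compilers usually lower to plain loads and stores, and uses two possibly overlapping moves per size class.

```c
#include <stddef.h>
#include <string.h>

/* hypothetical helper: copies n bytes, assuming n <= 16 and no overlap
 * between dst and src. each branch issues two fixed-size copies that may
 * overlap each other in the middle, so every length in the class is covered. */
static void small_copy(unsigned char *restrict dst,
                       const unsigned char *restrict src,
                       size_t n) {
    if (n >= 8) {            /* 8..16 bytes: two 8-byte moves */
        memcpy(dst, src, 8);
        memcpy(dst + n - 8, src + n - 8, 8);
    } else if (n >= 4) {     /* 4..7 bytes: two 4-byte moves */
        memcpy(dst, src, 4);
        memcpy(dst + n - 4, src + n - 4, 4);
    } else if (n >= 2) {     /* 2..3 bytes: two 2-byte moves */
        memcpy(dst, src, 2);
        memcpy(dst + n - 2, src + n - 2, 2);
    } else if (n == 1) {     /* single byte */
        dst[0] = src[0];
    }
    /* n == 0: nothing to do */
}
```

whether a branchy sequence like this actually beats a short loop or an FSRM `rep movsb` would need measuring per target, which is part of why i'm unsure about profitability.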

i also couldn't figure out whether a non-constant SDValue can still have range information associated with it, so i didn't know what to try in X86SelectionDAGInfo.cpp. did i miss a detail, or is SelectionDAG too late in the process to have range information? maybe an appropriate thing here would be a flag on memcpy to hint later that we knew a memcpy's max size is "small"? (and in _that_ case, is "dynamic but low-upper-bound" something LLVM could determine in LoopIdiomRecognize when creating the memcpy in the first place?)

i was hoping to put together a patch to propose too, but as-is i have no idea what an appropriate change would be 😅 hopefully someone has a better idea?

small non-fixed-size bytewise copy is transformed to much slower `memcpy` #87440
