[llvm] AArch64: Optimize memmove for non-power-of-two sizes (PR #168633)

Sun Dec 14 13:37:41 PST 2025

osamakader wrote:

memmove does not get the same lowering as memcpy because the generic SelectionDAG path treats every memmove as volatile when it picks the memory operations.

memcpy (non-volatile): In getMemcpyLoadsAndStores (SelectionDAG.cpp), MemOp::Copy uses the actual isVol parameter, so allowOverlap() == true for non-volatile cases. This allows:
Overlapping loads/stores optimization (from findOptimalMemOpLowering)
Chaining optimization (from chainLoadsAndStoresForMemcpy, fixed in PR 168890)

memmove (always treated as volatile): In getMemmoveLoadsAndStores (SelectionDAG.cpp:8799), MemOp::Copy is hardcoded to IsVolatile=true, so allowOverlap() == false. As a result:
The overlapping loads/stores optimization is disabled in the generic path
The generic path falls back to mixed-size operations (e.g., i32 + i16 + i8 for 7 bytes)
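
To make the difference concrete, here is a simplified standalone model of the two splitting strategies. It is not LLVM code and does not reproduce findOptimalMemOpLowering; the names MemOpPlan and planOps are hypothetical, and it assumes maxWidth is a power of two and that unaligned accesses are acceptable.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model (not an LLVM type): one load/store pair of
// `width` bytes starting at byte `offset`.
struct MemOpPlan {
  std::size_t offset;
  std::size_t width;
};

// Split `size` bytes into operations no wider than `maxWidth`
// (assumed to be a power of two, at least 1).
// With allowOverlap == false the tail is covered by ever smaller widths.
// With allowOverlap == true the tail is covered by one full-width
// operation that ends exactly at `size` and overlaps the previous one.
std::vector<MemOpPlan> planOps(std::size_t size, std::size_t maxWidth,
                               bool allowOverlap) {
  std::vector<MemOpPlan> ops;
  std::size_t offset = 0;
  std::size_t width = maxWidth;
  while (offset < size) {
    std::size_t left = size - offset;
    if (width <= left) {
      ops.push_back({offset, width});
      offset += width;
      continue;
    }
    if (allowOverlap && size >= width) {
      // Slide the last operation back so it ends at the final byte.
      ops.push_back({size - width, width});
      break;
    }
    width /= 2; // No overlap allowed: try the next smaller width.
  }
  return ops;
}
```

For 7 bytes with a 4-byte maximum, planOps(7, 4, false) yields widths 4, 2, 1 at offsets 0, 4, 6, while planOps(7, 4, true) yields two 4-byte operations at offsets 0 and 3.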

Our AArch64-specific implementation:
Uses overlapping loads/stores (e.g., two i32 operations for 7 bytes instead of i32 + i16 + i8)
Provides better codegen for non-power-of-two sizes.
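
As a source-level illustration of why overlapping accesses are still safe for memmove, here is a minimal sketch of the 7-byte case. The helper name move7 is hypothetical and this is not the code the patch emits; it only shows the shape: both 4-byte loads are issued before either store, so overlapping source and destination buffers are handled correctly.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: move exactly 7 bytes using two overlapping
// 4-byte accesses (offsets 0 and 3) instead of 4 + 2 + 1.
void move7(void *dst, const void *src) {
  auto *d = static_cast<unsigned char *>(dst);
  const auto *s = static_cast<const unsigned char *>(src);
  std::uint32_t lo, hi;
  // Load everything first; byte 3 is read by both loads.
  std::memcpy(&lo, s, 4);     // source bytes 0..3
  std::memcpy(&hi, s + 3, 4); // source bytes 3..6
  // Only then store, so the source is never read after being clobbered.
  std::memcpy(d, &lo, 4);     // destination bytes 0..3
  std::memcpy(d + 3, &hi, 4); // destination bytes 3..6
}
```

Byte 3 is written twice with the same value, which is harmless for a non-volatile access.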

https://github.com/llvm/llvm-project/pull/168633