[llvm] [BOLT][AArch64] Enabling Inlining for Memcpy for AArch64 in BOLT (PR #154929)

Fri Aug 29 08:05:21 PDT 2025

yafet-a wrote:

> What you see there is that X9 is saved to the stack before the call to memcpy, and after the call it is reloaded because it is used by function `use`. It's a caller-saved register, and that means that if we now start using X9 as a temp register for the inlined memcpy, we are good.
> 
> Can you add this test-case to your positive tests please? You can keep the other little examples that you have, but I think it would be good to have a bigger test case where you match the whole assembly sequence that includes this caller-saved register behaviour. It would be good if you then add one more, that similarly test the whole sequence but using the FP temp register, that would then cover everything I think.

Thanks for the example. I've been looking at it this afternoon and found that it exposed my missing handling of register aliasing. Your `complex_operation` uses `mov w2, #64`, but our code was only looking for writes to X2, so it missed the 32-bit alias W2 that was actually being written.

I initially tried using the `isMOVW()` helper, but found that it also matches MOVN, MOVK, and other variants that don't simply set the register to the immediate (e.g., `movk w2, #64, lsl #16` only updates part of w2), so I went with matching only the direct-assignment variants (MOVZXi/MOVZWi). Feel free to let me know if a negative test is needed for the MOVK case. Originally I was only handling MOVZXi, but your test case showed me that I needed MOVZWi too, since `mov w2, #64` can also be emitted.
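To illustrate why only the MOVZ forms are safe to treat as "the size is this immediate", here is a minimal toy model of the three move-wide instructions on a 32-bit register. The function names `movz`, `movk`, and `movn` are hypothetical helpers for this sketch only and are not part of BOLT or LLVM.

```cpp
#include <cstdint>

// Toy models of AArch64 move-wide-immediate semantics on a 32-bit (W)
// register. Shift is 0 or 16 for the W forms. Illustrative only.

// MOVZ: the whole register becomes the shifted immediate, other bits zero.
constexpr uint32_t movz(uint16_t Imm, unsigned Shift) {
  return static_cast<uint32_t>(Imm) << Shift;
}

// MOVK: only the selected 16-bit field is replaced; the rest of the old
// value is kept, so the result depends on what the register held before.
constexpr uint32_t movk(uint32_t Old, uint16_t Imm, unsigned Shift) {
  const uint32_t Mask = 0xFFFFu << Shift;
  return (Old & ~Mask) | (static_cast<uint32_t>(Imm) << Shift);
}

// MOVN: the register becomes the bitwise NOT of the shifted immediate.
constexpr uint32_t movn(uint16_t Imm, unsigned Shift) {
  return ~(static_cast<uint32_t>(Imm) << Shift);
}

// mov w2, #64 is an alias of movz w2, #64, so the size really is 64.
static_assert(movz(64, 0) == 64, "MOVZ sets the full value");
```

With MOVK the immediate alone says nothing about the final value, and with MOVN the value is the inverse of the immediate, which is why reading the operand of either as the memcpy size would be wrong.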

Your test case also made me realise I needed register aliasing detection: even after matching both instruction variants, I still need `WrittenRegs.anyCommon(SizeRegAliases)` because `getIntArgRegister(2)` returns X2, but the matched instruction might write to W2. Correct me if I am wrong, but since W2 and X2 are the same physical register, I need to detect writes to either variant.
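The aliasing check can be sketched outside of LLVM as follows. This is a standalone approximation: `std::bitset` stands in for `BitVector`, and the register numbering, `aliasesOf`, and `writesSizeReg` are made up for the example rather than taken from the patch.

```cpp
#include <bitset>
#include <cstddef>

// Hypothetical register numbering for this sketch only; the real LLVM
// register enums are different.
enum Reg : unsigned { W2, X2, W9, X9 };
constexpr std::size_t NumRegs = 4;

using RegSet = std::bitset<NumRegs>;

// Stand-in for an alias query: a register aliases itself and its W/X
// counterpart, because both names refer to the same physical register.
inline RegSet aliasesOf(Reg R) {
  RegSet S;
  switch (R) {
  case W2:
  case X2:
    S.set(W2);
    S.set(X2);
    break;
  case W9:
  case X9:
    S.set(W9);
    S.set(X9);
    break;
  }
  return S;
}

// Same idea as BitVector::anyCommon: true if the two sets intersect.
inline bool anyCommon(const RegSet &A, const RegSet &B) {
  return (A & B).any();
}

// True if an instruction that writes `Written` clobbers the size register
// under any of its names.
inline bool writesSizeReg(const RegSet &Written, Reg SizeReg) {
  return anyCommon(Written, aliasesOf(SizeReg));
}
```

In this model, an instruction whose written set contains only W2 is still reported as writing the size register X2, whereas a direct test of the X2 bit would miss it.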

Also, I have modified your test very slightly: I changed it to 8 bytes for the integer version to properly test the X9 scenario, and created a separate FP version for the 64-byte case, using your integer logic as a template but with SIMD operations.

https://github.com/llvm/llvm-project/pull/154929