[llvm] [BOLT][AArch64] Enabling Inlining for Memcpy for AArch64 in BOLT (PR #154929)

via llvm-commits llvm-commits at lists.llvm.org
Fri Aug 22 09:02:40 PDT 2025


================
@@ -1866,8 +1866,32 @@ Error InlineMemcpy::runOnFunctions(BinaryContext &BC) {
         const bool IsMemcpy8 = (CalleeSymbol->getName() == "_memcpy8");
         const bool IsTailCall = BC.MIB->isTailCall(Inst);
 
+        // Extract the size of the copy from preceding instructions by looking
+        // for writes to the size register
+        std::optional<uint64_t> KnownSize = std::nullopt;
+        BitVector WrittenRegs(BC.MRI->getNumRegs());
+
+        // Get the size register (3rd arg register, index 2 for AArch64)
----------------
yafet-a wrote:

The intention was that architecture-specific dispatching should happen at the MCPlusBuilder level, not the pass level. The `InlineMemcpy` pass was intended to be architecture-agnostic, with each architecture's `MCPlusBuilder` handling its own implementation details through virtual method overrides.

I added a new virtual method [`createInlineMemcpy(bool ReturnEnd, std::optional<uint64_t> KnownSize)` in MCPlusBuilder.h (lines 1898-1904)](https://github.com/yafet-a/llvm-project/blob/users/yafet-a/inlining-memcpy/bolt/include/bolt/Core/MCPlusBuilder.h#L1898-L1904) with a default fallback implementation:

```cpp
virtual InstructionListType createInlineMemcpy(bool ReturnEnd, 
                                               std::optional<uint64_t> KnownSize) const {
  // Default implementation ignores KnownSize and uses original method
  return createInlineMemcpy(ReturnEnd);
}
```
This meant that:
- **X86**: Uses the default fallback and ignores the `KnownSize` parameter, because `REP MOVSB` is a single instruction that reads the size from `RCX` at runtime. It doesn't need compile-time size knowledge, so it continues working untouched.
- **AArch64**: [Overrides the method in AArch64MCPlusBuilder.cpp (line 2620)](https://github.com/yafet-a/llvm-project/blob/users/yafet-a/inlining-memcpy/bolt/lib/Target/AArch64/AArch64MCPlusBuilder.cpp#L2620) to use the `KnownSize` for generating optimal width-specific load/store sequences
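
The dispatch pattern above can be sketched with simplified stand-in types (the class and method names below only mirror the real ones; the `std::string` return type and bodies are placeholders for illustration, not the actual `MCPlusBuilder` API):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Simplified stand-in for BOLT's MCPlusBuilder (illustrative only).
struct MCPlusBuilderSketch {
  // Original single-argument entry point.
  virtual std::string createInlineMemcpy(bool ReturnEnd) const {
    return ReturnEnd ? "mempcpy-sequence" : "memcpy-sequence";
  }
  // New overload: the default implementation ignores KnownSize, so targets
  // that don't override it (e.g. X86) keep their old behavior unchanged.
  virtual std::string
  createInlineMemcpy(bool ReturnEnd,
                     std::optional<uint64_t> KnownSize) const {
    (void)KnownSize;
    return createInlineMemcpy(ReturnEnd);
  }
  virtual ~MCPlusBuilderSketch() = default;
};

// Stand-in for the AArch64 builder override.
struct AArch64BuilderSketch : MCPlusBuilderSketch {
  using MCPlusBuilderSketch::createInlineMemcpy; // un-hide the 1-arg overload
  std::string
  createInlineMemcpy(bool ReturnEnd,
                     std::optional<uint64_t> KnownSize) const override {
    // With a known size, emit a width-specific load/store sequence;
    // otherwise fall back to the generic path.
    if (KnownSize)
      return "load/store sequence for " + std::to_string(*KnownSize) +
             " bytes";
    return MCPlusBuilderSketch::createInlineMemcpy(ReturnEnd, std::nullopt);
  }
};
```

The `using`-declaration matters in this shape of API: without it, the derived override of the two-argument overload would hide the inherited one-argument overload from callers of the derived class.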

However, you make a good point, because the size extraction logic I added in [`BinaryPasses.cpp (lines 1869-1890)`](https://github.com/yafet-a/llvm-project/blob/users/yafet-a/inlining-memcpy/bolt/lib/Passes/BinaryPasses.cpp#L1869-L1890) is indeed AArch64-specific:

```cpp
// Extract size from preceding instructions (AArch64 only)
// Pattern: MOV X2, #nb-bytes; BL memcpy src, dest, X2  
if (BC.isAArch64()) {
  MCPhysReg SizeReg = BC.MIB->getIntArgRegister(2);  // X2 on AArch64
  BC.MIB->extractMoveImmediate(Inst, SizeReg);       // MOVZXi instruction
}
```
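
The backward scan for the size register can be illustrated with a toy instruction model (the `ToyInst` struct and `findKnownSize` helper below are hypothetical simplifications for exposition, not BOLT's `MCInst` handling):

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Toy instruction model for illustration (not BOLT's MCInst).
enum class Op { MovImm, Add, Call };
struct ToyInst {
  Op Opcode;
  unsigned DefReg; // register written (0 if none)
  uint64_t Imm;    // immediate operand, valid for MovImm
};

// Walk backwards from the call: if the most recent write to SizeReg is a
// move-immediate (e.g. MOVZ X2, #32 before BL memcpy), the copy size is
// known statically; any other write to SizeReg, or no write at all in the
// block, means the size is only known at run time.
std::optional<uint64_t> findKnownSize(const std::vector<ToyInst> &Block,
                                      size_t CallIdx, unsigned SizeReg) {
  for (size_t I = CallIdx; I-- > 0;) {
    const ToyInst &Inst = Block[I];
    if (Inst.DefReg != SizeReg)
      continue;
    if (Inst.Opcode == Op::MovImm)
      return Inst.Imm;   // size register holds a constant at the call
    return std::nullopt; // size register clobbered by a non-immediate def
  }
  return std::nullopt;
}
```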

I've added an explicit early return for clarity, although technically the virtual method fallback handles X86 correctly anyway. This makes the architecture-specific behavior **explicit and self-documenting**.

https://github.com/llvm/llvm-project/pull/154929


More information about the llvm-commits mailing list