[llvm] [AArch64]Fix invalid use of ld1/st1 in stack alloc (PR #105518)

Sander de Smalen via llvm-commits llvm-commits at lists.llvm.org
Thu Aug 29 09:05:53 PDT 2024


================
@@ -3354,10 +3360,21 @@ bool AArch64FrameLowering::spillCalleeSavedRegisters(
           MachinePointerInfo::getFixedStack(MF, FrameIdxReg2),
           MachineMemOperand::MOStore, Size, Alignment));
       MIB.addReg(PnReg);
-      MIB.addReg(AArch64::SP)
-          .addImm(RPI.Offset) // [sp, #offset*scale],
-                              // where factor*scale is implicit
-          .setMIFlag(MachineInstr::FrameSetup);
+      MIB.addReg(AArch64::SP);
+      if (RPI.Offset >= -8 && RPI.Offset <= 7)
+        MIB.addImm(RPI.Offset);
+      else {
+        // When the stack offset is out of range for the st1b scalar + imm
+        // variant, store the offset in a register for use with the scalar +
+        // scalar variant.
+        Register ScratchReg = findScratchNonCalleeSaveRegister(&MBB);
----------------
sdesmalen-arm wrote:

> wouldn't it be better from performance perspective to do it this way

On the face of it I would agree that multi-vector stores are probably better from a performance perspective. But I would only expect that to be necessary for at most 2 loads/stores. In the tests, however, I see it being necessary for more loads/stores, which is suspicious.

Looking into this a bit more closely, I notice that with current `main` (i.e. without this PR), the offsets of the spills and fills are already incorrect. For example:
```
declare void @bar()
define void @foo(<vscale x 4 x i32> %arg) nounwind {
  call void @bar()
  ret void
}
```
When building this with `llc -mattr=+sve2p1 < /tmp/t2.ll -enable-misched=false -enable-post-misched=false`, I see:
```
addvl   sp, sp, #-18
str     p15, [sp, #4, mul vl]           // 2-byte Folded Spill
str     p14, [sp, #5, mul vl]           // 2-byte Folded Spill
...
str     p4, [sp, #15, mul vl]           // 2-byte Folded Spill
ptrue   pn8.b
st1b    { z22.b, z23.b }, pn8, [sp, #4, mul vl] // 32-byte Folded Spill
st1b    { z20.b, z21.b }, pn8, [sp, #8, mul vl] // 32-byte Folded Spill
...
st1b    { z8.b, z9.b }, pn8, [sp, #32, mul vl] // 32-byte Folded Spill
```
Storing to `sp + 32 * VL` is incorrect, because the `addvl sp, sp, #-18` only allocated space for 18 vectors.
Furthermore, the offsets of the multi-vector stores are scaled by 4 instead of 2.
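
To make the arithmetic concrete (taking VL to be the scalable vector length in bytes):
```
frame:              addvl sp, sp, #-18   =>  18 * VL bytes
last pair on main:  [sp, #32, mul vl]    =>  bytes [32*VL, 34*VL)  -- past the end of the frame
one pair covers:    2 vectors            =>  offsets should step by 2, not 4
```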

This is something that needs fixing. I would instead have expected:
```
st1b    { z22.b, z23.b }, pn8, [sp, #2, mul vl] // 32-byte Folded Spill
st1b    { z20.b, z21.b }, pn8, [sp, #4, mul vl] // 32-byte Folded Spill
...
st1b    { z8.b, z9.b }, pn8, [sp, #16, mul vl] // 32-byte Folded Spill
```
in which case only z8 and z9 would need to be spilled using the regular STR instructions, which isn't too bad.
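
For reference, the reason z8/z9 would still need `STR`: if I'm reading the encoding right, the two-register `st1b` (scalar plus immediate) has a 4-bit immediate scaled by the number of vectors, so the printed offset must be a multiple of 2 in the range [-16, 14]:
```
st1b (2 regs, scalar+imm):  EA = sp + imm * VL, imm = 2 * simm4  =>  imm in {-16, -14, ..., 14}
z8/z9 at [sp, #16, mul vl]: 16 > 14                              =>  not encodable, fall back to STR
```
That is presumably also where the `(14 + 2) * 16` bound below comes from: 14 is the largest encodable start, the pair covers 2 more VLs, and the scalable byte offset counts 16 bytes per VL.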

I'm not sure if there is any data available on whether finding an extra scratch register + `addvl` is worth it in practice, but in the absence of that data I'd favour the approach that is lower in complexity. From a quick look, it seems the place where the paired Z registers are created could be amended with an extra condition on the `ScalableByteOffset`, to avoid creating the pair when the offset exceeds `((14 + 2) * 16)`. Something along these lines, purely as a sketch (the names `PnRegAvailable` / `IsConsecutiveZprPair` / `FitsSt1Imm` are illustrative, not taken from the actual pairing code):
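```
// Only form a Z-register pair for the st1b/ld1b multi-vector path when the
// pair's scalable offset is still encodable in the 4-bit immediate.
// 16 scalable bytes == 1 VL; #14 is the largest encodable start and the
// pair itself covers 2 more VLs, hence (14 + 2).
bool FitsSt1Imm = ScalableByteOffset <= (14 + 2) * 16;
if (PnRegAvailable && IsConsecutiveZprPair && FitsSt1Imm)
  RPI.Reg2 = NextReg; // spill/fill as st1b/ld1b { zN, zN+1 }
// Otherwise leave the register unpaired and use a plain STR/LDR,
// as for z8/z9 in the example above.
```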

https://github.com/llvm/llvm-project/pull/105518

