[PATCH] D131582: [AArch64][SME] Add support for arm_locally_streaming functions.

Fri Sep 16 03:21:18 PDT 2022

sdesmalen added inline comments.

================
Comment at: llvm/lib/CodeGen/SelectionDAG/InstrEmitter.cpp:1162
   case ISD::EntryToken:
-    llvm_unreachable("EntryToken should have been excluded from the schedule!");
   case ISD::MERGE_VALUES:
----------------
efriedma wrote:
> sdesmalen wrote:
> > sdesmalen wrote:
> > > aemerson wrote:
> > > > ?
> > > Perhaps removing this without any alternative `assert` is a little crude, but I wasn't really sure what to test for instead.
> > > 
> > > The change here is that EntryToken now "becomes part of the schedule" because other instructions can be glued to it, even though the node itself isn't schedulable (it is always guaranteed to be the first instruction).  I was hoping someone would have a suggestion here.
> > @efriedma just wanting a second opinion on this, are you aware of anything else I might need to worry about here?
> > 
> > I needed to remove this llvm_unreachable because I changed EntryToken to also have Glue, and when using that it makes the EntryToken part of the schedule (even though the node itself can't be scheduled).
> This feels pretty isolated; the entry token basically has the same meaning during isel, and we're throwing it away after isel.  The only possible side-effect I can think of is that maybe there's something assumes the entry token only has one successor.
> 
> That said, it feels like a sort of strange thing to be inserting smstart/smend this early... would it make sense to insert it as part of prologue/epilogue lowering instead?
I considered that early on but thought doing this at ISEL level would be easier because we get the benefits from the register allocator. Also, we already have the node/mechanisms in place for SelectionDAG, so adding it here seemed sensible.

For some simple example:
```define <8 x i32> @foo(<8 x i32> %x) nounwind "aarch64_pstate_sm_body" {
  %y = add <8 x i32> %x, %x
  ret <8 x i32> %y
}
```

if we were to do this at point of Prologue/Epilogue lowering, we'd have the following input:
```
# *** IR Dump Before Prologue/Epilogue Insertion & Frame Finalization (prologepilog) ***:                                                                                                                                                                                                                                                                                   
# Machine code for function foo: NoPHIs, TracksLiveness, NoVRegs, TiedOpsRewritten, TracksDebugUserValues                                                                                                                                                                                                                                                                   
Function Live Ins: $q0, $q1                                                                                                                                                                                                                                                                                                                                                 

bb.0 (%ir-block.0):                                                                                                                                                                                                                                                                                                                                                         
  liveins: $q0, $q1                                                                                                                                                                                                                                                                                                                                                         
  renamable $q0 = ADDv4i32 killed renamable $q0, renamable $q0                                                                                                                                                                                                                                                                                                              
  renamable $q1 = ADDv4i32 killed renamable $q1, renamable $q1                                                                                                                                                                                                                                                                                                              
  RET_ReallyLR implicit $q0, implicit $q1                                                                                                                                                                                                                                                                                                                                   ```

We'd then first need to do analysis which registers are clobbered by the smstart/smstop, add extra spill slots for the input arguments passed in NEON/FP/SVE registers, and then add the spills/smstart/reloads for the prolgoue, and spills/smstop/reloads for the epilogue. The frame-lowering code is already quite complex. We'd need to end up with something like:

```
foo:                                    // @foo
// %bb.0:
        sub     sp, sp, #96
        stp     d15, d14, [sp, #32]
        stp     d13, d12, [sp, #48]
        stp     d11, d10, [sp, #64]
        stp     d9, d8, [sp, #80]
        stp     q0, q1, [sp]                    // input operands are clobbered by smstart, so spill to stack
        smstart sm
        ldr     q0, [sp]                        // reload the spilled operand register
        add     v2.4s, v0.4s, v0.4s
        ldr     q0, [sp, #16]               // reload the other spilled operand register
        add     v0.4s, v0.4s, v0.4s
        stp     q2, q0, [sp]                 // spill the results, because output operands are clobbered by smstop
        smstop  sm
        ldp     q0, q1, [sp]                 // reload the results
        ldp     d9, d8, [sp, #80]
        ldp     d11, d10, [sp, #64]
        ldp     d13, d12, [sp, #48]
        ldp     d15, d14, [sp, #32]
        add     sp, sp, #96
        ret
```

If we insert the smstart/smstop as part of ISEL the register allocator will insert all the spills and fills for free. And we may have some minor benefits from the scheduler which can rearrange code around the spills/fills for the input arguments.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D131582/new/

https://reviews.llvm.org/D131582