[PATCH] D131562: [AArch64][SME] Document SME ABI implementation in LLVM

Mon Aug 15 08:19:18 PDT 2022

sdesmalen added a comment.

In D131562#3715184 <https://reviews.llvm.org/D131562#3715184>, @tschuett wrote:

> You state that the `vscale` can change during the execution of the program. I would have expected a LangRef update.

Enabling streaming mode means FP/vector code may be executed on a different processing element. This has implications beyond having a different SVE vector length since it also clears register state and changes the set of allowable instructions. For the purpose of LLVM IR it is sufficient to support only a single vscale as long as the input IR is restricted to not pass/return (pointers to) scalable vectors between such functions and any streaming mode changes happen only on function-call boundaries. The SME ACLE has similar restrictions, so Clang emitting these attributes will ensure these restrictions are honoured in the IR.

We'd rather not open the can of worms of supporting multiple vscales, because this is not really a capability we need from LLVM. Even if it were a generic capability in LLVM, there would still be other issues we'd need to tackle in a way similar to what is outlined in this document, so I'm not sure it's worth making a change to the LangRef.
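
To make the restriction above concrete, here is a toy model in Python, not LLVM code. The `Func` record, its fields and `call_is_allowed` are hypothetical names invented for this illustration. It only encodes the rule that a call which changes streaming mode must not pass or return (pointers to) scalable vectors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Func:
    # Hypothetical record for illustration only; not an LLVM data structure.
    name: str
    streaming: bool        # True if the body runs with PSTATE.SM enabled
    passes_scalable: bool  # True if it passes/returns (pointers to) scalable vectors


def call_is_allowed(caller: Func, callee: Func) -> bool:
    """A call that changes streaming mode must not carry scalable vectors."""
    mode_change = caller.streaming != callee.streaming
    return not (mode_change and callee.passes_scalable)


normal = Func("normal", streaming=False, passes_scalable=False)
streaming_scalar = Func("streaming_scalar", streaming=True, passes_scalable=False)
streaming_vector = Func("streaming_vector", streaming=True, passes_scalable=True)

ok_mode_change = call_is_allowed(normal, streaming_scalar)        # mode changes, no scalable data
bad_mode_change = call_is_allowed(normal, streaming_vector)       # mode changes with scalable data
ok_same_mode = call_is_allowed(streaming_scalar, streaming_vector)  # same mode, same vscale
```

Under this model a single vscale per function is sufficient, because no scalable value is ever observed under two different vector lengths.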

================
Comment at: llvm/docs/AArch64SME.rst:278
+We also need this for locally streaming functions, where an ``SMSTART`` needs to
+be inserted into the DAG at the start of the function.
+
----------------
efriedma wrote:
> Not sure the CopyToReg+CopyFromReg thing is actually enough to prevent all the transforms you want to prevent... if I'm following correctly, you really need to prevent *any* operations from showing up in the middle of the call sequence, and the SelectionDAG scheduler isn't really set up to enforce that sort of constraint. Would be more obviously correct to use a pseudo-instruction for the call, and expand it to smstop+call+smstart using a post-isel hook.
> the SelectionDAG scheduler isn't really set up to enforce that sort of constraint
It seems that this is exactly what Glue is meant to do. It tells the SelectionDAG scheduler to put these operations in the same SUnit node, which means they get scheduled together without anything else being scheduled in between. Most of the call-chain is glued together in this way.
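
As a rough illustration of that point, here is a toy model in Python. It is not the actual SelectionDAG scheduler, and the node names, the `priority` map and both helper functions are assumptions made up for this sketch. Glued nodes are merged into one unit, and the scheduler may only reorder whole units.

```python
def build_units(nodes, glued_pairs):
    """Merge glued nodes into units, mimicking nodes that share one SUnit."""
    unit_of = {n: [n] for n in nodes}
    for producer, consumer in glued_pairs:
        first = unit_of[producer]
        merged = first + [n for n in unit_of[consumer] if n not in first]
        for n in merged:
            unit_of[n] = merged
    # Keep first-appearance order, one entry per unit.
    units, seen = [], set()
    for n in nodes:
        if id(unit_of[n]) not in seen:
            seen.add(id(unit_of[n]))
            units.append(unit_of[n])
    return units


def schedule(units, priority):
    """Reorder whole units by priority; a unit is never split."""
    ordered = sorted(units, key=lambda u: min(priority[n] for n in u))
    return [n for u in ordered for n in u]


nodes = ["smstop", "call", "smstart", "add"]
priority = {"smstop": 0, "add": 1, "call": 2, "smstart": 3}

glued = schedule(build_units(nodes, [("smstop", "call"), ("call", "smstart")]), priority)
unglued = schedule(build_units(nodes, []), priority)
```

With the glue edges the unrelated `add` cannot land inside the call sequence. Without them, the same priorities place it between `smstop` and `call`.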

> Would be more obviously correct to use a pseudo-instruction for the call, and expand it to smstop+call+smstart using a post-isel hook.
I experimented with that, but found that replacing the ISD::CALL with a pseudo node is not really feasible. There are some complications to this approach:
* At the time of expanding the CALL itself, the operands will already have been lowered to the callconv-defined registers, whereas we need to change PSTATE.SM before that to give the register allocator a chance to spill/reload these registers (while still virtual registers) //before// it moves them to physical registers defined by the cc.
* There is a similar problem after the call; it first needs to move the registers to virtual registers before it can invoke `smstart/smstop`.
* Even if we were happy with the idea of finding an insertion point for `smstart/smstop` after selection, we'd then need to fiddle with the scheduling afterwards to move the `smstart/smstop` instructions before and after the COPY nodes. For this we need to know which operations are part of the call and which are part of the rest of the program. But when looking for the right insertion point //after// selection, this information has already been lost. This gets even harder to recognise when values are passed by reference or via the stack.
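
The ordering these points ask for can be sketched as a toy sequence builder in Python. The function name, the textual instruction strings and the register names are all hypothetical and only serve to show where the mode changes sit relative to the copies.

```python
def lower_call_with_mode_change(args, results, enter="smstart", leave="smstop"):
    """Build an illustrative call sequence.

    args and results are lists of (virtual_reg, physical_reg) pairs.
    """
    seq = [enter]  # change PSTATE.SM while the arguments are still virtual registers
    seq += [f"COPY {phys} <- {virt}" for virt, phys in args]
    seq.append("BL callee")
    seq += [f"COPY {virt} <- {phys}" for virt, phys in results]
    seq.append(leave)  # change back only after results are in virtual registers
    return seq


seq = lower_call_with_mode_change([("%v0", "x0")], [("%v1", "x0")])
```

Here the mode changes sit outside the copies to and from physical registers, which is the placement that is hard to recover once selection has discarded the call structure.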

================
Comment at: llvm/docs/AArch64SME.rst:397
+``aarch64_pstate_sm_body`` or ``aarch64_pstate_sm_compatible`` attributes,
+in order to avoid the use of vector instructions.
+
----------------
efriedma wrote:
> Disabling vectorization might not be enough to completely prevent the use of vector instructions, but probably close enough.  (For example, we use vector instructions to lower popcount.)
Yes, you're right. We'll probably find cases where we need to make the code generator lower things differently when in streaming mode.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D131562/new/

https://reviews.llvm.org/D131562