[llvm] [AArch64][SME] Remove unused ZA lazy-save (PR #81648)

Mon Apr 22 04:23:18 PDT 2024

================
@@ -1121,6 +1121,43 @@ bool AArch64ExpandPseudo::expandMI(MachineBasicBlock &MBB,
   default:
     break;
 
+  case AArch64::STORETPIDR2: {
----------------
sdesmalen-arm wrote:

This wasn't really what I had in mind when I asked to create two pseudo nodes (one for initialising TPIDR2_EL0 and one for allocating the ZA buffer). From what I understand, this PR currently implements the following:

* In SelectionDAG it inserts an `ExpandZABuffer` *machine* node (pseudo instruction), which takes no arguments and returns no values. This is inserted when lowering the function start (LowerFormalArguments).
* After SelectionDAG, in the FinalizeISel pass, this pseudo instruction is expanded into an RDSVL * RDSVL computation, followed by two other pseudo instructions (STORETPIDR2 and STACKALLOC).
* These pseudo instructions are then later expanded in the AArch64ExpandPseudoInsts pass.

The code relies on `TPIDR2Obj` in MachineFunctionInfo, but doesn't really express the dependences in the DAG or in the MachineIR. I think there is little value in having separate STORETPIDR2 and STACKALLOC pseudo nodes if you don't also model the data in the DAG; otherwise you might as well emit all the (expanded) instructions in one go in the `EmitExpandZABuffer` function.

At this point, I wonder if it's simpler to revert to the previous approach where it did:
* Emit ExpandZABuffer node
* In the FinalizeISel pass, expand this to the dynamic allocation and initialisation of the TPIDR2 object.

I think a more preferred approach would be something where we'd have the following (ISD and Pseudo) nodes:
* One to allocate the buffer and initialise the TPIDR2 object
* Another one to deallocate the buffer (this is something we're currently missing, since we're relying on the function's epilogue code to deallocate the buffer, which I'm not entirely sure is safe).

The first node could take the number of bytes to allocate (SVL * SVL) and the frame index of the TPIDR2 object as arguments, and return the value of SP from before the buffer was allocated. After expansion, this allocates the buffer and stores its address to the given frame index.
The second node would take the original SP as input and would simply copy the original SP -> SP.
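
To make the intended data flow between the two nodes concrete, here is a rough, self-contained toy model (plain C++, not LLVM API code). `ToyFrame`, `ToyTPIDR2Block`, `allocZABuffer` and `deallocZABuffer` are hypothetical names invented for this sketch, and it assumes a downward-growing stack and a TPIDR2 object that only records the buffer address:

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-in for the TPIDR2 object; only the buffer address is modelled.
struct ToyTPIDR2Block {
  uint64_t ZASaveBuffer = 0;
};

// Hypothetical stand-in for a function frame: a stack pointer plus
// frame objects keyed by frame index.
struct ToyFrame {
  uint64_t SP = 0;
  std::map<int, ToyTPIDR2Block> Objects;
};

// First node: allocate NumBytes (SVL * SVL) below SP, store the buffer
// address into the TPIDR2 object at FrameIdx, and return the SP value
// from before the allocation.
uint64_t allocZABuffer(ToyFrame &F, uint64_t NumBytes, int FrameIdx) {
  assert(NumBytes % 16 == 0 && "toy model keeps SP 16-byte aligned");
  uint64_t OrigSP = F.SP;
  F.SP -= NumBytes; // assumption: the stack grows downwards
  F.Objects[FrameIdx].ZASaveBuffer = F.SP;
  return OrigSP;
}

// Second node: takes the original SP as input and copies it back into SP.
void deallocZABuffer(ToyFrame &F, uint64_t OrigSP) { F.SP = OrigSP; }
```

The point of returning the original SP from the first node is that the second node then has an explicit data dependence on the first, so the pairing is visible in the DAG/MIR instead of living only in MachineFunctionInfo.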

If there are no uses of the TPIDR2 object we'd then remove the pseudos from the MIR.
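
As a sketch of what that clean-up might look like, again as a toy model rather than real MachineInstr code (`ToyOpcode`, `ToyInstr` and `removeUnusedZAPseudos` are hypothetical, and a "use" of the TPIDR2 object is reduced to a single opcode for illustration):

```cpp
#include <algorithm>
#include <vector>

// Hypothetical opcodes: the two pseudos, a marker for any instruction
// that uses the TPIDR2 object, and everything else.
enum class ToyOpcode { AllocZABuffer, DeallocZABuffer, UseTPIDR2, Other };

struct ToyInstr {
  ToyOpcode Opc;
};

// Erase the alloc/dealloc pseudos when nothing uses the TPIDR2 object.
// Returns true if any instruction was removed.
bool removeUnusedZAPseudos(std::vector<ToyInstr> &Instrs) {
  bool HasUse = std::any_of(
      Instrs.begin(), Instrs.end(),
      [](const ToyInstr &I) { return I.Opc == ToyOpcode::UseTPIDR2; });
  if (HasUse)
    return false; // the buffer is needed, keep both pseudos

  auto IsZAPseudo = [](const ToyInstr &I) {
    return I.Opc == ToyOpcode::AllocZABuffer ||
           I.Opc == ToyOpcode::DeallocZABuffer;
  };
  auto NewEnd = std::remove_if(Instrs.begin(), Instrs.end(), IsZAPseudo);
  bool Changed = NewEnd != Instrs.end();
  Instrs.erase(NewEnd, Instrs.end());
  return Changed;
}
```

In the real backend the "has a use" check would presumably be a query on the frame index or on the value produced by the allocation node, which is another reason to model that value explicitly.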

We could then consider further splitting the first node into two nodes (one for creating the buffer and one for initialising the TPIDR2 object). But it's difficult to say what the best way forward is without trying it.

https://github.com/llvm/llvm-project/pull/81648