[llvm] [SystemZ] Enable MachineCombiner for FP reassociation (PR #83546)

Fri Apr 19 08:24:02 PDT 2024

JonPsson1 wrote:

> Instead of adding a CC clobber first and then removing it, did we try making optimizeLoadInstr only fold the load if CC is not live at this point? This is what foldMemoryOperandImpl does ... Would this have worse performance?

I agree that now that the general reg/reg scheme doesn't seem worthwhile, it may not be worth using the CC-pseudos and the extra pass to convert them back. I tried to do the CC check in optimizeLoadInstr() instead of using CC-pseudos. Performance-wise I see no difference, and doing it this way only misses a few foldings, which seem acceptable:

```
"CC pseudos" <-> "CC check"

seb            :                 4457                 4307     -150
adb            :                 9180                 9033     -147
sdb            :                 4336                 4275      -61
aeb            :                 6626                 6574      -52
mdb            :                 8717                 8715       -2
```

The backwards search for CC-def should be ok compile-time-wise as many instructions clobber CC, but if there emerges an issue, Peephole-opt could probably be extended easily to maintin a LivePhysRegs tracker. The good thing is that we can trust the MBB live-in list for CC.

> Talking about foldMemoryOperandImpl, it's a bit weird that there are three different target hooks to fold a load: optimizeLoadInstr, foldMemoryOperandImpl with FrameIndex, and foldMemoryOperandImpl with LoadMI.

I tried your suggestion to have optimizeLoadInstr() call foldMemoryOperand(), which in turn calls foldMemoryOperandImpl().  This was NFC on SPEC, and the only test changes I saw was that in SystemZ anyregcc.ll and stackmap.ll now fails. I am not sure about these, but it seems that foldMemoryOperand() optimizes these in foldPatchPoint and the result seems to be that some of the loads are folded into those instructions. I have not updated those tests yet as I am not sure if this is correct / desired on SystemZ (?).

Since foldMemoryOperand() sets the memoperands on the returned MI, it wouldn't work to call foldMemoryOperandImpl() directly from optimizeLoadInstr(). So alternatives then are to do these foldings in optimizeLoadInstr() as in HEAD^, or do it this way, possibly stopping on STACKMAP etc in optimizeLoadInstr() if that is not desired.

It seems slightly more extensible to have foldMemoryOperandImpl() do it, but maybe it doesn't really matter.

> X86 also implements an unfoldMemoryOperand callback. I was wondering if this would allow reassociation after creating reg/mem instruction, but it looks like MachineCombiner never even uses that callback. Still, would we see any benefit from implementing it? (If it's unrelated to the combiner, that is of course a separate topic.)

It seems that LICM can do this to hoist the memory access out of the loop, but that will also make the register live across the loop, so not sure how useful it would be. I Haven't tried it.

https://github.com/llvm/llvm-project/pull/83546