[PATCH] D147800: [SystemZ] Enable MachineCombiner for FP reassociation.

Fri Apr 7 11:36:49 PDT 2023

jonpa created this revision.
jonpa added a reviewer: uweigand.
Herald added subscribers: pengfei, hiraditya.
Herald added a project: All.
jonpa requested review of this revision.
Herald added a project: LLVM.
Herald added a subscriber: llvm-commits.

Have been experimenting with enabling MachineCombiner.

It seemed very likely better to me to let the MachineCombiner work on reg/reg opcodes as it didn't work at all when the input had ADBs. It is expecting opcodes to be the same LHS/RHS, and of course that didn't work when the RHS was a memory operand.

To do this I tried to first select all FP64 adds directly to WFADB always. The MachineCombiner now could do its work, and after that the Peephole optimizer will help with folding the loads again to ADB:

main <> patched: isel + peep of f64 adds, machine combiner disabled.

  adb            :                10796                 9057    -1739
  ...
  wfadb          :                10733                11691     +958
  adbr           :                 4345                 5203     +858
  ...

  Spill|Reload   :               642581               641631     -950
  Copies         :              1011928              1011420     -508

Most of the loads are folded, but not all (84%). Spilling/copies however improves, so not sure if it's worth further effort to fold everything, or even if that's better.

Preliminary benchmarking shows 20% (!) improvement on LBM - twice as much as with the SLP vectorizer (confirmed with a preliminary "full" run also).

Slight improvement on namd and imagick (~1%), without any reassociation, just with the add reg/mem change, which makes it seem like this worked out pretty well.

A complication, though, with this: changing VL64;WFADB -> ADB introduces a CC clobbering. Scanning from the WFADB and forward many times lead to a quick termination when a new CC def was encountered. There were however also many longer searches.

Measuring compile time for the Peephole pass shows a slight average increase, with a bad worst case or two:

main <> patch (without running machine combiner)

  Num stats: 3465         Num stats: 3465
  Average Wall: 0.49%   | Average Wall: 0.50%
  Wall %   Count          Wall %   Count
  4.0      1            | 9.3      1
  1.5      1            | 4.3      1
  1.4      4            | 3.3      1
                        > 2.7      1
                        > 1.5      2
                        > 1.4      5
  1.3      4              1.3      4
  1.2      3            | 1.2      2
  1.1      4            | 1.1      6
  1.0      10           | 1.0      11
  0.9      58           | 0.9      69
  0.8      152          | 0.8      148
  0.7      270          | 0.7      281
  0.6      586          | 0.6      611
  0.5      894          | 0.5      921
  0.4      842          | 0.4      801
  0.3      480          | 0.3      460
  0.2      139          | 0.2      121
  0.1      14           | 0.1      10

This might be acceptable perhaps.

Selecting WFADB (the real instruction, not a pseudo) directly in Select() lead to one file having a few cases of LOC instructions moved around in the block crossing the places where ADB would no longer be possible. Just in one file, but a dozen or so times inside a loop, so not quite ideal. Using an isel pseudo that clobbers CC remedied this as it was the isel scheduler that were causing this difference. With this, there are no cases of CC clobbering that prevents optimizeLoadInstr() on SPEC, but the scan to make sure still has to be made of course.

- One idea I tried was to simply add the CC def to WFADB, which then would be trivially replacable with ADB. It did cause some changes in ~50 files, but not sure if that would make a difference.

- A compromise would be to have a WFADB_CC pseudo live all the way through optimizeLoadInst() and then convert to the real opcode before scheduling. That would however require a new pass it seems, as now *all* of them has to be handled, while optimizeLoadInst() only sees the candidates for the load folding. If that new pass would have LiveIntervals available, the scan for CC would not be necessary as that query should now be available.

In short, this seems promising performance-wise even compared to the SLP vectorizer generating reductions, but I'm not quite sure which way to handle the CC clobbering problem is the best fit.

https://reviews.llvm.org/D147800

Files:
  llvm/include/llvm/CodeGen/TargetInstrInfo.h
  llvm/lib/Target/SystemZ/SystemZISelDAGToDAG.cpp
  llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
  llvm/lib/Target/SystemZ/SystemZInstrFormats.td
  llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
  llvm/lib/Target/SystemZ/SystemZInstrInfo.h
  llvm/lib/Target/SystemZ/SystemZInstrVector.td
  llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp
  llvm/lib/Target/X86/X86InstrInfo.cpp
  llvm/lib/Target/X86/X86InstrInfo.h
  llvm/test/CodeGen/SystemZ/fp-add-reassoc-01.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D147800.511746.patch
Type: text/x-patch
Size: 14864 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20230407/7e26b59f/attachment.bin>