[PATCH] D86519: [SystemZ] New pass for domain reassignment from integer to vector.

Jonas Paulsson via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Dec 1 04:09:44 PST 2020


jonpa updated this revision to Diff 308612.
jonpa added a comment.

Improvements since last update:

- Allow reassigned registers around calls by keeping them in a callee-saved FP lane.

While working on this, it seemed wrong to either add new pseudo callee-saved vector physregs or to change the regmask on the call instruction. It also seemed like a lot of work to have duplicated opcodes with FP reg operands, e.g. VAG_FP64, VAG_FP32. Instead, what was tried was to COPY the FP64 value out of the defined (reassigned) vector reg and then COPY it back into a new vector reg before each user.

Copying into FP64 subregs is somewhat problematic: the register coalescer typically just rewrites that back to the full vector register, unless those COPYs are specifically checked for and handled in SystemZRegisterInfo::shouldCoalesce(), which was added.
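To illustrate the kind of check that was added (a hedged sketch, not the actual code of the patch; the exact conditions, and the existing GR128 logic in this override, are omitted):

  // Sketch: refuse to coalesce a COPY between an FP64 vreg and a full
  // 128-bit vector vreg, so that the value can be kept in a callee-saved
  // FP lane across calls instead of being rewritten back to the full
  // vector register.
  bool SystemZRegisterInfo::shouldCoalesce(MachineInstr *MI,
                                           const TargetRegisterClass *SrcRC,
                                           unsigned SubReg,
                                           const TargetRegisterClass *DstRC,
                                           unsigned DstSubReg,
                                           const TargetRegisterClass *NewRC,
                                           LiveIntervals &LIS) const {
    if (MI->isCopy() &&
        ((SrcRC == &SystemZ::VR128BitRegClass &&
          DstRC == &SystemZ::FP64BitRegClass) ||
         (SrcRC == &SystemZ::FP64BitRegClass &&
          DstRC == &SystemZ::VR128BitRegClass)))
      return false; // Keep the cross-domain COPY.
    // ... existing checks (e.g. for GR128) ...
    return true;
  }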

Due to keeping these COPYs uncoalesced, I saw more LDRs in the output, which was partially remedied by adding regalloc hints for those cases. That helped in the simple case of a single user, but with more users two FP64 regs were unfortunately used: FP8D was COPYed into V8 at the first use, but for the second use FP9D was chosen, since V8 overlaps with FP8D and regalloc does not seem to be aware that V8 is only read. Because of this, only the case of a single user is currently handled.
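The hinting itself could look roughly like this (a hedged sketch using the upstream getRegAllocationHints() hook; the code in the patch may well differ):

  // Sketch: when an FP64 vreg is defined by a COPY out of a vector vreg
  // that already has a physreg assigned, hint the FP64 reg occupying the
  // same lane so the COPY can become a no-op instead of an LDR.
  bool SystemZRegisterInfo::getRegAllocationHints(
      Register VirtReg, ArrayRef<MCPhysReg> Order,
      SmallVectorImpl<MCPhysReg> &Hints, const MachineFunction &MF,
      const VirtRegMap *VRM, const LiveRegMatrix *Matrix) const {
    const MachineRegisterInfo &MRI = MF.getRegInfo();
    for (const MachineInstr &MI : MRI.def_instructions(VirtReg)) {
      if (!MI.isCopy() || !VRM)
        continue;
      Register Src = MI.getOperand(1).getReg();
      if (Src.isVirtual() && VRM->hasPhys(Src)) {
        // The FP64 reg in the same lane as the assigned vector reg.
        MCPhysReg FP = getSubReg(VRM->getPhys(Src), SystemZ::subreg_h64);
        if (is_contained(Order, FP))
          Hints.push_back(FP);
      }
    }
    return TargetRegisterInfo::getRegAllocationHints(VirtReg, Order, Hints,
                                                     MF, VRM, Matrix);
  }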

In order to make this work, a new VF128Saved regclass containing V8-V15 was needed, to keep the FP64 values in the callee-saved registers.

A drawback is that the function now has to save/restore those FP regs, even if it is a small function with no other vector registers in use.

I also added a tuning parameter for this ("vecsavedlim") to avoid increased spilling due to VF128Saved registers when doing too much of this.
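The spelling and default below are illustrative assumptions; they just show the usual LLVM way of exposing such a knob:

  #include "llvm/Support/CommandLine.h"
  using namespace llvm;

  // Hypothetical definition of the tuning parameter: stop reassigning
  // closures around calls once this many vector regs are live, to avoid
  // extra spilling of the callee-saved VF128Saved registers.
  static cl::opt<unsigned> VecSavedLim(
      "vecsavedlim",
      cl::desc("Max live vector regs when reassigning around calls"),
      cl::init(12)); // Default is an assumption; see the tuning below.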

The static effect of this was a rise in the number of reassigned closures from 8000 to 9000; 16000 closures are now no longer marked illegal due to calls.

For this to work, all the reassigned regs (around calls) have to be in the FP lane. Out of the closures rejected due to calls, some had this as the reason (15%), but the main problem was when the defining instruction was outside of the loop while a call was inside the closure's loop (+50%). In that case the LDR would remain in the loop if reassigned. That may still be beneficial, but it hasn't been tried yet (see "Ideas to try next" below).
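In pseudocode terms, the call-related legality check might look like this (a hedged sketch; Closure, CallsCrossed, liveAcross(), canLiveInFPLane() and getDef() are hypothetical names):

  // Accept a closure crossing calls only if every reassigned reg that is
  // live across a call can be kept in a callee-saved FP lane, and reject
  // it if the def is outside a loop containing the call (that would leave
  // an LDR inside the loop).
  bool Closure::isLegalAroundCalls(const MachineLoopInfo &MLI) const {
    for (const MachineInstr *Call : CallsCrossed) {
      for (Register Reg : liveAcross(Call)) {
        if (!canLiveInFPLane(Reg))
          return false; // ~15% of the rejections.
        const MachineInstr *Def = getDef(Reg);
        MachineLoop *L = MLI.getLoopFor(Call->getParent());
        if (L && !L->contains(Def->getParent()))
          return false; // >50% of the rejections.
      }
    }
    return true;
  }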

- Vector Element Compare

After the experiments with Vector Test under Mask proved fruitless, I replaced that approach with using VEC (Vector Element Compare) whenever possible (lane G0 / F1). Given that comparisons are typically used before branches and are optimized in SystemZEliminateCompares, and since reassigned arithmetic instructions (e.g. VAG) cannot be used for comparison elimination the way scalars (e.g. AGR) can, I have not tried running this on benchmarks. The idea here was mainly to eliminate VLGVs (extractions) already present in the code after isel. With the TM approach, +2000 were eliminated (however without giving any speedup), but with VEC only 145 have been eliminated so far...

- Experiments on benchmarks:

I added options to enable different types of converters, in order to be more selective about which types of instructions are converted. With -allconv=false, only loads and stores are converted. With -regreg, register-register instructions are also converted, e.g. AGRK -> VAG, and so on.
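For illustration, a reg-reg converter could be as simple as an opcode rewrite once the closure's registers have been re-constrained to VR128 (a hedged sketch with hypothetical class names, in the style of the X86 domain reassignment converters):

  // Sketch: rewrite a scalar 64-bit add (AGRK) into its vector
  // doubleword counterpart (VAG).
  struct RegRegConverter {
    unsigned SrcOpcode; // e.g. SystemZ::AGRK
    unsigned DstOpcode; // e.g. SystemZ::VAG

    bool convert(MachineInstr *MI, const SystemZInstrInfo *TII) const {
      if (MI->getOpcode() != SrcOpcode)
        return false;
      // Same operand shape (dst, src1, src2), so only the opcode needs
      // to change.  Note: AGRK also clobbers CC, so the real pass must
      // check that CC is dead (omitted here).
      MI->setDesc(TII->get(DstOpcode));
      return true;
    }
  };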

In order to maximize the saved spilling/reloading, I started experimenting with the register pressure limits and available conversions. The result was a bit surprising...

I started with just load/store conversions (which is what I initially believed would be a simple and good improvement). I could improve a bit on the initial estimates of gprlim=16 ("when this many GPRs are live, try to reassign a closure"), veclim=28 ("reassign as long as no more than this many vector regs are live") and vecsavedlim ("reassign around calls using FP lanes as long as no more than this many vector regs are live").

(gprlim / veclim / vecsavedlim)
16 / 28 / 16 : 1305 fewer spill/reload instructions.
20 / 40 / 12 : 1571 fewer...
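Expressed as code, the gating that these limits implement would be something like the following (a hedged sketch with hypothetical names; the pass presumably computes per-closure liveness):

  // Decide whether to reassign a closure, given the register pressure
  // at the closure's program points.
  bool shouldReassign(unsigned LiveGPRs, unsigned LiveVecs, bool CrossesCall,
                      unsigned GPRLim, unsigned VecLim, unsigned VecSavedLim) {
    if (LiveGPRs < GPRLim)
      return false; // GPR pressure too low to be worth it.
    unsigned Limit = CrossesCall ? VecSavedLim : VecLim;
    return LiveVecs <= Limit; // Don't raise vector pressure too far.
  }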

I could not get it better than that, so I then added the "reg-reg" converters (negative numbers below mean fewer spill/reload instructions):

16 / 40 / 12 : -9078

At this point it was also clear that the trick of inserting scalars that had no converter played a key role here: some 8000 more VLVG instructions were present. Without doing insertions, the savings in spilling did not happen.
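For reference, such an insertion is a VLVG of the scalar into lane G0 of a fresh vector reg; emitting one could look like this (a hedged sketch; the operand order follows the usual SystemZ base/displacement convention and should be treated as an assumption):

  // Sketch: make a GR64 value available in lane G0 of a new 128-bit
  // vreg so converted vector instructions can consume it.
  Register materializeInVectorLane(MachineBasicBlock &MBB,
                                   MachineBasicBlock::iterator InsPt,
                                   Register GR64Reg, const DebugLoc &DL,
                                   MachineRegisterInfo &MRI,
                                   const SystemZInstrInfo *TII) {
    Register Undef = MRI.createVirtualRegister(&SystemZ::VR128BitRegClass);
    Register Vec = MRI.createVirtualRegister(&SystemZ::VR128BitRegClass);
    BuildMI(MBB, InsPt, DL, TII->get(TargetOpcode::IMPLICIT_DEF), Undef);
    BuildMI(MBB, InsPt, DL, TII->get(SystemZ::VLVGG), Vec)
        .addReg(Undef)   // Tied input; the other lane stays undefined.
        .addReg(GR64Reg) // The scalar to insert.
        .addReg(0)       // Element index address: no base register...
        .addImm(0);      // ...displacement 0, i.e. element 0 (lane G0).
    return Vec;
  }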

Adding just "shifts" without "reg-reg" actually made things worse, but with "reg-reg" + "shifts" I got

20 / 40 / 12 : -9478

+regreg +shifts +regexts was, for some reason, much worse:

20 / 40 / 12 : -3118

+regreg +shifts +fpconv, however, was a nice improvement. Here it also seemed right to limit the number of scalar insertions per closure to 1 instead of 4, since that gave a similar result with fewer VLVGs:

(gprlim / veclim / vecsavedlim / max insertions per closure)
20 / 40 / 12 / 4   : -13468   (~ +8100 VLVGs)
20 / 40 / 12 / 1   : -12800   (~ +5400 VLVGs)

I was getting ready to also add the reg-imm and reg-mem converters, but they did not improve things the way I was expecting:

+regreg +shifts +fpconv +reg-imm:
20 / 40 / 12 / 4   : -13614   (~ +8400 VLVGs)

+regreg +shifts +fpconv +reg-mem:
20 / 40 / 12 / 1  :   -3724   (~ +3000 VLVGs)         !

+regreg +shifts +fpconv +reg-imm +reg-mem:
20 / 40 / 12 / 1   : -3827   (~ +3500 VLVGs)         !

reg-imm was improving things just a little, but not enough to motivate the handling needed to support it... And it certainly seems that reassigning reg-mem instructions does not help register pressure. I don't know exactly why that is yet, but for some reason it seems to work better to instead do a VLVG after the scalar reg-mem instruction.

Adding compare conversions did not seem to save any register pressure either, rather the contrary... It also gave more insertions, so maybe this means that a scalar value was inserted only to then be used in a comparison, which may not really help register pressure...

+regreg +shifts +fpconv +compares
20 / 40 / 12 / 1   : -8200    (~+10000 VLVGs)

LoadAddress conversions also did not seem to be good at all here.

This means that the best setting I have so far is "+regreg +shifts +fpconv", which saves ~13000 spill/reload instructions, or about 1.8%. (There are currently also some additional register moves: +1000, or 0.1%, which I haven't looked into further.)

I tried this overnight:

master vs "+regreg +shifts +fpconv":

Improvements:
0.981: f538.imagick_r
0.995: f508.namd_r

Regressions:
1.033: i523.xalancbmk_r
1.016: i500.perlbench_r
1.015: i505.mcf_r
1.010: i520.omnetpp_r
1.009: f526.blender_r
1.009: i541.leela_r
1.007: f507.cactuBSSN_r
...

master vs "+regreg +shifts +fpconv +immloads":
Improvements 2017_dev_loopstats_with_immloads:
0.985: i541.leela_r
0.996: f538.imagick_r

Regressions 2017_dev_loopstats_with_immloads:
1.013: i520.omnetpp_r
1.012: i505.mcf_r
1.010: f526.blender_r
1.010: i500.perlbench_r
1.007: f519.lbm_r
1.006: i523.xalancbmk_r
...

These (and a few more variations, including one with only "reduce VLGVs") showed nothing but mixed results :-/

For the functions that got a different spill count with "+regreg +shifts +fpconv, 20 / 40 / 12 / 1", I made a graph: F14475709: DomainReassignment.png <https://reviews.llvm.org/F14475709>. It shows that there are cases that actually get more GPR spilling (to the right), and also some VEC spilling (to the left)...

Ideas to try next:

  calls handling:
   - Allow LDR in loop (don't reject closures with a def before the loop and a call in the loop).
   - Allow multiple users (gives more LDRs).
  
  Try to be more aggressive in big regions: extractions, comparisons, ..?  If the liveness is long, it may not be so bad to do an extraction...
  
  Try to limit the number of insertions somehow when they are not useful. There are quite a lot of VLVGs added, and I suspect some of the closures using insertion are not really helping register pressure - especially when the register is killed soon after the insertion (a possible shape of such a heuristic is sketched after this list).
  
  Try to optimize the settings for innermost loops. So far, only values for whole functions have been used to find the best tuning.
  
  Look more into worst-cases (see graph) for further improvements and tuning.
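For the insertion-limiting idea, one possible (purely hypothetical) shape of such a heuristic, based on the live interval of the scalar:

  // Only insert a scalar into a vector lane if the value stays live past
  // this use; if it dies here, the insertion shortens no GPR live range.
  bool insertionLikelyHelps(Register GR64Reg, const MachineInstr &UseMI,
                            const LiveIntervals &LIS) {
    const LiveInterval &LI = LIS.getInterval(GR64Reg);
    SlotIndex UseIdx = LIS.getInstructionIndex(UseMI);
    return LI.endIndex() > UseIdx.getNextIndex();
  }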


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86519/new/

https://reviews.llvm.org/D86519

Files:
  llvm/lib/CodeGen/TargetInstrInfo.cpp
  llvm/lib/Target/SystemZ/CMakeLists.txt
  llvm/lib/Target/SystemZ/SystemZ.h
  llvm/lib/Target/SystemZ/SystemZDomainReassignment.cpp
  llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
  llvm/lib/Target/SystemZ/SystemZInstrInfo.h
  llvm/lib/Target/SystemZ/SystemZInstrVector.td
  llvm/lib/Target/SystemZ/SystemZRegisterInfo.cpp
  llvm/lib/Target/SystemZ/SystemZRegisterInfo.td
  llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp
  llvm/test/CodeGen/SystemZ/buildvector-00.ll
  llvm/test/CodeGen/SystemZ/dag-combine-01.ll
  llvm/test/CodeGen/SystemZ/dag-combine-03.ll
  llvm/test/CodeGen/SystemZ/domain-reassignment-01.ll
  llvm/test/CodeGen/SystemZ/domain-reassignment-02.ll
  llvm/test/CodeGen/SystemZ/domain-reassignment-03.ll
  llvm/test/CodeGen/SystemZ/domain-reassignment-04.ll
  llvm/test/CodeGen/SystemZ/domain-reassignment-05.ll
  llvm/test/CodeGen/SystemZ/domain-reassignment-06.ll
  llvm/test/CodeGen/SystemZ/domain-reassignment-07.ll
  llvm/test/CodeGen/SystemZ/domain-reassignment-08.ll
  llvm/test/CodeGen/SystemZ/domain-reassignment-09.ll
  llvm/test/CodeGen/SystemZ/domain-reassignment-11.ll
  llvm/test/CodeGen/SystemZ/domain-reassignment-12.ll
  llvm/test/CodeGen/SystemZ/domain-reassignment-13.ll
  llvm/test/CodeGen/SystemZ/domain-reassignment-14.ll
  llvm/test/CodeGen/SystemZ/domain-reassignment-15.ll
  llvm/test/CodeGen/SystemZ/knownbits.ll
  llvm/test/CodeGen/SystemZ/stack-clash-protection.ll
  llvm/test/CodeGen/SystemZ/subregliveness-04.ll
  llvm/test/CodeGen/SystemZ/tls-08.ll
  llvm/test/CodeGen/SystemZ/vec-trunc-to-i1.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D86519.308612.patch
Type: text/x-patch
Size: 262171 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20201201/7d9ff211/attachment-0001.bin>

