[PATCH] D44092: [SystemZ] Improve side steering of FPd unit and FXU registers.

Mon Jan 11 11:56:18 PST 2021

jonpa updated this revision to Diff 315875.
jonpa edited subscribers, added: Andreas-Krebbel; removed: MatzeB.
jonpa added a comment.
Herald added subscribers: llvm-commits, hiraditya.
Herald added a project: LLVM.

This patch has been improved to make use of B2B information. B2BW, B2BR, and B2BRW FUs have been added to the SchedModel so that instructions can be modeled to use these. B2BRW is not really needed, but I tried using it for readability. This is one way of keeping track of which instructions can read and/or write B2B - a disadvantage is that the enum for the ProcResources is not available from TableGen so that has been added locally instead for now. It looked like there was probably enough irregularity among the opcodes to motivate this approach - although the differences between subtargets were very small.
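
As an illustration of the read/write bookkeeping described above, here is a minimal sketch. It assumes a plain bit encoding; the enum and helper names are hypothetical stand-ins and are not the ProcResource enum added in the patch. It also shows why B2BRW is only a readability convenience: it is just the combination of the other two.

```cpp
#include <cassert>

// Hypothetical stand-ins for the B2BW / B2BR / B2BRW units named above.
// The patch models these as SchedModel FUs; this bit encoding is only an
// illustration and does not mirror the actual enum values.
enum B2BKind : unsigned {
  B2BNone = 0,
  B2BW = 1,            // writes a B2B result
  B2BR = 2,            // reads a B2B result
  B2BRW = B2BW | B2BR  // does both
};

constexpr bool writesB2B(B2BKind K) { return (K & B2BW) != 0; }
constexpr bool readsB2B(B2BKind K) { return (K & B2BR) != 0; }

// Combine two classifications, e.g. from two modeled resources of one opcode.
inline B2BKind mergeB2B(B2BKind A, B2BKind B) {
  B2BKind R = static_cast<B2BKind>(A | B);
  assert(R <= B2BRW && "only two bits are defined in this sketch");
  return R;
}
```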

One open question is what happens if one instruction defines a high register on one side, and another instruction defines the low part on the other side. Should these subregs be tracked separately, or is it always the full (64-bit) reg that is written? (What about 128-bit?)

I revisited the earlier question of whether the first instruction in an MBB really begins a new decoder group. A naive estimate can be made by categorizing the incoming edges by type and relative frequency (this ignores probabilities):

- In these cases, the assumption is correct (the MBBs are scheduled in linear order, so an unscheduled pred will also be taking a branch):

  Multiple predecessors, incoming Taken Branch: 29%
  Multiple predecessors, block not scheduled: 10%
  Single predecessor, sched-state known, taken branch: 9%
  Multiple predecessors, linear pred ends group: 8%
  Entry-blocks: 3%
  Single predecessor, not scheduled : 2%

- These edges (blocks) simply continue as before, right or wrong:

  Single predecessor, sched-state known, linear pred: 31%

- These edges mean that the scheduler is wrong:

  Multiple predecessors, linear pred has 1 in group: 6%
  Multiple predecessors, linear pred has 2 in group: 2%

In summary,
61% of the incoming edges are known to lead to correct scheduling.
8% of the incoming edges are known to lead to an unmodelled group offset in the scheduler.
31% continue from before, so they should not change the ratio, which is then actually 88% vs 12%.
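
The renormalization in the last line can be checked with a few lines of C++. This is only a sketch of the arithmetic, using the percentages from the lists above; the function names are made up for illustration and are not part of the patch.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: share of "correct" edges once the edges that merely
// continue the previous state (31%) are left out of the total.
inline double correctShare(double CorrectPct, double WrongPct) {
  assert(CorrectPct + WrongPct > 0 && "need at least one classified edge");
  return 100.0 * CorrectPct / (CorrectPct + WrongPct);
}

// Sums the per-category percentages listed above and rounds the share.
inline long roundedCorrectShare() {
  const double Correct = 29 + 10 + 9 + 8 + 3 + 2; // 61
  const double Wrong = 6 + 2;                      // 8
  return std::lround(correctShare(Correct, Wrong));
}
```

Here 61 / (61 + 8) is roughly 0.884, which rounds to the 88% quoted above and leaves 12% for the edges with an unmodelled offset.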

As soon as a cracked instruction is scheduled, the scheduler is right again after that point, so the above might be seen as a bit pessimistic.

So it seems that the scheduler is generally right in this assumption about 9 times out of 10, even though this does not take into account the actual hotness of the edges. And even if the grouping is off, there is still a chance for the bypass if the instructions are scheduled next to each other:

  [x _ x]  90%
  [_ x x]  97%
  [x x _]  93%
  [x _ _][_ _ _]
  [x _ _] 100%

The next question is how effective the scheduler is in actually producing a schedule that puts B2B reads on the same side as their B2B writes under the assumption that it can track the current decoder slot. Without any particular heuristic, this should by chance be 50/50 - with a random schedule half of the reads end up on the right side.

Without the B2B side-steering enabled during node selection:

  B2B reads with good schedule: ~8%   (58% ratio)
  B2B reads with bad schedule : ~6%

With side-steering of B2B reads only (-sidesteer-fxu):

  B2B reads with good schedule: ~9%   (71% ratio)
  B2B reads with bad schedule : ~4%

With side-steering of B2B reads and writes (-sidesteer-fxu -sidesteer-lastslot):

  B2B reads with good schedule: ~10%  (75% ratio)
  B2B reads with bad schedule : ~3%

This shows an improvement, with a higher ratio of well-scheduled B2BR nodes.

The bypass heuristic is used with a lower priority than grouping or resources, but those costs were present in only 2% of the cases of a bypass cost.

When any B2B (write or read) cost was scheduled, there were generally not many nodes available to choose from:

  1 available: 50%
  2 available: 27%
  3 available: 11%
  6 or more available: 3%

  For the B2BW nodes which did not get handled, in 92% of the cases it was the only node available.
  For the B2BR nodes which ended up on the wrong side, 88% of them were the only node available.

So it seems that the potential of this patch is limited by the fact that the starting point is not "0%", but rather around 50%, and also because the availability of alternate nodes is typically low.

With -sidesteer-exact, the scheduler only aims for a following decoder group on the same slot (modulo 6 instructions), which would be immune to incorrect tracking of decoder groups (linear predecessor fall-through).
This gave far fewer reads known to be beneficially scheduled, which should be due to the low number of available instructions.
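
The slot arithmetic behind -sidesteer-exact can be sketched as below. This is an illustration only: it assumes decoder groups of three slots with the side alternating per group, which gives the period of 6 instructions mentioned above. The names are hypothetical and do not come from the patch.

```cpp
#include <cassert>

// Assumptions for this sketch: 3 slots per decoder group, and the side
// alternates with each group, giving a period of 6 decoded instructions.
constexpr unsigned SlotsPerGroup = 3;
constexpr unsigned Period = 2 * SlotsPerGroup;

// Side (0 or 1) of a decoder slot index counted from an assumed group start.
constexpr unsigned sideOf(unsigned SlotIdx) {
  return (SlotIdx % Period) / SlotsPerGroup;
}

// "Exact" match: the reader lands on the same slot as the writer, a whole
// number of periods later.
constexpr bool isExactSlot(unsigned WriterIdx, unsigned ReaderIdx) {
  return ReaderIdx > WriterIdx && (ReaderIdx - WriterIdx) % Period == 0;
}

// Checks that an exact-slot pair stays on one side for any tracking offset,
// i.e. even if the assumed group start is wrong.
inline bool exactMatchIsOffsetProof(unsigned WriterIdx, unsigned ReaderIdx) {
  assert(isExactSlot(WriterIdx, ReaderIdx) && "expects an exact-slot pair");
  for (unsigned Off = 0; Off < Period; ++Off)
    if (sideOf(WriterIdx + Off) != sideOf(ReaderIdx + Off))
      return false;
  return true;
}
```

Because only the distance between writer and reader matters, a wrong guess about where the first group of the MBB starts shifts both indices equally and leaves the match intact. The price is that far fewer candidate positions qualify.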

Possible improvements / ideas:

- If an instruction uses two registers both defined with a B2BW, one could try to put both definitions on the same side.
- It has been a good while since I checked, but it may be worth looking into "breaking anti-dependencies" before post-ra sched, to perhaps make more instructions available.

Benchmarks:

I compared master to "-sidesteer-fxu -sidesteer-lastslot" (1), "-sidesteer-fxu -sidesteer-exact" (2), and "-sidesteer-fxu -newfpd-sides" (3).

1. This gave small mixed results during the first run of SPEC-17. A few benchmarks were then rerun in "full" mode and it seemed that out of these namd and xalancbmk improved ~1%. Namd is not an integer benchmark, but maybe this was related to some induction variable in some loop?

2. In the "full" run this gave a 2% improvement on xalancbmk, but also a 2% regression on omnetpp and a 1% regression on lbm.

3. This used the tracked groups for FPd ops, but this did not seem to improve benchmarks this time around either.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D44092/new/

https://reviews.llvm.org/D44092

Files:
  llvm/lib/Target/SystemZ/SystemZHazardRecognizer.cpp
  llvm/lib/Target/SystemZ/SystemZHazardRecognizer.h
  llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp
  llvm/lib/Target/SystemZ/SystemZMachineScheduler.h
  llvm/lib/Target/SystemZ/SystemZSchedule.td
  llvm/lib/Target/SystemZ/SystemZScheduleZ14.td
  llvm/lib/Target/SystemZ/SystemZScheduleZ15.td
  llvm/test/CodeGen/SystemZ/postra-sched-sidesteer.mir

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D44092.315875.patch
Type: text/x-patch
Size: 90296 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20210111/7ec2ee65/attachment.bin>