[PATCH] D44092: [SystemZ] Improve side steering of FPd unit and FXU registers.

Jonas Paulsson via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Sun Feb 7 16:57:14 PST 2021


jonpa updated this revision to Diff 322006.
jonpa added a comment.

I started simplifying the patch and handled one minor regression, and then realized something... (see below :-)

Patch simplified towards getting ready to commit:

- The look-ahead algorithm has been removed.
- Minor regression on imagick handled by adding a "height cutoff": I figured that the only way this patch could make things worse was by pulling lower nodes up in the schedule so far that the critical path would be slowed down. I found that by using a height cutoff that takes the higher node regardless of the B2B cost, I could get nearly all of the lbm improvement while also eliminating the imagick regression (a 2% slowdown). With the best value (-2), I now see +100k more "Good" edges instead of +147k, but benchmarking seemed to say this was better even though the total improvement number is lower (see the sketch after this list).

- B2BWrites: The "OtherOpSides" handling checked whether the B2BR of an *unscheduled* B2BW has *another* B2BR operand whose definition is already scheduled on the other ("wrong") side. If so, a positive cost was returned in the hope of scheduling the B2BW later on the other side. This gave merely <500 more "Good" operands: 96% of the B2B readers have only one B2B operand, and this handling gives only an extra 0.3% improvement, so it was removed.

- The "OtherPred" handling is meant to handle the case where an available (unscheduled) B2BW has a B2BR successor, but the B2BR also has one other unscheduled predecessor. This gave ~7k more "Good" operands (without the height cutoff). It is still here, but it could perhaps be removed. The "triangle" cases (where that other B2BR predecessor is the B2BW SU itself) did not seem to give any benefit on benchmarks (about 900 more "Good" instructions), so they were removed for the sake of simplicity.
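
To make the height cutoff concrete, here is a minimal sketch - not the actual patch code, and all names (SUnitInfo, HeightCutoff, prefer) are made up for illustration - of how such a cutoff could gate the B2B cost when comparing two available nodes during bottom-up picking:

  // The B2B side-steering cost only decides the pick when the two heights
  // are within the cutoff of each other; a clearly higher node wins
  // outright, so low nodes cannot be pulled up far enough to slow down the
  // critical path.
  struct SUnitInfo {
    int Height;  // distance to the region exit (critical-path height)
    int B2BCost; // negative = reads its B2B operand on the "Good" side
  };

  // Best value found by benchmarking (see above).
  constexpr int HeightCutoff = -2;

  // Return true if A should be picked before B (bottom-up).
  bool prefer(const SUnitInfo &A, const SUnitInfo &B) {
    int HeightDiff = A.Height - B.Height;
    if (HeightDiff < HeightCutoff)
      return false; // A is more than 2 below B: take B regardless of B2B.
    if (-HeightDiff < HeightCutoff)
      return true;  // B is more than 2 below A: take A regardless of B2B.
    // Heights are close enough: let the B2B cost decide.
    return A.B2BCost < B.B2BCost;
  }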

At this point I was happy to have a 10% improvement on LBM, with the imagick slowdown I had been seeing for a while now eliminated. There were actually 1-2 other 2-3% improvements as well with the reduced benchmarking, but with -F there was only LBM.

Since only one single benchmark improved notably, I wanted to see if this was really B2B related or not - LBM contains many FP divides, and it might also be related to grouping (different schedules might make a cracked instruction available on the right slot, etc.).

Experiments showed that rescheduling MBB3 *alone* in lbm.s (LBM_performStreamCollide_TRT) with this patch applied gives the 8-9% speedup! And it did indeed seem to be due to B2B side steering: with the patch, the LA of R9 and 5 of its 6 users are on the same side, and all the AGFIs are on the right side, which seems near ideal - the data flow happening on the same side:

patched:

          jhe     .LBB8_8             // NT branch on first slot
  .LBB8_3:                                # %for.body
  					# =>This Inner Loop Header: Depth=1
  	la      %r9, 0(%r1,%r3)
  	l       %r0, 152(%r1,%r2)
       ld      %f1, 0(%r1,%r2)
       pfd     1, 1432(%r1,%r2)
       pfd     2, 1280(%r1,%r3)
  	lgr     %r5, %r9            // R9 B2B
  	lgr     %r8, %r9
  	lgr     %r7, %r9
       pfd     2, -14704(%r1,%r3)
       pfd     2, 17288(%r1,%r3)
       lgr     %r6, %r9              // R9 read on wrong side just once
  	lgr     %r4, %r9           // R9 B2B
  	lgr     %r10, %r9
  	agfi    %r5, 1617368
       agfi    %r6, -1614608         // R6 B2B
       pfd     2, 0(%r6)
       pfd     2, 0(%r5)
  	agfi    %r8, -1582624      // R8, R7, R4 B2B
  	agfi    %r7, 1585384
  	agfi    %r4, 1601320
       pfd     2, 0(%r4)
       pfd     2, 0(%r7)
       pfd     2, 0(%r8)
  	agfi    %r10, -1598672     // R10 B2B
  	tmll    %r0, 1
  	pfd     2, 0(%r10)
       jne     .LBB8_1

unpatched:

          jhe     .LBB8_8             // NT branch on first slot
  .LBB8_3:                                # %for.body
  					# =>This Inner Loop Header: Depth=1
  	l       %r0, 152(%r1,%r2)
  	ld      %f1, 0(%r1,%r2)
        la      %r9, 0(%r1,%r3)
        lgr     %r5, %r9
        lgr     %r8, %r9
  	lgr     %r7, %r9          // 3 R9 reads on wrong side
  	lgr     %r6, %r9
  	lgr     %r4, %r9
        lgr     %r10, %r9
        agfi    %r5, 1617368
        agfi    %r8, -1582624
  	agfi    %r7, 1585384
  	agfi    %r6, -1614608
  	pfd     1, 1432(%r1,%r2)
        agfi    %r4, 1601320        // R4 wrong side
        pfd     2, 1280(%r1,%r3)
        pfd     2, -14704(%r1,%r3)
  	agfi    %r10, -1598672    // R10 wrong side
  	tmll    %r0, 1
  	pfd     2, 17288(%r1,%r3)
        pfd     2, 0(%r10)
        pfd     2, 0(%r4)
        pfd     2, 0(%r6)
  	pfd     2, 0(%r7)
  	pfd     2, 0(%r8)
  	pfd     2, 0(%r5)
         jne     .LBB8_1

The trunk (unpatched) schedule is not entirely bad, since it randomly puts several reads B2B. It seems that in this case, with many LGRs from the same LA which are in turn used by AGFIs, it is clearly beneficial to have all of that happening on one side. Since this involves a lot of register moves, maybe the z15 machine has better register renaming support, which would explain why there is no further benchmark improvement on that machine?

So this was not due to FP divides or cracked instructions - it quite likely was benefiting from the side steering, which is what I wanted to see. BUT...

That code itself is only intended to compute the addresses for the prefetches, and it is far from optimal: a single LA, with multiple LGRs and AGFIs then used by PFDs without any displacement. So I decided to try to hand-code that into better assembly:

  jhe     .LBB8_8
  .LBB8_3:                                # %for.body
  					# =>This Inner Loop Header: Depth=1
  	l       %r0, 152(%r1,%r2)
  	ld      %f1, 0(%r1,%r2)
  	la      %r9, 0(%r1,%r3)     // r9 is live out
  	la      %r7, 0(%r1,%r3)     // the other regs (r4-r8 seemed to be local only)
  	la      %r8, 0(%r1,%r3)     // (la seems just as fast as lgr per table...)
  	agfi    %r7, 1585384
  	agfi    %r8, -1614608
  	pfd     1, 1432(%r1,%r2)
  	pfd     2, 1280(%r1,%r3)
  	pfd     2, -14704(%r1,%r3)
  	tmll    %r0, 1
  	pfd     2, 17288(%r1,%r3)
  	pfd     2, 15936(%r8)
  	pfd     2, 15936(%r7)
  	pfd     2, 0(%r8)
  	pfd     2, 0(%r7)
  	pfd     2, 31984(%r7)
  	pfd     2, 31984(%r8)
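
For reference, the arithmetic behind the hand-coded version: the six AGFI immediates from the original code are pairwise 15936 and 31984 apart, so they collapse onto just two bases, with deltas small enough to go into the PFD displacements (PFD takes a signed 20-bit displacement):

  -1614608 + 0     = -1614608   (%r8)
  -1614608 + 15936 = -1598672
  -1614608 + 31984 = -1582624
   1585384 + 0     =  1585384   (%r7)
   1585384 + 15936 =  1601320
   1585384 + 31984 =  1617368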

Would doing this alone *without* this B2B patch provide the same benefit? Yes! And even more than before: 12%!!

I reran LBM with -F:

  clang unpatched                    : 335s  100%
  Hand-coded per above               : 295s   88%
  GCC (ffp-contract=off)             : 362s  108%
  clang, disabling extra prefetching : 479s  143%

It looks like clang benefited heavily (and still does) from the increased prefetching, but there was still a further improvement being missed in how the addresses for those prefetches are generated.

It seems to me now that the address computations should be handled first, and then this patch could possibly be revisited (I tried putting the two LAs/AGFIs I used on the *wrong* sides, and it did not seem to matter...).


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D44092/new/

https://reviews.llvm.org/D44092

Files:
  llvm/lib/Target/SystemZ/SystemZHazardRecognizer.cpp
  llvm/lib/Target/SystemZ/SystemZHazardRecognizer.h
  llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp
  llvm/lib/Target/SystemZ/SystemZMachineScheduler.h
  llvm/lib/Target/SystemZ/SystemZSchedule.td
  llvm/lib/Target/SystemZ/SystemZScheduleZ14.td
  llvm/lib/Target/SystemZ/SystemZScheduleZ15.td
  llvm/test/CodeGen/SystemZ/postra-sched-sidesteer.mir

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D44092.322006.patch
Type: text/x-patch
Size: 91151 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20210208/392a8cde/attachment.bin>

