[PATCH] D45229: [MI-sched] schedule following instruction latencies

Sebastian Pop via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Apr 4 11:34:15 PDT 2018


sebpop added a comment.

In https://reviews.llvm.org/D45229#1056487, @MatzeB wrote:

> - Are you sure you are using a good scheduling model?


I have seen badly scheduled code for -mcpu=cortex-a57 and the exynos-m* tunings.

Curiously, cortex-a53 gets the best code, after inst-combine manages to fuse all the byte ld/st.
Evandro will investigate why this does not happen for the other -mcpu tunings.

> In my experience real world looses all too fast when additional latency heuristics fight against register pressure. (See also the code around tryLatency()).

Thanks for the pointer to tryLatency(): the good news is that the function
contains exactly the same logic as I wrote in this patch.
The bad news is that tryLatency() is never executed, because TryCand.Policy.ReduceLatency is false:
it is only set in the following code, so the problem may be in this logic:

  // Schedule aggressively for latency in PostRA mode. We don't check for
  // acyclic latency during PostRA, and highly out-of-order processors will
  // skip PostRA scheduling.
  if (IsPostRA || (RemLatency + CurrZone.getCurrCycle() > Rem.CriticalPath)) {
    Policy.ReduceLatency |= true;

At the beginning of scheduling we are not yet past the critical path, so ReduceLatency is not set.
We then schedule an instruction without looking at its latency, which introduces a bubble.
The bubble makes us exceed the critical path, which finally forces ReduceLatency to be set,
but only for the remaining instructions still to be scheduled.

In https://reviews.llvm.org/D45229#1056570, @javed.absar wrote:

> Hi Sebastian:
>  For the test case, the change looks good, though I am surprised the current scheduling algorithm does that (i.e. scheduling store just one cycle after load which definitely would result in stalls, especially for in-order processors).


See the comment I added inline at the place that falls through to the original scheduling order.



================
Comment at: llvm/lib/CodeGen/MachineScheduler.cpp:3018
         || (!Zone->isTop() && TryCand.SU->NodeNum > Cand.SU->NodeNum)) {
       TryCand.Reason = NodeOrder;
     }
----------------
If none of the heuristics above early-returns, this tie-break picks the instruction with the smallest SU number in top-down scheduling, and the one with the largest SU number in bottom-up scheduling.

In the testcase compiled with -mcpu=cortex-a57, the load at SU(8) is picked by the bottom-up scheduler before the other roots SU(3), SU(5), and SU(7), for this "ORDER" reason.

```
*** Final schedule for %bb.0 ***
SU(0):   %1:gpr64common = COPY $x1
SU(1):   %0:gpr64common = COPY $x0
SU(2):   %2:gpr32 = LDRBBui %1:gpr64common, 0 :: (load 1 from %ir.out, !tbaa !2)
SU(4):   %3:gpr32 = LDRBBui %1:gpr64common, 1 :: (load 1 from %ir.incdec.ptr, !tbaa !2)
SU(6):   %4:gpr32 = LDRBBui %1:gpr64common, 2 :: (load 1 from %ir.incdec.ptr2, !tbaa !2)
SU(3):   STRBBui %2:gpr32, %0:gpr64common, 0 :: (store 1 into %ir.in, !tbaa !2)
SU(5):   STRBBui %3:gpr32, %0:gpr64common, 1 :: (store 1 into %ir.incdec.ptr1, !tbaa !2)
SU(7):   STRBBui %4:gpr32, %0:gpr64common, 2 :: (store 1 into %ir.incdec.ptr3, !tbaa !2)
SU(8):   %5:gpr32 = LDRBBui %1:gpr64common, 3 :: (load 1 from %ir.incdec.ptr4, !tbaa !2)
SU(9):   STRBBui %5:gpr32, %0:gpr64common, 3 :: (store 1 into %ir.incdec.ptr5, !tbaa !2)
```

SU(8) only enters the ready queue after the bottom-up scheduler has scheduled SU(9), which consumes the output of SU(8).
Scheduling SU(8) just before SU(9) is very bad: there are no other instructions left to hide the latency of SU(8).



https://reviews.llvm.org/D45229




