[PATCH] D45229: [MI-sched] schedule following instruction latencies
Sebastian Pop via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Apr 4 11:34:15 PDT 2018
sebpop added a comment.
In https://reviews.llvm.org/D45229#1056487, @MatzeB wrote:
> - Are you sure you are using a good scheduling model?
I have seen badly scheduled code for -mcpu=cortex-a57 and the exynos-m* tunings.
Curiously cortex-a53 has the best code after inst-combine manages to fuse all the byte ld/st.
Evandro will be investigating why this does not happen for other -mcpu tunings.
> In my experience real world loses all too fast when additional latency heuristics fight against register pressure. (See also the code around tryLatency().)
Thanks for the pointer to tryLatency(): the good news is that the function
contains exactly the code I wrote in this patch.
tryLatency() is not executed because TryCand.Policy.ReduceLatency is false:
it is only set in the following code, so the problem may be in this logic:
```
// Schedule aggressively for latency in PostRA mode. We don't check for
// acyclic latency during PostRA, and highly out-of-order processors will
// skip PostRA scheduling.
if (IsPostRA || (RemLatency + CurrZone.getCurrCycle() > Rem.CriticalPath)) {
  Policy.ReduceLatency |= true;
```
At the beginning of scheduling we are not on the critical path, so ReduceLatency is not set.
We then schedule an instruction without looking at its latency, which introduces a bubble
that exceeds the critical path and consequently forces ReduceLatency on
for the remaining instructions to be scheduled.
In https://reviews.llvm.org/D45229#1056570, @javed.absar wrote:
> Hi Sebastian:
> For the test case, the change looks good, though I am surprised the current scheduling algorithm does that (i.e. scheduling store just one cycle after load which definitely would result in stalls, especially for in-order processors).
See the comment that I added inline to the place that falls through to the original scheduling order.
================
Comment at: llvm/lib/CodeGen/MachineScheduler.cpp:3018
```
    || (!Zone->isTop() && TryCand.SU->NodeNum > Cand.SU->NodeNum)) {
  TryCand.Reason = NodeOrder;
}
```
----------------
If none of the heuristics above causes an early return, this falls through to picking the instruction with the smallest SU number in top-down scheduling, and the largest SU number in bottom-up scheduling.
In the testcase compiled with -mcpu=cortex-a57, the load at SU(8) is picked by the bottom-up scheduler before the other roots at SU(3), SU(5), and SU(7) for this "ORDER" reason.
```
*** Final schedule for %bb.0 ***
SU(0): %1:gpr64common = COPY $x1
SU(1): %0:gpr64common = COPY $x0
SU(2): %2:gpr32 = LDRBBui %1:gpr64common, 0 :: (load 1 from %ir.out, !tbaa !2)
SU(4): %3:gpr32 = LDRBBui %1:gpr64common, 1 :: (load 1 from %ir.incdec.ptr, !tbaa !2)
SU(6): %4:gpr32 = LDRBBui %1:gpr64common, 2 :: (load 1 from %ir.incdec.ptr2, !tbaa !2)
SU(3): STRBBui %2:gpr32, %0:gpr64common, 0 :: (store 1 into %ir.in, !tbaa !2)
SU(5): STRBBui %3:gpr32, %0:gpr64common, 1 :: (store 1 into %ir.incdec.ptr1, !tbaa !2)
SU(7): STRBBui %4:gpr32, %0:gpr64common, 2 :: (store 1 into %ir.incdec.ptr3, !tbaa !2)
SU(8): %5:gpr32 = LDRBBui %1:gpr64common, 3 :: (load 1 from %ir.incdec.ptr4, !tbaa !2)
SU(9): STRBBui %5:gpr32, %0:gpr64common, 3 :: (store 1 into %ir.incdec.ptr5, !tbaa !2)
```
SU(8) gets into the ready queue only after the bottom-up scheduler has scheduled SU(9), which consumes the output of SU(8).
Scheduling SU(8) immediately before SU(9) is very bad: there are no other instructions to hide the latency of SU(8).
https://reviews.llvm.org/D45229