[llvm-dev] Fwd: MachineScheduler not scheduling for latency

Mon Sep 9 04:22:40 PDT 2019

Hi,

I'm trying to understand why MachineScheduler does a poor job on
straight-line code in cases like the one in the attached debug dump.
This is on AMDGPU, an in-order target, and the problem is that the
IMAGE_SAMPLE instructions have very high (80 cycle) latency, but in
the resulting schedule they are often placed right next to their uses
like this:

1784B     %140:vgpr_32 = IMAGE_SAMPLE_LZ_V1_V2 %533:vreg_64,
            %30:sreg_256, %26:sreg_128, 8, 0, 0, 0, 0, 0, 0, 0, 0, implicit $exec
            :: (dereferenceable load 4 from custom TargetCustom8)
1792B     %142:vgpr_32 = V_MUL_F32_e32 %44:sreg_32, %140:vgpr_32, implicit $exec
...

This can be improved slightly in post-ra scheduling, but not much. The
post-ra scheduler simply doesn't have enough freedom to move
instructions around once physical registers have been assigned, so I
contend that MachineScheduler needs to consider latency.
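
To put a rough number on the cost of placing a sample right next to its
use, here is a toy in-order model. It is only a sketch, not LLVM or AMDGPU
code: it assumes one instruction issues per cycle, that an instruction
stalls until all of its operands are ready, and it reuses the 80-cycle
figure quoted above. All names (Instr, issueCycles, adjacentPairs,
samplesFirst) are made up for the illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical toy instruction: a result latency plus the indices of the
// earlier instructions in the sequence whose results it reads.
struct Instr {
  int Latency;
  std::vector<int> Uses;
};

// Returns the cycle count needed to issue the whole sequence in order,
// assuming one issue per cycle and a stall until every operand is ready.
int issueCycles(const std::vector<Instr> &Seq) {
  std::vector<int> Ready(Seq.size(), 0); // cycle at which each result is ready
  int Cycle = 0;                         // earliest cycle for the next issue
  for (std::size_t I = 0; I < Seq.size(); ++I) {
    int Issue = Cycle;
    for (int U : Seq[I].Uses)
      Issue = std::max(Issue, Ready[U]);
    Ready[I] = Issue + Seq[I].Latency;
    Cycle = Issue + 1;
  }
  return Cycle;
}

// N (sample, multiply) pairs, each multiply directly after its sample.
std::vector<Instr> adjacentPairs(int N) {
  std::vector<Instr> Seq;
  for (int I = 0; I < N; ++I) {
    Seq.push_back(Instr{80, {}});      // sample, assumed 80-cycle latency
    Seq.push_back(Instr{1, {2 * I}});  // multiply reading that sample
  }
  return Seq;
}

// The same N pairs, with all samples issued before any multiply.
std::vector<Instr> samplesFirst(int N) {
  std::vector<Instr> Seq;
  for (int I = 0; I < N; ++I)
    Seq.push_back(Instr{80, {}});
  for (int I = 0; I < N; ++I)
    Seq.push_back(Instr{1, {I}});
  return Seq;
}
```

Under these assumptions, four adjacent pairs take 324 cycles to issue,
while issuing the four samples first takes 84, because the sample
latencies overlap instead of being paid one after another.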

I've looked at what's going on in the debugger and the problem seems
to be that GenericSchedulerBase::setPolicy does not set
Policy.ReduceLatency because it thinks that the other zone (Top in
this case) is issue limited. There are lots of things I don't
understand here:

1. "Issue limited" seems to mean that the number of instructions is
greater than the length of the critical path. I don't really
understand why this is an interesting criterion. It seems to me like a
fairly normal state of affairs.
2. Why does the fact that it's issue limited mean that it's a good
idea to turn off the latency heuristic? Moving instructions around
can't really change whether the function is issue limited or not, but
it can definitely improve latency problems.
3. Why do we completely turn off the latency heuristic, rather than
just letting it take its usual (very low) priority in
GenericScheduler::tryCandidate?
4. Stepping back a bit, is MachineScheduler really trying to consider
latency at all in pre-ra mode? I see a comment that it "Schedules
aggressively for latency in PostRA mode", but what about pre-ra?
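
The reading of the generic logic behind items 1 to 3 can be sketched as
follows. This is a deliberately simplified illustration, not the actual
MachineScheduler code: RegionSummary, Candidate, isIssueLimited and
pickBefore are hypothetical names, the "issue limited" test is only the
interpretation given in item 1, and the real tryCandidate has many more
heuristics than the three compared here.

```cpp
// Hypothetical summary of a scheduling zone; not an LLVM type.
struct RegionSummary {
  unsigned NumInstrs;          // instructions still to be scheduled
  unsigned CriticalPathCycles; // longest latency chain through the zone
};

// "Issue limited" as interpreted in item 1: more instructions than the
// critical path is long.
bool isIssueLimited(const RegionSummary &R) {
  return R.NumInstrs > R.CriticalPathCycles;
}

// Hypothetical scheduling candidate with three properties to compare.
struct Candidate {
  int PressureDelta;    // register pressure change; lower is better
  unsigned Height;      // latency from this node to the end of the region;
                        // larger is more urgent when scheduling top-down
  unsigned SourceOrder; // position in the original instruction order
};

// Returns true if A should be picked ahead of B. Latency is a low-priority
// tie-breaker and is consulted only when ReduceLatency is set.
bool pickBefore(const Candidate &A, const Candidate &B, bool ReduceLatency) {
  if (A.PressureDelta != B.PressureDelta)
    return A.PressureDelta < B.PressureDelta;
  if (ReduceLatency && A.Height != B.Height)
    return A.Height > B.Height;
  return A.SourceOrder < B.SourceOrder;
}
```

In this sketch, a zone with 200 instructions and a 90-cycle critical path
counts as issue limited. If that clears ReduceLatency, then a sample with
a large Height and a cheap ALU op with equal pressure fall straight
through to source order, which would be consistent with the schedule in
the dump. With ReduceLatency set, the sample would be picked first.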

Of course we can and do override some of the generic logic in our
target, in lib/Target/AMDGPU/GCNSchedStrategy.cpp, but before going
further down that route I'd like to try to understand the intent of
the generic logic.

Thanks for any insights,
Jay.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: machine-scheduler.txt.gz
Type: application/gzip
Size: 62712 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190909/08853269/attachment-0001.bin>