[llvm-dev] MachineScheduler not scheduling for latency

Fri Sep 13 13:57:18 PDT 2019

> On Sep 10, 2019, at 7:32 AM, Jay Foad <jay.foad at gmail.com> wrote:
> 
> Thanks for the explanations. Yes AMDGPU is in-order and has
> MicroOpBufferSize = 1.

Ok good. When MicroOpBufferSize==1, I would expect the Generic strategy to do something close to what you want.

More background:

MicroOpBufferSize==0 enables classic in-order scheduling, where reducing stalls takes priority at each step. This works via the traditional list-scheduling approach of leaving instructions in the pending queue until their latency is met.

MicroOpBufferSize==1 allows for more nuanced heuristics. You can now pay to reduce register pressure by spending stall cycles. But, that doesn't mean the scheduler should totally ignore stall cycles.

[Incidentally, I think controlling these scheduling modes with MicroOpBufferSize is a bit silly. It would be nice to have three explicit modes: vliw-ish, inorder-ish, out-of-order, and have MicroOpBufferSize just default to 1.]

In this in-order mode, the scheduler does account for the stall cycles due to latency. Sometimes, that means the scheduler actually deprioritizes latency because it can now see that some latency can be hidden by other long latency ops.

Let's step through the relevant prioritization in GenericScheduler::tryCandidate:

1. RegExcess: this means something is going to spill right here right now

2. RegCritical: this means that the overall register pressure exceeds
   your target's limit. I think this is the reason for your "bad"
   scheduling decision. You should check if this register pressure
   heuristic is doing the right thing for you, and maybe fix that
   instead. Or, if stalls are always more important than register
   pressure, move to MicroOpBufferSize==0 or plugin your own strategy.

3. PathReduce | DepthReduce via Rem.IsAcyclicLatencyLimited &&
   tryLatency: I expect checkAcyclicLatency will probably work ok for
   your machine model, but it only applies to single-block loops. It
   looks like this particular case is not a loop?

4. Stall via getLatencyStallCycles: I expect this to generally avoid
   the kind of latency-induced stalls that you are referring to. If
   you aren't being caught be a register pressure problem above, then
   this should handle your case.

5. PathReduce | DepthReduce via TryCand.Policy.ReduceLatency: ensures
   that latency hiding is preferred over source order. First,
   shouldReduceLatency needs to return true; is that already the case
   for you? Next, the result of shouldReduceLatency can't be overridden
   when it sees that the code is resource limited. This override
   appears to be what you're proposing to fix...

> Re "issue limited" and instruction groups: could it make sense to
> disable the generic scheduler's detection of issue limitation on
> in-order CPUs, or on CPUs that don't define instruction groups, or
> some similar condition? Something like:
> 
> --- a/lib/CodeGen/MachineScheduler.cpp
> +++ b/lib/CodeGen/MachineScheduler.cpp
> @@ -2062,10 +2062,13 @@ getOtherResourceCount(unsigned &OtherCritIdx) {
>   if (!SchedModel->hasInstrSchedModel())
>     return 0;
> 
> -  unsigned OtherCritCount = Rem->RemIssueCount
> -    + (RetiredMOps * SchedModel->getMicroOpFactor());
> -  LLVM_DEBUG(dbgs() << "  " << Available.getName() << " + Remain MOps: "
> -                    << OtherCritCount /
> SchedModel->getMicroOpFactor() << '\n');
> +  unsigned OtherCritCount = 0;
> +  if (some condition) {
> +    OtherCritCount = Rem->RemIssueCount
> +      + (RetiredMOps * SchedModel->getMicroOpFactor());
> +    LLVM_DEBUG(dbgs() << "  " << Available.getName() << " + Remain MOps: "
> +                      << OtherCritCount /
> SchedModel->getMicroOpFactor() << '\n');
> +  }
>   for (unsigned PIdx = 1, PEnd = SchedModel->getNumProcResourceKinds();
>        PIdx != PEnd; ++PIdx) {
>     unsigned OtherCount = getResourceCount(PIdx) + Rem->RemainingCounts[PIdx];
> 
> As for "shouldReduceLatency should not be relevant at
> MicroOpBufferSize = 1": are you suggesting that shouldReduceLatency
> should effectively be changed to always return true on in-order CPUs?
> Even with that change, latency comes pretty far down the list of
> criteria tested in GenericScheduler::tryCandidate.
> 
> Thanks,
> Jay.

I don't think that there should be any difference between issue limitations and other resources. Whether one is more "normal" than the other totally depends on the machine model.

I also don't think it makes sense to always prioritize the critical path in in-order mode, because heuristic #4 above should already handle any stalls that occur as a result of the machine being in-order.

-Andy

> 
> On Mon, 9 Sep 2019 at 19:36, Andrew Trick <atrick at apple.com> wrote:
>> 
>> 
>>> On Sep 9, 2019, at 4:22 AM, Jay Foad <jay.foad at gmail.com> wrote:
>>> 
>>> Hi,
>>> 
>>> I'm trying to understand why MachineScheduler does a poor job in
>>> straight line code in cases like the one in the attached debug dump.
>>> This is on AMDGPU, an in-order target, and the problem is that the
>>> IMAGE_SAMPLE instructions have very high (80 cycle) latency, but in
>>> the resulting schedule they are often placed right next to their uses
>>> like this:
>>> 
>>> 1784B     %140:vgpr_32 = IMAGE_SAMPLE_LZ_V1_V2 %533:vreg_64,
>>> %30:sreg_256, %26:sreg_128, 8, 0, 0, 0, 0, 0, 0, 0, 0, implicit $exec
>>> :: (dereferenceable load 4 from custom TargetCustom8)
>>> 1792B     %142:vgpr_32 = V_MUL_F32_e32 %44:sreg_32, %140:vgpr_32, implicit $exec
>>> ...
>>> 1784B     %140:vgpr_32 = IMAGE_SAMPLE_LZ_V1_V2 %533:vreg_64,
>>> %30:sreg_256, %26:sreg_128, 8, 0, 0, 0, 0, 0, 0, 0, 0, implicit $exec
>>> :: (dereferenceable load 4 from custom TargetCustom8)
>>> 1792B     %142:vgpr_32 = V_MUL_F32_e32 %44:sreg_32, %140:vgpr_32, implicit $exec
>>> 
>>> This can be improved slightly in post-ra scheduling, but not much. The
>>> post-ra scheduler simply doesn't have enough freedom to move
>>> instructions around once physical registers have been assigned, so I
>>> contend that MachineScheduler needs to consider latency.
>>> 
>>> I've looked at what's going on in the debugger and the problem seems
>>> to be that GenericSchedulerBase::setPolicy does not set
>>> Policy.ReduceLatency because it thinks that the other zone (Top in
>>> this case) is issue limited. There are lots of things I don't
>>> understand here:
>>> 
>>> 1. "Issue limited" seems to mean that the number of instructions is
>>> greater than the length of the critical path. I don't really
>>> understand why this is an interesting criterion. It seems to me like a
>>> fairly normal state of affairs.
>>> 2. Why does the fact that it's issue limited mean that it's a good
>>> idea to turn off the latency heuristic? Moving instructions around
>>> can't really change whether the function is issue limited or not, but
>>> it can definitely improve latency problems.
>>> 3. Why do we completely turn off the latency heuristic, rather than
>>> just letting it take its usual (very low) priority in
>>> GenericScheduler::tryCandidate?
>>> 4. Stepping back a bit, is MachineScheduler really trying to consider
>>> latency at all in pre-ra mode? I see a comment that it "Schedules
>>> aggressively for latency in PostRA mode", but what about pre-ra?
>>> 
>>> Of course we can and do override some of the generic logic in our
>>> target, in lib/Target/AMDGPU/GCNSchedStrategy.cpp, but before going
>>> further down that route I'd like to try to understand the intent of
>>> the generic logic.
>>> 
>>> Thanks for any insights,
>>> Jay.
>>> <machine-scheduler.txt.gz>
>> 
>> 
>> Hi Jay,
>> 
>> General disclaimer: The GenericScheduler is intended to exercise the key features that cover most of what backends will need, and have heuristics that are good enough for a typical out-of-order CPU. The intent was always that backend developers who care enough about scheduling will plugin a target-specific MachineSchedulingStrategy. The GenericScheduler also tends to take a "do no harm" approach and leave things in source order until it can determine that scheduling will help.
>> 
>> That said...
>> 
>> Yes, pre-RA is definitely the place to hide latency, but GenericScheduler currently does it very conservatively when MicroOpBufferSize > 1. Backends that need aggressive latency hiding will set MicroOpBufferSize = 1. I would expect you to be doing that for your target.
>> 
>> One reason latency is suppressed/ignored at times with MicroOpBufferSize > 1 is that some loops make good use of registers when left in source order, but exhibit spill code as soon as the scheduler starts shuffling things.
>> 
>> "Issue limited" means that forming instructions groups is on average more important than reducing the critical path. IsResourceLimited is similar but across all modeled processor resources. As you said, that doesn't imply that latency is irrelevant. That's probably being controlled by 'shouldReduceLatency', which looks to me like it suppresses latency hiding until the critical path becomes relevant. My guess is that it's extremely conservative about latency hiding because defaulting to latency hiding can get you backed into a corner with register pressure.
>> 
>> Again, shouldReduceLatency should not be relevant at MicroOpBufferSize = 1, because the scheduler should prioritize issue stalls over other heuristics.
>> 
>> Looking at the key decision point in your trace, I see the scheduler is being driven by the `REG-CRIT` heuristic. So, it thinks that failing to scheduled the use-def together would increase pressure in a critical register class.
>> 
>> There's no expectation that the GenericScheduler heuristics are anywhere near as good as they can be over the universe of workloads, but changing those heuristiscs means checking in with current backend maintainers to be sure you're not regressing their benchmarks in the process. I haven't touched these heuristcs in 6 years, but several others have had a hand in it recently.
>> 
>> -Andy
>>