[llvm-dev] MachineScheduler not scheduling for latency

Jay Foad via llvm-dev llvm-dev at lists.llvm.org
Wed Sep 25 02:09:52 PDT 2019


On Fri, 13 Sep 2019 at 21:57, Andrew Trick <atrick at apple.com> wrote:
> > On Sep 10, 2019, at 7:32 AM, Jay Foad <jay.foad at gmail.com> wrote:
> >
> > Thanks for the explanations. Yes AMDGPU is in-order and has
> > MicroOpBufferSize = 1.
>
> Ok good. When MicroOpBufferSize==1, I would expect the Generic strategy to do something close to what you want.
>
> More background:
>
> MicroOpBufferSize==0 enables classic in-order scheduling, where reducing stalls takes priority at each step. This works via the traditional list-scheduling approach of leaving instructions in the pending queue until their latency is met.
>
> MicroOpBufferSize==1 allows for more nuanced heuristics. You can now pay to reduce register pressure by spending stall cycles. But, that doesn't mean the scheduler should totally ignore stall cycles.
>
> [Incidentally, I think controlling these scheduling modes with MicroOpBufferSize is a bit silly. It would be nice to have three explicit modes: vliw-ish, inorder-ish, out-of-order, and have MicroOpBufferSize just default to 1.]
>
> In this in-order mode, the scheduler does account for the stall cycles due to latency. Sometimes, that means the scheduler actually deprioritizes latency because it can now see that some latency can be hidden by other long latency ops.
>
> Let's step through the relevant prioritization in GenericScheduler::tryCandidate:
>
> 1. RegExcess: this means something is going to spill right here right now
>
> 2. RegCritical: this means that the overall register pressure exceeds
>    your target's limit. I think this is the reason for your "bad"
>    scheduling decision. You should check if this register pressure
>    heuristic is doing the right thing for you, and maybe fix that
>    instead. Or, if stalls are always more important than register
>    pressure, move to MicroOpBufferSize==0 or plugin your own strategy.

Yes I think you're basically right about the bad scheduling being
caused by #2. We have a fundamental tension between register pressure
and stalls. Both of them are always more important than the other :-)

> 3. PathReduce | DepthReduce via Rem.IsAcyclicLatencyLimited &&
>    tryLatency: I expect checkAcyclicLatency will probably work ok for
>    your machine model, but it only applies to single-block loops. It
>    looks like this particular case is not a loop?
>
> 4. Stall via getLatencyStallCycles: I expect this to generally avoid
>    the kind of latency-induced stalls that you are referring to. If
>    you aren't being caught by a register pressure problem above, then
>    this should handle your case.

Thanks. I will take a look at why #4 is not handling stalls in the way
I'd expect.

> 5. PathReduce | DepthReduce via TryCand.Policy.ReduceLatency: ensures
>    that latency hiding is preferred over source order. First,
>    shouldReduceLatency needs to return true; is that already the case
>    for you? Next, the result of shouldReduceLatency can be overridden
>    when it sees that the code is resource limited. This override
>    appears to be what you're proposing to fix...

Yes!

One of the effects I'm seeing is that small changes in the input cause
drastic changes in the resulting schedule. This makes life difficult
because innocuous changes elsewhere in the compiler can apparently
cause large performance regressions which we have to track down.

And one reason for this is that getOtherResourceCount /
checkResourceLimit decide whether the code is issue limited or not
based on the number of ops vs the length of the critical path, which
can easily flip-flop; and this overrides the result of
shouldReduceLatency, which affects whether tryCandidate calls
tryLatency, or falls back to source order.

Given that #5 is a last resort criterion before falling back to source
order, I don't understand why it ever makes sense to disable it.

> > Re "issue limited" and instruction groups: could it make sense to
> > disable the generic scheduler's detection of issue limitation on
> > in-order CPUs, or on CPUs that don't define instruction groups, or
> > some similar condition? Something like:
> >
> > --- a/lib/CodeGen/MachineScheduler.cpp
> > +++ b/lib/CodeGen/MachineScheduler.cpp
> > @@ -2062,10 +2062,13 @@ getOtherResourceCount(unsigned &OtherCritIdx) {
> >   if (!SchedModel->hasInstrSchedModel())
> >     return 0;
> >
> > -  unsigned OtherCritCount = Rem->RemIssueCount
> > -    + (RetiredMOps * SchedModel->getMicroOpFactor());
> > -  LLVM_DEBUG(dbgs() << "  " << Available.getName() << " + Remain MOps: "
> > -                    << OtherCritCount / SchedModel->getMicroOpFactor() << '\n');
> > +  unsigned OtherCritCount = 0;
> > +  if (some condition) {
> > +    OtherCritCount = Rem->RemIssueCount
> > +      + (RetiredMOps * SchedModel->getMicroOpFactor());
> > +    LLVM_DEBUG(dbgs() << "  " << Available.getName() << " + Remain MOps: "
> > +                      << OtherCritCount / SchedModel->getMicroOpFactor() << '\n');
> > +  }
> >   for (unsigned PIdx = 1, PEnd = SchedModel->getNumProcResourceKinds();
> >        PIdx != PEnd; ++PIdx) {
> >     unsigned OtherCount = getResourceCount(PIdx) + Rem->RemainingCounts[PIdx];
> >
> > As for "shouldReduceLatency should not be relevant at
> > MicroOpBufferSize = 1": are you suggesting that shouldReduceLatency
> > should effectively be changed to always return true on in-order CPUs?
> > Even with that change, latency comes pretty far down the list of
> > criteria tested in GenericScheduler::tryCandidate.
> >
> > Thanks,
> > Jay.
>
> I don't think that there should be any difference between issue limitations and other resources. Whether one is more "normal" than the other totally depends on the machine model.
>
> I also don't think it makes sense to always prioritize the critical path in in-order mode, because heuristic #4 above should already handle any stalls that occur as a result of the machine being in-order.
>
> -Andy

Thanks again,
Jay.
