[llvm-dev] mischeduler (pre-RA) experiments
Florian Hahn via llvm-dev
llvm-dev at lists.llvm.org
Mon Nov 27 07:57:16 PST 2017
Hi,
On 23/11/2017 10:53, Jonas Paulsson via llvm-dev wrote:
> Hi,
>
> I have been experimenting for a while with the tryCandidate() method of the
> pre-RA mischeduler. I have by chance found some parameters that give
> quite good results on benchmarks on SystemZ (on average 1% improvement,
> some improvements of several percent and very few regressions).
> Basically, I add a "latency heuristic boost" just above processor
> resources checking:
>
> tryCandidate() {
>   ...
>
>   // Avoid increasing the max pressure of the entire region.
>   if (DAG->isTrackingPressure() &&
>       tryPressure(TryCand.RPDelta.CurrentMax,
>                   Cand.RPDelta.CurrentMax, TryCand, Cand, RegMax, TRI, DAG->MF))
>     return;
>
>   /// INSERTION POINT
>
>   ...
> }
>
> I had started to experiment with adding tryLatency() in various places,
> and found this to be the best spot for SystemZ/SPEC-2006. This gave
> noticeable improvements immediately that were too good to ignore, so I
> started figuring out things about the regressions that of course also
> showed up. Eventually, after many iterations, I came up with a combined
> heuristic that reads:
>
>   if (((TryCand.Latency >= 7 && "Longest latency of any SU in DAG" < 15) ||
>        "Number of SUnits in DAG" > 180)
>       &&
>       tryLatency(TryCand, Cand, *Zone))
>     return;
>
> In English: do tryLatency either if the latency of the candidate is >= 7
> and the DAG has no really long latency SUs (lat > 15), or alternatively
> always if the DAG is really big (>180 SUnits).
>
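For readers who want to play with the thresholds, the gating condition quoted above can be written out as a small standalone predicate. This is only a sketch: `DagSummary` and `shouldTryLatencyBoost` are hypothetical names that do not exist in LLVM, and the two DAG-wide values are assumed to have been computed beforehand. It follows the quoted code (`< 15`) rather than the English summary, which reads slightly differently at the boundary.

```cpp
// Hypothetical standalone model of the quoted gating condition.
// None of these names exist in LLVM; they only mirror the pseudo-code.
struct DagSummary {
  unsigned MaxSULatency; // "Longest latency of any SU in DAG"
  unsigned NumSUnits;    // "Number of SUnits in DAG"
};

// Returns true when the extra tryLatency() comparison would be attempted.
// The thresholds 7, 15 and 180 are the ones from the quoted SystemZ
// experiment and would likely need retuning for other targets.
inline bool shouldTryLatencyBoost(unsigned CandLatency, const DagSummary &DAG) {
  // Case 1: a fairly long-latency candidate in a DAG without very long
  // latency SUs.
  const bool LongCandInModestDag =
      CandLatency >= 7 && DAG.MaxSULatency < 15;
  // Case 2: the DAG is really big, regardless of the candidate's latency.
  const bool HugeDag = DAG.NumSUnits > 180;
  return LongCandInModestDag || HugeDag;
}
```

In the real scheduler the result would only decide whether tryLatency() is consulted at the insertion point; the comparison between the two candidates itself is unchanged.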
Thanks for those experiments! I made similar observations when trying to
tune the scheduling heuristics for AArch64/ARM cores. For example, I put
this patch up for review, which makes scheduling for latency more
aggressive: https://reviews.llvm.org/D38279. It gave +0.74% on the SPEC2017
score on Cortex-A57. But I have not pushed any further on this so far.

What I found is that when deciding whether to schedule for latency
during bottom-up scheduling, we use CurrZone.getCurrCycle() to get the
number of issued cycles, which is then added to the remaining latency.
Unless I miss something, the cycle gets bumped by one after scheduling
an instruction, regardless of its latency. It seems like
CurrZone.getScheduledLatency() would more accurately represent the
latency scheduled so far, but I am probably missing something.

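To make the difference between the two progress measures concrete, here is a toy model. It is not LLVM's SchedBoundary and does not reproduce what getCurrCycle() or getScheduledLatency() compute; `ToyProgress` and `scheduleChain` are made-up names. It assumes one instruction is issued per cycle and that the scheduled instructions form a single dependence chain, so their latencies simply add up.

```cpp
#include <vector>

// Toy model only, under the assumptions stated above.
struct ToyProgress {
  unsigned IssuedCycles = 0; // grows by 1 per scheduled instruction
  unsigned ChainLatency = 0; // grows by the scheduled instruction's latency
};

// Schedules the given chain one instruction at a time and tracks both
// measures of how far scheduling has progressed.
inline ToyProgress scheduleChain(const std::vector<unsigned> &Latencies) {
  ToyProgress P;
  for (unsigned Lat : Latencies) {
    P.IssuedCycles += 1;   // bumped by one, regardless of latency
    P.ChainLatency += Lat; // reflects the latency actually covered
  }
  return P;
}
```

With hypothetical latencies {1, 12, 1} (a divide between two single-cycle operations), the cycle counter reads 3 while the chain latency is 14, so an estimate built on the cycle counter would make the schedule look much shorter than the chain really is.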
The test case I was looking into on AArch64 was the following, where the
long-latency instruction SDIV was not scheduled as early as possible:

define hidden i32 @foo(i32 %a, i32 %b, i32 %c, i32* %d) local_unnamed_addr #0 {
entry:
  %xor = xor i32 %c, %b
  %ld = load i32, i32* %d
  %add = add nsw i32 %xor, %ld
  %div = sdiv i32 %a, %b
  %sub = sub i32 %div, %add
  ret i32 %sub
}

Cheers,
Florian