[llvm-dev] mischeduler (pre-RA) experiments

Mon Nov 27 07:57:16 PST 2017

Hi,

On 23/11/2017 10:53, Jonas Paulsson via llvm-dev wrote:
> Hi,
> 
> I have been experimenting for a while with the tryCandidate() method of 
> the pre-RA mischeduler. I have by chance found some parameters that give 
> quite good results on benchmarks on SystemZ (on average 1% improvement, 
> some improvements of several percent and very few regressions). 
> Basically, I add a "latency heuristic boost" just above the processor 
> resources checking:
> 
> tryCandidate() {
>     ...
> 
>     // Avoid increasing the max pressure of the entire region.
>     if (DAG->isTrackingPressure() &&
>         tryPressure(TryCand.RPDelta.CurrentMax, Cand.RPDelta.CurrentMax,
>                     TryCand, Cand, RegMax, TRI, DAG->MF))
>       return;
> 
>     /// INSERTION POINT
> 
>     ...
> }
> 
> I had started to experiment with adding tryLatency() in various places, 
> and found this to be the best spot for SystemZ/SPEC-2006. This gave 
> noticeable improvements immediately that were too good to ignore, so I 
> started figuring out things about the regressions that of course also 
> showed up. Eventually, after many iterations, I came up with a combined 
> heuristic that reads:
> 
> if (((TryCand.Latency >= 7 && "Longest latency of any SU in DAG" < 15) ||
>       "Number of SUnits in DAG" > 180)
>       &&
>       tryLatency(TryCand, Cand, *Zone))
>          return;
> 
> In English: do tryLatency either if the latency of the candidate is >= 7 
> and the DAG has no really long latency SUs (lat >= 15), or alternatively 
> always if the DAG is really big (>180 SUnits).
> 
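
For reference, the quoted condition can be read as a standalone predicate. 
This is only a sketch of the gating logic, not the actual patch: the 
function and constant names are made up here, and the two quoted 
placeholders ("Longest latency of any SU in DAG", "Number of SUnits in 
DAG") are passed in as plain parameters. The thresholds are the ones 
quoted above.

```cpp
#include <cstddef>

// Thresholds from the quoted heuristic (tuned on SystemZ/SPEC-2006).
// The constant names are hypothetical.
const unsigned MinCandLatency = 7;        // candidate latency must be >= this
const unsigned LongLatencyLimit = 15;     // longest SU latency must be < this
const std::size_t BigRegionSUnits = 180;  // regions larger than this always qualify

// Returns true when tryLatency() would be consulted under the quoted
// condition. maxSULatency and numSUnits stand in for the two quoted
// placeholders; how they are computed is not shown here.
inline bool shouldTryLatency(unsigned candLatency, unsigned maxSULatency,
                             std::size_t numSUnits) {
  const bool bigRegion = numSUnits > BigRegionSUnits;
  const bool longCandInShortLatencyDAG =
      candLatency >= MinCandLatency && maxSULatency < LongLatencyLimit;
  return longCandInShortLatencyDAG || bigRegion;
}
```

So a candidate with latency 7 in a small DAG whose longest latency is 14 
qualifies, while the same candidate does not once some SU in the DAG has 
latency 15, unless the DAG has more than 180 SUnits.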

Thanks for those experiments! I made similar observations when trying to 
tune the scheduling heuristics for AArch64/ARM cores. For example, I put 
up a patch for review that makes scheduling for latency more aggressive: 
https://reviews.llvm.org/D38279. It gave +0.74% on the SPEC2017 score on 
Cortex-A57, but I have not pushed any further on this so far.

The thing I found is that, when deciding whether to schedule for latency 
during bottom-up scheduling, we use CurrZone.getCurrCycle() to get the 
number of issued cycles, which is then added to the remaining latency. 
Unless I am missing something, the cycle gets bumped by one after 
scheduling an instruction, regardless of its latency. It seems like 
CurrZone.getScheduledLatency() would more accurately represent the 
latency scheduled so far, but I may be overlooking something.
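
To illustrate the difference I mean, here is a toy model. It is not the 
real SchedBoundary class: ToyZone and its members are made-up names, it 
assumes one instruction issued per cycle, and it assumes the scheduled 
latency is tracked as the longest latency chain seen so far. It only 
shows how a per-instruction cycle counter and a latency-based measure 
can diverge once a long-latency instruction is scheduled.

```cpp
#include <algorithm>

// Toy stand-in for a scheduling zone (hypothetical, single-issue).
struct ToyZone {
  unsigned currCycle = 0;       // grows by one per scheduled instruction
  unsigned expectedLatency = 0; // longest latency chain scheduled so far

  // chainLatency: assumed latency (in cycles) of the dependence chain that
  // the newly scheduled instruction belongs to.
  void schedule(unsigned chainLatency) {
    ++currCycle; // bumped by one regardless of the instruction's latency
    expectedLatency = std::max(expectedLatency, chainLatency);
  }

  // Latency-based view: never smaller than the number of issued cycles.
  unsigned scheduledLatency() const {
    return std::max(expectedLatency, currCycle);
  }
};
```

With made-up numbers: after scheduling a SUB on a 1-cycle chain and then 
an SDIV on a 13-cycle chain, currCycle is 2 while scheduledLatency() is 
13, so adding the remaining latency to the former underestimates how 
much latency has already been committed.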

The test case I was looking into on AArch64 is the following, where the 
long-latency SDIV instruction was not scheduled as early as possible.

define hidden i32 @foo(i32 %a, i32 %b, i32 %c, i32* %d) local_unnamed_addr #0 {
entry:
   %xor = xor i32 %c, %b
   %ld = load i32, i32* %d
   %add = add nsw i32 %xor, %ld
   %div = sdiv i32 %a, %b
   %sub = sub i32 %div, %add
   ret i32 %sub
}

Cheers,
Florian