[LLVMdev] specializing hybrid_ls_rr_sort (was: Re: Bottom-Up Scheduling?)

Tue Dec 20 10:29:26 PST 2011

On Tue, 2011-12-20 at 10:35 -0600, Hal Finkel wrote:
> On Mon, 2011-12-19 at 23:20 -0800, Andrew Trick wrote:
> > 
> > On Dec 19, 2011, at 10:53 PM, Hal Finkel wrote:
> > 
> > > Here's my "thought experiment" (from PR11589): I have a bunch of
> > > load-fadd-store chains to schedule. A store takes two cycles to
> > > clear
> > > its last pipeline stage. The fadd takes longer to compute its result
> > > (say 5 cycles), but can sustain a rate of 1 independent add per
> > > cycle.
> > > As the scheduling is bottom-up, it will schedule a store, then it
> > > has a
> > > choice: it can schedule another store (at a 1 cycle penalty), or it
> > > can
> > > schedule the fadd associated with the store it just scheduled (with
> > > a 4
> > > cycle penalty due to operand latency). It seems that the current
> > > hybrid
> > > scheduler will choose the fadd, I want a scheduler that will make
> > > the
> > > opposite choice.
> > 
> > That's just wrong. You may need to look at -debug-only=pre-RA-sched
> > and debug your itinerary.
> 
> Andy, I've already looked at the debug output quite a bit; please help
> me understand what I'm missing...
> 
> First, looking at the code does seem to confirm my suspicion. This is
> certainly is low-pressure mode, and so hybrid_ls_rr_sort::operator()
> will return the result of BUCompareLatency. That function first checks
> for stalls and returns 1 or -1. Only after that does it look at the
> relative latencies.

Looking at this more carefully, I think that I see the problem. The
heights are set to account for the latencies:
PredSU->setHeightToAtLeast(SU->getHeight() + PredEdge->getLatency());

but the latencies are considered only if the node as an ILP scheduling
preference (the default in TargetLowering.h is None):
  bool LStall = (!checkPref || left->SchedulingPref == Sched::ILP) &&
    BUHasStall(left, LHeight, SPQ);
...

and the PPC backend does not override getSchedulingPreference.

 -Hal

> 
> In addition, the stall computation is done using BUHasStall, and that
> function only checks the current cycle. Without looking forward, I don't
> understand how it could know how long the pipeline hazard will last.
> 
> It looks like this may have something to do with the height. Can you
> explain how that is supposed to work?
> 
> For the specific example: We start with the initial store...
> 
> GPRC: 4 / 31
> F4RC: 1 / 31
> 
> Examining Available:
> Height 2: SU(102): 0x2c03f70: ch = STFSX 0x2c03c70, 0x2bf3910,
> 0x2c03870, 0x2c03e70<Mem:ST4[%arrayidx6.14](align=8)(tbaa=!"float")>
> [ORD=94] [ID=102]
> 
> Height 2: SU(97): 0x2c03470: ch = STFSX 0x2c03170, 0x2bf3910, 0x2c02c60,
> 0x2c03370<Mem:ST4[%arrayidx6.13](tbaa=!"float")> [ORD=88] [ID=97]
> 
> Height 2: SU(92): 0x2c02860: ch = STFSX 0x2c02560, 0x2bf3910, 0x2c02160,
> 0x2c02760<Mem:ST4[%arrayidx6.12](align=16)(tbaa=!"float")> [ORD=82]
> [ID=92]
> 
> Height 2: SU(90): 0x2c01c50: ch = STFSX 0x2c01950, 0x2bf3910, 0x2c01550,
> 0x2c01b50<Mem:ST4[%arrayidx6.11](tbaa=!"float")> [ORD=76] [ID=90]
> 
> Height 18: SU(85): 0x2c01150: ch = STFSX 0x2c00d40, 0x2bf3910,
> 0x2c00940, 0x2c00f40<Mem:ST4[%arrayidx6.10](align=8)(tbaa=!"float")>
> [ORD=70] [ID=85]
> 
> *** Scheduling [21]: SU(102): 0x2c03f70: ch = STFSX 0x2c03c70,
> 0x2bf3910, 0x2c03870, 0x2c03e70<Mem:ST4[%
> arrayidx6.14](align=8)(tbaa=!"float")> [ORD=94] [ID=102]
> 
> then it schedules a "token factor" that is attached to the address
> computation required by the store (this is essentially a no-op,
> right?)...
> 
> GPRC: 5 / 31
> F4RC: 2 / 31
> 
> Examining Available:
> Height 21: SU(5): 0x2c03e70: ch = TokenFactor 0x2c00c40:1, 0x2c03a70
> [ORD=94] [ID=5]
> 
> Height 24: SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70, 0x2bf3710 [ORD=92]
> [ID=105]
> 
> Height 2: SU(97): 0x2c03470: ch = STFSX 0x2c03170, 0x2bf3910, 0x2c02c60,
> 0x2c03370<Mem:ST4[%arrayidx6.13](tbaa=!"float")> [ORD=88] [ID=97]
> 
> Height 2: SU(92): 0x2c02860: ch = STFSX 0x2c02560, 0x2bf3910, 0x2c02160,
> 0x2c02760<Mem:ST4[%arrayidx6.12](align=16)(tbaa=!"float")> [ORD=82]
> [ID=92]
> 
> Height 2: SU(90): 0x2c01c50: ch = STFSX 0x2c01950, 0x2bf3910, 0x2c01550,
> 0x2c01b50<Mem:ST4[%arrayidx6.11](tbaa=!"float")> [ORD=76] [ID=90]
> 
> Height 18: SU(85): 0x2c01150: ch = STFSX 0x2c00d40, 0x2bf3910,
> 0x2c00940, 0x2c00f40<Mem:ST4[%arrayidx6.10](align=8)(tbaa=!"float")>
> [ORD=70] [ID=85]
> 
> *** Scheduling [21]: SU(5): 0x2c03e70: ch = TokenFactor 0x2c00c40:1,
> 0x2c03a70 [ORD=94] [ID=5]
> 
> how here is the choice that we may want to be different...
> 
> GPRC: 5 / 31
> F4RC: 2 / 31
> 
> Examining Available:
> Height 24: SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70, 0x2bf3710 [ORD=92]
> [ID=105]
> 
> Height 2: SU(97): 0x2c03470: ch = STFSX 0x2c03170, 0x2bf3910, 0x2c02c60,
> 0x2c03370<Mem:ST4[%arrayidx6.13](tbaa=!"float")> [ORD=88] [ID=97]
> 
> Height 2: SU(92): 0x2c02860: ch = STFSX 0x2c02560, 0x2bf3910, 0x2c02160,
> 0x2c02760<Mem:ST4[%arrayidx6.12](align=16)(tbaa=!"float")> [ORD=82]
> [ID=92]
> 
> Height 2: SU(90): 0x2c01c50: ch = STFSX 0x2c01950, 0x2bf3910, 0x2c01550,
> 0x2c01b50<Mem:ST4[%arrayidx6.11](tbaa=!"float")> [ORD=76] [ID=90]
> 
> Height 18: SU(85): 0x2c01150: ch = STFSX 0x2c00d40, 0x2bf3910,
> 0x2c00940, 0x2c00f40<Mem:ST4[%arrayidx6.10](align=8)(tbaa=!"float")>
> [ORD=70] [ID=85]
> 
> (with more debug turned on, I also see a bunch of messages like:
> *** Hazard in cycle 3, SU(97): xxx: ch = STFSX ...<Mem:ST4[%
> arrayidx6.13](tbaa=!"float")> [ORD=88] [ID=97]
> one of these for each of the other possible stores).
> 
> *** Scheduling [24]: SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70,
> 0x2bf3710 [ORD=92] [ID=105]
> 
> why did it choose this fadd over any of the other stores? the
> corresponding unit descriptions are:
> 
> SU(102): 0x2c03f70: ch = STFSX 0x2c03c70, 0x2bf3910, 0x2c03870,
> 0x2c03e70<Mem:ST4[%arrayidx6.14](align=8)(tbaa=!"float")> [ORD=94]
> [ID=102]
> 
>   # preds left       : 4
>   # succs left       : 1
>   # rdefs left       : 0
>   Latency            : 7
>   Depth              : 0
>   Height             : 0
>   Predecessors:
>    val #0x2c11ff0 - SU(105): Latency=3
>    val #0x2c0cdd0 - SU(32): Latency=1
>    val #0x2c11db0 - SU(103): Latency=1
>    ch  #0x2c0af70 - SU(5): Latency=0
>   Successors:
>    ch  #0x2c0ac10 - SU(2): Latency=1
> 
> SU(105): 0x2c03c70: f32 = FADDS 0x2c03b70, 0x2bf3710 [ORD=92] [ID=105]
> 
>   # preds left       : 2
>   # succs left       : 1
>   # rdefs left       : 1
>   Latency            : 11
>   Depth              : 0
>   Height             : 0
>   Predecessors:
>    val #0x2c12110 - SU(106): Latency=6
>    val #0x2c0d130 - SU(35): Latency=6
>   Successors:
>    val #0x2c11c90 - SU(102): Latency=3
> 
> Just from the debugging messages, it looks like what is happening is
> that the scheduler is first rejecting the other stores because of
> pipeline hazards and then picking the instruction with the lowest
> latency. Looking at the code, it seems that this is exactly what it was
> designed to do. If I'm wrong about that, please explain.
> 
> Thanks in advance,
> Hal
> 
> > 
> > 
> > -Andy
> 

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory