[LLVMdev] Question about load clustering in the machine scheduler

Fri Mar 27 07:52:04 PDT 2015

On Thu, Mar 26, 2015 at 11:50:20PM -0700, Andrew Trick wrote:
> 
> > On Mar 26, 2015, at 7:36 PM, Tom Stellard <tom at stellard.net> wrote:
> > 
> > Hi,
> > 
> > I have a program with over 100 loads (each with a 10 cycle latency)
> > at the beginning of the program, and I can't figure out how to get
> > the machine scheduler to intermix ALU instructions with the loads to
> > effectively hide the latency.
> > 
> > It seems the issue is with load clustering.  I restrict load clustering
> > to 4 at a time, but when I look at the debug output, the loads are
> > always being scheduled based on the fact that they are clustered, e.g.
> > 
> > Pick Top CLUSTER
> > Scheduling SU(10) %vreg13<def> = S_BUFFER_LOAD_DWORD_IMM %vreg9, 4; mem:LD4[<unknown>] SGPR_32:%vreg13 SReg_128:%vreg9
> 
> Well, only 4 loads in a sequence should have the “cluster” edges. You should be able to see that when the DAG is printed before scheduling.
> 

There are 4 consecutive 'Pick Top CLUSTER' then a 'Pick Top WEAK' and
then the pattern repeats itself.  All of these are loads.
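
To illustrate the pattern, here is a minimal C++ sketch of how a cluster
limit would cap each chain of neighboring loads before a new chain begins.
This is a toy model only, not the scheduler's DAG mutation: chainLoads is a
hypothetical helper, and the loads are represented by plain byte offsets that
are assumed to be sorted already.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an LLVM API): split loads, given as byte offsets
// that are already sorted, into chains holding at most Limit loads each.
std::vector<std::vector<unsigned>>
chainLoads(const std::vector<unsigned> &Offsets, std::size_t Limit) {
  assert(Limit > 0 && "a chain must be allowed to hold at least one load");
  std::vector<std::vector<unsigned>> Chains;
  for (unsigned Off : Offsets) {
    // Start a new chain when there is none yet or the current one is full.
    if (Chains.empty() || Chains.back().size() == Limit)
      Chains.emplace_back();
    Chains.back().push_back(Off);
  }
  return Chains;
}
```

With a limit of 4, ten loads at offsets 0, 4, ..., 36 end up in chains of
4, 4 and 2 loads, so every fifth load starts a fresh chain.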

> Even without that limit, stalls take precedence over load clustering. So when you run out of load resources (15?) the scheduler should choose something else.
> 

Is this the code that checks for stalls?

  if (tryLess(Zone.getLatencyStallCycles(TryCand.SU),
              Zone.getLatencyStallCycles(Cand.SU), TryCand, Cand, Stall))

It is disabled if (!SU->isUnbuffered)
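
For reference, my reading of that check can be written as a small
self-contained C++ sketch.  ToySUnit, latencyStallCycles and preferTry are
hypothetical stand-ins for SUnit, getLatencyStallCycles and the tryLess call
above, and the ready-cycle arithmetic is my assumption about what the helper
computes, so treat this as an approximation rather than the actual code.

```cpp
#include <cassert>

// Simplified stand-in for the scheduler's SUnit; only the two fields needed
// for the stall computation are modeled.
struct ToySUnit {
  bool IsUnbuffered;   // mirrors the SU->isUnbuffered flag
  unsigned ReadyCycle; // first cycle at which the node can issue without stalling
};

// Cycles the zone would stall if SU were issued at CurrCycle.
// Nodes that are not marked unbuffered always report zero stall cycles.
unsigned latencyStallCycles(const ToySUnit &SU, unsigned CurrCycle) {
  if (!SU.IsUnbuffered)
    return 0;
  return SU.ReadyCycle > CurrCycle ? SU.ReadyCycle - CurrCycle : 0;
}

// Mirrors the comparison above: the candidate with fewer stall cycles wins.
// Returns true if TryCand should replace Cand for this reason.
bool preferTry(const ToySUnit &TryCand, const ToySUnit &Cand,
               unsigned CurrCycle) {
  return latencyStallCycles(TryCand, CurrCycle) <
         latencyStallCycles(Cand, CurrCycle);
}
```

In this model, if neither candidate is marked unbuffered, both report zero
stall cycles and the comparison never separates them, which would leave the
decision to later heuristics such as clustering.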

> > I have a feeling there is something wrong with my machine model in the
> > R600 backend, but I've experimented with a few variations of it and have
> > been unable to solve this problem.  Does anyone have any idea what I
> > might be doing wrong?
> 
> Sorry, not without actually looking through the debug output. The output lists the cycle time at each instruction, so you can see where the scheduler thinks the stalls are.
> 

There are actually 31 resources defined for loads.  However, the
hardware does not have 31 load units.  It has 1 load unit that can
hold up to 31 loads waiting to be executed, and only 1 load can be
executed at a time.
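
To make the hardware behavior concrete, here is a toy C++ model of a single
load unit with a bounded queue.  It is only a sketch: issueLoads and
IssueStats are hypothetical names, and it assumes that "one load at a time"
means at most one load starts per cycle, with each load completing Latency
cycles after it starts.  The real hardware may behave differently.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>

// Hypothetical toy model, not LLVM code.
struct IssueStats {
  unsigned LastIssueCycle; // cycle in which the final load was issued
  unsigned MaxOutstanding; // peak number of loads in flight
};

// Issue NumLoads loads through one unit that starts at most one load per
// cycle and can have at most QueueDepth loads outstanding.
IssueStats issueLoads(unsigned NumLoads, unsigned QueueDepth,
                      unsigned Latency) {
  assert(QueueDepth > 0 && Latency > 0 && "model needs positive parameters");
  std::deque<unsigned> DoneCycles; // completion cycle of each in-flight load
  IssueStats Stats{0, 0};
  unsigned Cycle = 0;
  for (unsigned I = 0; I < NumLoads; ++I) {
    // Retire loads that have completed by this cycle.
    while (!DoneCycles.empty() && DoneCycles.front() <= Cycle)
      DoneCycles.pop_front();
    // If the queue is full, wait until the oldest load completes.
    if (DoneCycles.size() == QueueDepth) {
      Cycle = DoneCycles.front();
      DoneCycles.pop_front();
    }
    DoneCycles.push_back(Cycle + Latency);
    Stats.LastIssueCycle = Cycle;
    Stats.MaxOutstanding =
        std::max<unsigned>(Stats.MaxOutstanding, DoneCycles.size());
    ++Cycle; // at most one load starts per cycle
  }
  return Stats;
}
```

Under these assumptions, 100 loads with a 10 cycle latency never have more
than 10 loads in flight, so a queue depth of 31 is never the limiting factor
in this model; a depth of 4, by contrast, would force a wait after every
fourth load.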

Pick Top CLUSTER   
Scheduling SU(43) %vreg46<def> = S_BUFFER_LOAD_DWORD_IMM %vreg9, 48; mem:LD4[<unknown>] SGPR_32:%vreg46 SReg_128:%vreg9
  SReg_32: 45 > 44(+ 0 livethru)
  VS_32: 51 > 18(+ 0 livethru)
  Ready @46c
  HWLGKM +1x105u
  TopQ.A BotLatency SU(43) 78c
  *** Max MOps 1 at cycle 46
Cycle: 47 TopQ.A
TopQ.A @47c
  Retired: 47
  Executed: 47c
  Critical: 47c, 47 MOps
  ExpectedLatency: 10c
  - Latency limited.
BotQ.A RemLatency SU(1698) 99c
  TopQ.A + Remain MOps: 1692
TopQ.A RemLatency SU(201) 97c
  BotQ.A + Remain MOps: 1647
BotQ.A: 1698 1694 1695

Above is some example debugging output.  Where is the cycle time
here?

> BTW- I just checked in a small fix for in-order scheduling that might make debugging this easier.
> 

I will take a look at this.

Thanks,
Tom

> Andy
> 
> > Here are my resource definitions from lib/Target/R600/SISchedule.td
> > 
> > // BufferSize = 0 means the processors are in-order.
> > let BufferSize = 0 in {
> > 
> > // XXX: Are the resource counts correct?
> > def HWBranch : ProcResource<1>;  
> > def HWExport : ProcResource<7>;   // Taken from S_WAITCNT
> > def HWLGKM   : ProcResource<31>;  // Taken from S_WAITCNT
> > def HWSALU   : ProcResource<1>;  
> > def HWVMEM   : ProcResource<15>;  // Taken from S_WAITCNT
> > def HWVALU   : ProcResource<1>;
> > 
> > }
> 
> > 
> > Thanks,
> > Tom
>