[LLVMdev] Question about load clustering in the machine scheduler
Tom Stellard
tom at stellard.net
Fri Mar 27 07:52:04 PDT 2015
On Thu, Mar 26, 2015 at 11:50:20PM -0700, Andrew Trick wrote:
>
> > On Mar 26, 2015, at 7:36 PM, Tom Stellard <tom at stellard.net> wrote:
> >
> > Hi,
> >
> > I have a program with over 100 loads (each with a 10 cycle latency)
> > at the beginning of the program, and I can't figure out how to get
> > the machine scheduler to intermix ALU instructions with the loads to
> > effectively hide the latency.
> >
> > It seems the issue is with load clustering. I restrict load clustering
> > to 4 at a time, but when I look at the debug output, the loads are
> > always being scheduled based on the fact that they are clustered, e.g.
> >
> > Pick Top CLUSTER
> > Scheduling SU(10) %vreg13<def> = S_BUFFER_LOAD_DWORD_IMM %vreg9, 4; mem:LD4[<unknown>] SGPR_32:%vreg13 SReg_128:%vreg9
>
> Well, only 4 loads in a sequence should have the “cluster” edges. You should be able to see that when the DAG is printed before scheduling.
>
There are 4 consecutive 'Pick Top CLUSTER' picks, then one 'Pick Top WEAK',
and then the pattern repeats. All of these are loads.
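
For reference, the 4-at-a-time restriction is imposed through the target's
load-clustering hook, roughly like this (a simplified sketch from memory,
not the exact code in my tree; the hook name and the off-by-one handling of
the count may differ):

  // LoadClusterMutation asks the target whether two neighboring loads
  // should be clustered, passing the running cluster length, so returning
  // false beyond a small count caps how many loads get "cluster" edges.
  bool SIInstrInfo::shouldClusterLoads(MachineInstr *FirstLdSt,
                                       MachineInstr *SecondLdSt,
                                       unsigned NumLoads) const {
    return NumLoads < 4;
  }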
> Even without that limit, stalls take precedence over load clustering. So when you run out of load resources (15?) the scheduler should choose something else.
>
Is this the code that checks for stalls?

  if (tryLess(Zone.getLatencyStallCycles(TryCand.SU),
              Zone.getLatencyStallCycles(Cand.SU), TryCand, Cand, Stall))

It is disabled if (!SU->isUnbuffered).
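
For context, getLatencyStallCycles itself looks roughly like this
(paraphrasing MachineScheduler.cpp from memory, so the details may be
slightly off):

  // Returns how many cycles the zone would stall waiting for SU, but only
  // for unbuffered SUnits; everything else is treated as never stalling.
  unsigned SchedBoundary::getLatencyStallCycles(SUnit *SU) {
    if (!SU->isUnbuffered)
      return 0;
    unsigned ReadyCycle = isTop() ? SU->TopReadyCycle : SU->BotReadyCycle;
    return ReadyCycle > CurrCycle ? ReadyCycle - CurrCycle : 0;
  }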
> > I have a feeling there is something wrong with my machine model in the
> > R600 backend, but I've experimented with a few variations of it and have
> > been unable to solve this problem. Does anyone have any idea what I
> > might be doing wrong?
>
> Sorry, not without actually looking through the debug output. The output lists the cycle time at each instruction, so you can see where the scheduler thinks the stalls are.
>
There are 31 resources defined for loads, but the hardware doesn't actually
have 31 load units. There is 1 load unit that can hold up to 31 loads
waiting to be executed, and only 1 load can be executed at a time.
Here is some example debug output:
Pick Top CLUSTER
Scheduling SU(43) %vreg46<def> = S_BUFFER_LOAD_DWORD_IMM %vreg9, 48; mem:LD4[<unknown>] SGPR_32:%vreg46 SReg_128:%vreg9
SReg_32: 45 > 44(+ 0 livethru)
VS_32: 51 > 18(+ 0 livethru)
Ready @46c
HWLGKM +1x105u
TopQ.A BotLatency SU(43) 78c
*** Max MOps 1 at cycle 46
Cycle: 47 TopQ.A
TopQ.A @47c
Retired: 47
Executed: 47c
Critical: 47c, 47 MOps
ExpectedLatency: 10c
- Latency limited.
BotQ.A RemLatency SU(1698) 99c
TopQ.A + Remain MOps: 1692
TopQ.A RemLatency SU(201) 97c
BotQ.A + Remain MOps: 1647
BotQ.A: 1698 1694 1695
Where is the cycle time in this output?
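
To be clear about what I mean by 1 load unit with a 31-entry queue: in
machine-model terms I think the hardware would map to something more like
the sketch below (assuming I understand the BufferSize semantics correctly;
this is not what SISchedule.td currently has), though I'm not sure how that
would interact with the in-order heuristics.

  // Sketch: a single LGKM unit whose buffer can hold up to 31 outstanding
  // loads, instead of 31 independent unbuffered units.
  def HWLGKM : ProcResource<1> {
    let BufferSize = 31;
  }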
> BTW- I just checked in a small fix for in-order scheduling that might make debugging this easier.
>
I will take a look at this.
Thanks,
Tom
> Andy
>
> > Here are my resource definitions from lib/Target/R600/SISchedule.td
> >
> > // BufferSize = 0 means the processors are in-order.
> > let BufferSize = 0 in {
> >
> > // XXX: Are the resource counts correct?
> > def HWBranch : ProcResource<1>;
> > def HWExport : ProcResource<7>; // Taken from S_WAITCNT
> > def HWLGKM : ProcResource<31>; // Taken from S_WAITCNT
> > def HWSALU : ProcResource<1>;
> > def HWVMEM : ProcResource<15>; // Taken from S_WAITCNT
> > def HWVALU : ProcResource<1>;
> >
> > }
>
> >
> > Thanks,
> > Tom
> > _______________________________________________
> > LLVM Developers mailing list
> > LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>