[LLVMdev] "Anti" scheduling with OoO cores?

Tue Nov 4 00:34:18 PST 2014

> On Nov 2, 2014, at 4:46 AM, James Molloy <james at jamesmolloy.co.uk> wrote:
> 
> Hi Andy, Dave,
> 
> I've been doing a bit of experimentation trying to understand the schedmodel a bit better and improving modelling of FDIV (on Cortex-A57).
> 
> FDIV is not pipelined, and blocks other FDIV operations (FDIVDrr and FDIVSrr). This seems to be already semi-modelled, with a "ResourceCycles=[18]" line in the SchedWriteRes for this instruction.

Pretty typical - we should be able to handle this.

> This doesn't seem to work (a poor schedule is produced) so I changed it to also require another resource that I modelled as unbuffered (BufferSize=0), in the hope that this would "block" other FDIVs... no joy.

That should create a hazard that blocks scheduling of the FDIVs. So that was the right thing to do, assuming that’s what you want - register pressure could suffer in some cases.

ResourceCycles is an ordered list. It’s only going to stall if the unbuffered resource is the one taking 18 cycles. You didn’t attach your patch though, so I can’t be sure what your actually did...

> Then I noticed that the MicroOpBufferSize is set to 128, which is wildly high as Cortex-A57 has separated smaller reorder buffers, not one larger reorder buffer.
> Even reducing it down to "2" made no effect, the divs were scheduled in a clump together. But "1" and "0" (denoting in-order) produced a nice schedule.

There’s a huge difference between 0, 1, and > 1. Beyond that, the generic scheduler only cares in some cases of very tight loops. Your example is straight line code so it won’t matter. You could model buffers on the individual resources instead to be more precise, but I don’t think it will matter much unless you start customizing heuristics by plugging in a new scheduling strategy.

> I'd expect an OoO machine with a buffer of 2 ops would produce a very similar schedule as an in-order machine. So where am I going wrong?

See above. The machine model is much more precise than the scheduler’s internal model. It would be possible to approximately simulate the behavior of the reorder buffer, but since most OoO machines have such large buffers now, it’s not worth adding the cost and complexity to the generic scheduler. At least I wasn’t able to find real examples where it mattered.

> Sample attached - I'd expect the FDIVs to be equally spread across the MULs.

The stalls should be modeled as long as the FDIV uses an unbuffered resource for 18 cycles and the MUL does not use the same resource at all. But the way in-order hazards work in the scheduler, you may end up with three MULs strangely smashed between two FDIVs.

To get a more even dispersement, you can try BufferSize=1. That basically prioritizes for latency, but is very sensitive to a bunch of heuristics.

> (The extension to this I want to model is that we can have 2 S-register FDIVs in parallel but only one D-reg FDIV, and never both, but that can wait until I've understood what's going on here!).

Hmm. The implementation of inorder scheduling with the new machine model is pretty lame. It was a quick fix to get something working. It needs to be extended so that it separately counts cycles for multiple units of the same resource. It would be straightforward enough to do that. I can’t really volunteer at the moment though.

-Andy

> 
> Cheers,
> 
> James