[LLVMdev] New machine model questions

Andrew Trick atrick at apple.com
Thu Jan 30 15:47:56 PST 2014

On Jan 30, 2014, at 5:17 AM, Daniel Sanders <Daniel.Sanders at imgtec.com> wrote:

>> -----Original Message-----
>> From: Andrew Trick [mailto:atrick at apple.com]
>> Sent: 28 January 2014 23:10
>> To: Daniel Sanders
>> Cc: LLVM Developers Mailing List (llvmdev at cs.uiuc.edu)
>> Subject: Re: New machine model questions
>> On Jan 28, 2014, at 9:22 AM, Daniel Sanders <Daniel.Sanders at imgtec.com>
>> wrote:
>>> <snip>
>> The scheduler does not model which dispatch queue (or is it issue queue?)
>> the instructions reside in. For an OOO core, I think this is almost totally
>> unpredictable anyway. We assume (hope) that the hardware can balance the
>> queues.
> That would explain some of my confusion.
> I think we ought to double-check our terminology just to be sure we are talking about the same things.
> I'm using 'dispatch' to mean the last stage of the processor frontend (fetch/decode/dispatch possibly with other stages such as register renaming amongst them) which passes instructions on to split/unified reservation station(s). Dispatch is the last in-order stage before out-of-order execution begins. I'm then using 'issue' to mean an instruction being selected by a reservation station for execution in a pipeline and passed to it.

That makes sense. The machine model has an IssueWidth value that should really be called DispatchWidth by this terminology. It is really the number of micro-ops that can be dispatched per cycle.
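For readers mapping this back to the TableGen side: IssueWidth lives on SchedMachineModel. A minimal sketch, with placeholder values rather than the actual P5600 numbers:

```tablegen
// Hypothetical model definition; the numbers are illustrative only.
def ExampleP5600Model : SchedMachineModel {
  // Despite the name, this bounds micro-ops *dispatched* per cycle,
  // not the per-reservation-station issue rate.
  let IssueWidth = 2;
  let LoadLatency = 4;
}
```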

> I would say that dispatching to reservation stations is fairly predictable since it's still in-order at that point (issue on the other hand is unpredictable). In the case of a unified reservation station, dispatch just passes the instructions to the only reservation station. For split reservation stations, it generally selects a reservation station for an instruction based on the opcode (e.g. adds/subs/shifts to one reservation station, loads/stores to another, fpu ops to another) and passes the instruction to it.

The per-operand machine model does not impose any rules on the selection of processor resources. I could have 3 ALUs where 2 can shift:

def UnitA : ProcResource<3>;
def UnitAS : ProcResource<2> { let Super = UnitA; }

The scheduler can issue any combination of add, shift, shift without counting any stalls.
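Continuing that sketch, instructions would claim these units through SchedWriteRes definitions (the write names here are hypothetical). Because UnitAS declares UnitA as its Super, a shift consumes one of the two shift-capable units and simultaneously counts against the pool of three ALUs:

```tablegen
// Hypothetical write resources for the 3-ALU / 2-shifter example above.
def WriteALU   : SchedWriteRes<[UnitA]>  { let Latency = 1; }
def WriteShift : SchedWriteRes<[UnitAS]> { let Latency = 1; }
```

With three UnitA and two UnitAS available per cycle, {add, shift, shift} fits without a stall, while a third shift in the same cycle would be counted as contention.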

> The P5600 is using split reservation stations and its dispatch is predictable in most cases. When the instructions are statically routed, the pipelines in which they can execute are all under the same reservation station. It's not necessary to model which reservation station they were dispatched to in these cases because there's no choice. A small number of instructions are dynamically routed according to the number processed in a given cycle and the previous outcome of the decision. Once routed to a reservation station it is not possible to issue to all the pipelines that could potentially execute the instruction (e.g. AGQ cannot issue to ALU, only to AL2) and the decision cannot be reversed. For these dynamically routed instructions, it seems that the P5600 is in conflict with the current machine model. I'll look into resolving this with a MachineSchedStrategy first.

I agree, and think you're taking the right approach. In general, the logic for picking dispatch queues/reservation stations can be complicated and may depend on the current state/history of the queues. The MI scheduler with per-operand model takes the approach of underconstraining the schedule, then allowing you to write custom logic to add constraints. The Itineraries + HazardChecker take the opposite approach.

> Suppose we have the following assembly (for the sake of this example, I'm going to ignore the use of history in making the dispatch decisions):
> insn1: addu $1, $2, $3
> insn2: addu $4, $5, $6
> insn3: addu $1, $1, $7
> insn4: addu $4, $4, $8
> Dispatch would receive these instructions two at a time over two cycles. In cycle t+0 it checks the opcodes of insn1 and insn2 and notes that both can be dispatched to either ALQ or AGQ. It can't send both to any one of these so it sends insn1 to ALQ and insn2 to AGQ. In cycle t+1, it does the same thing and dispatches insn3 to ALQ and insn4 to AGQ.
> ALQ receives insn1 at t+0. During t+1 it finds that insn1's dependencies are resolved and it is ready to issue. It issues insn1 to the only suitable pipeline under its control, ALU. Similarly, it receives insn3 at t+1 and issues it to ALU in t+2.
> Meanwhile AGQ is doing the same thing with the instructions it receives. AGQ receives insn2 at t+0. During t+1 it finds that insn2's dependencies are resolved and it is ready to issue. It issues insn2 to the only suitable pipeline under its control, AL2. Similarly, it receives insn4 at t+1 and issues it to AL2 in t+2.

None of this is modeled. My thinking was that
(a) Hardware dispatches instructions to the least constrained queues/ports.
(b) Dispatch within a super-resource or group depends on state/history.
(c) Anyone who cares that much about scheduling should write their own strategy, piggybacking on the available infrastructure. Several targets have successfully done that with minimal effort.

More generally, my agenda was to preserve IR order unless we have strong evidence that it is suboptimal. I believe in making codegen easier to debug.


>> <snip>
>> I did not realize you were using processor groups. For many (relatively
>> simple) cores the functional units can be expressed as a hierarchy. An
>> instruction either needs a specific unit, or it can be issued to some broader
>> class. You can do that without any groups. I added ProcResGroup for
>> SandyBridge because instructions can issue to some subset of ports, and
>> these subsets are overlapping. I think it is possible to use both groups and
>> super resources in the same model, but that may cause confusion. I was simply
>> suggesting something like this, for example:
>> <snip>
> Ok, I've switched to this method of defining the hierarchy. I was following Haswell's example but I don't need overlapping subsets.
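For the archives, a hierarchy along the lines suggested might look like this; the resource names and counts are illustrative, not the committed P5600 definitions:

```tablegen
// One ProcResource per reservation station; each pipeline it can issue
// to is a sub-resource via Super.
def ALQ : ProcResource<1>;                       // ALU queue
def ALU : ProcResource<1> { let Super = ALQ; }   // its only pipe
def AGQ : ProcResource<3>;                       // address-gen queue, 3 pipes
def AL2 : ProcResource<1> { let Super = AGQ; }
```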
>> The relationship between ALU2 and ACQ is not clear to me yet, so I'm not
>> sure what's intended in your example.
> ALU2 is the issue port to one of the pipelines under the control of the AGQ reservation station. ALU2 is similar in principle to one of the HWPortX resources from the Haswell model, similarly AGQ corresponds to HWPortAny (except it's one of three reservation stations and not the only one).
>> FYI: BufferSize is a nice feature, but you can fairly safely omit it for an OOO
>> core. The scheduler will by default assume an infinite dispatch queue and
>> almost certainly generate the same schedule unless you have very large
>> blocks! The scheduler does attempt to determine whether the OOO buffer
>> will reach capacity across iterations of single block loops, but it only looks at
>> the model's MicroOpBufferSize for this computation, not the per-resource
>> buffer size.
>> -Andy
> That's a good point; block sizes tend to be small in most code. I'll have to look into the effect on heavily unrolled and vectorized code such as FFT/DCT, where the blocks are likely to be large.
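A sketch of the distinction being discussed, with placeholder values: the model-wide MicroOpBufferSize is what the scheduler consults for the cross-iteration capacity check, while a per-resource BufferSize describes one reservation station's depth and is not used in that computation:

```tablegen
// Illustrative fragment, not the real P5600 numbers.
def ExampleOOOModel : SchedMachineModel {
  let IssueWidth = 2;
  let MicroOpBufferSize = 48;  // OOO window used for the loop computation
}
// Reservation-station depth; not consulted by the capacity check above.
def ExampleALQ : ProcResource<1> { let BufferSize = 16; }
```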
