[llvm-dev] [MachineScheduler] Question about IssueWidth / NumMicroOps

Wed May 9 15:46:36 PDT 2018

> On May 9, 2018, at 9:43 AM, Jonas Paulsson <paulsson at linux.vnet.ibm.com> wrote:
> 
> Hi,
> 
> I would like to ask what IssueWidth and NumMicroOps refer to in MachineScheduler, just to be 100% sure what the intent is.
> Are we modeling the decoder phase or the execution stage?
> 
> Background:
> 
> First of all, there seems to be different meanings of "issue" depending on which platform you're on:
> 
> https://stackoverflow.com/questions/23219685/what-is-the-meaning-of-instruction-dispatch:
> "... "Dispatch in this sense means either the sending of an instruction to a queue in preparation to be scheduled in an out-of-order
> processor (IBM's use; Intel calls this issue) or sending the instruction to the functional unit for execution (Intel's use; IBM calls this issue)..."
> 
> So "issue" could mean either of
> (1) "the sending of an instruction to a queue in preparation to be scheduled in an out-of-order processor"
> (2) "sending the instruction to the functional unit for execution"
> 
> I would hope to be right when I think that IssueWidth (1) would relate to the decoding capacity, while (2) would reflect the executional
> capacity per cycle.
> 
> There is this comment in TargetSchedule.td:
> 
> // Use BufferSize = 0 for resources that force "dispatch/issue
> // groups". (Different processors define dispath/issue
> // differently. Here we refer to stage between decoding into micro-ops
> // and moving them into a reservation station.) Normally NumMicroOps
> // is sufficient to limit dispatch/issue groups. However, some
> // processors can form groups of with only certain combinitions of
> // instruction types. e.g. POWER7.
> 
> This seems to say that in MachineScheduler, (1) is in effect, right?
> 
> Furthermore, I see
> 
> def SkylakeServerModel : SchedMachineModel {
> // All x86 instructions are modeled as a single micro-op, and SKylake can
> // decode 6 instructions per cycle.
>    let IssueWidth = 6;
> 
> def BroadwellModel : SchedMachineModel {
> // All x86 instructions are modeled as a single micro-op, and HW can decode 4
> // instructions per cycle.
>    let IssueWidth = 4;
> 
> def SandyBridgeModel : SchedMachineModel {
> // All x86 instructions are modeled as a single micro-op, and SB can decode 4
> // instructions per cycle.
> // FIXME: Identify instructions that aren't a single fused micro-op.
>    let IssueWidth = 4;
> 
> , which also seem to indicate (1).
> 
> What's more, I see that checkHazard() returns true if '(CurrMOps + uops > SchedModel->getIssueWidth())'.
> This means that the SU will be put in Pending instead of Available based on the number of microops it uses.
> To me this seems like an in-order decoding hazard check, since an OOO machine will rearrange the microops
> during execution, so there is not much use in checking for the sum of the executional capacity of the current SU
> candidate and the immediately previously scheduled here. I then again would say (1). (Checking for decoder groups
> pre-RA does BTW not make much sense on SystemZ, but that's another question).
> 
> checkHazard() also returns a hazard if
> 
>     (CurrMOps > 0 &&
>       ((isTop() && SchedModel->mustBeginGroup(SU->getInstr())) ||
>        (!isTop() && SchedModel->mustEndGroup(SU->getInstr()))))
> 
> , which also per the same lines makes me think that this is intended for the instruction stream management, or (1).
> 
> There is also the fact that
> 
> IsResourceLimited =
>       checkResourceLimit(SchedModel->getLatencyFactor(), getCriticalCount(),
>                          getScheduledLatency());
> 
> , which is to me admittedly hard to grasp, but it seems that the scheduled latency (std::max(ExpectedLatency, CurrCycle))
> affects the resource heuristic so that if scheduled latency is low enough, it becomes active. This then means that CurrCycle
> actually affects when resource balancing goes into action, and CurrCycle in turn is advanced when NumMicroOps reach the
> IssueWidth. So somehow it all depends on modelling the instructions to fill up the IssueWidth by their microops. This could
> actually either be
> * Decoder cycles: NumDecoderSlots(SU) => SU->NumMicroOps and DecoderCapacity => IssueWidth  (1)
> or
> * Execution cycles: NumExecutedUOps(SU) => SU->NumMicroOps and ApproxMaxExecutedUOpsPerCycle => IssueWidth (2)
> 
> They would at least in this context be somewhat equivalent in driving CurrCycle forward.
> 
> Please, let me know about (1) or (2)  :-)
> 
> thanks
> 
> /Jonas

I'll first try to frame your question with the background philosophy, then give you my take, but other feedback and discussion is welcome.

The LLVM machine model is an abstract machine. A real micro-architecture can have any number of buffers, queues, and stages. Declaring that a given machine-independent abstract property corresponds to a specific physical property across all subtargets can't be done. That said, target maintainers still need to know how to relate the abstract to the physical. The target maintainer can then extend the abstract model with their own machine-specific resources.

The abstract pipeline is built around the notion of an "issue point". This is merely a reference point for counting machine cycles. The primary goal of the scheduler is to simply know when enough "time" has passed between scheduling dependent instructions.

The physical machine will have pipeline stages that delay execution. The scheduler does not model those delays because they are irrelevant as long as they are consistent. Inaccuracies arise when instructions have different execution delays relative to each other, in addition to their intrinsic latency. To model those delays, the abstract model has various tools like ReadAdvance (bypassing) and the ability to extend the model with arbitrary "resources" and associate a cycle count with those resources for each instruction. (One tool currently missing is the ability to add a delay to ResourceCycles, but that would be easy to add).
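
As a rough illustration of how ReadAdvance (bypassing) shortens the latency the scheduler sees between a producer and its consumer, here is a toy sketch in C++. It is not LLVM code: the name effectiveLatency and its parameters are made up for this example, and it assumes the read advance is simply subtracted from the producer's latency and clamped at zero.

```cpp
#include <cassert>

// Toy model, not LLVM code. Returns the number of cycles a consumer must
// wait after its producer issues.
//   writeLatency: cycles until the producer's result is normally available.
//   readAdvance:  cycles by which this consumer can read the value early
//                 (e.g. through a bypass); assumed to be subtracted directly.
int effectiveLatency(int writeLatency, int readAdvance) {
  assert(writeLatency >= 0 && "latency cannot be negative");
  int latency = writeLatency - readAdvance;
  // A consumer can never see a negative latency, so clamp at zero.
  return latency > 0 ? latency : 0;
}
```

For example, under these assumptions a 4-cycle producer feeding a consumer with a read advance of 1 behaves like a 3-cycle dependence, and a read advance larger than the latency yields 0.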

Now we come to out-of-order execution, or, more generally, instruction buffers. Part of the CPU pipeline is always in-order. The issue point, which is the point of reference for counting cycles, only makes sense as an in-order part of the pipeline. Other parts of the pipeline are sometimes falling behind and sometimes catching up. It's only interesting to model those other, decoupled parts of the pipeline if they may be predictably resource constrained in a way that the scheduler can exploit.

The LLVM machine model distinguishes between in-order constraints and out-of-order constraints so that the target's scheduling strategy can apply appropriate heuristics. For a well-balanced CPU pipeline, out-of-order resources would not typically be treated as a hard scheduling constraint. For example, in the GenericScheduler, a delay caused by limited out-of-order resources is not directly reflected in the number of cycles that the scheduler sees between issuing an instruction and its dependent instructions. In other words, out-of-order resources don't directly increase the latency between pairs of instructions. However, they can still be used to detect potential bottlenecks across a sequence of instructions and bias the scheduling heuristics appropriately.
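
One way to picture "detecting a bottleneck across a sequence" is the toy C++ sketch below. It is not the GenericScheduler implementation: ResourceUse, isResourceLimited, and the ceiling-division estimate are assumptions made purely for illustration. It flags a sequence as resource limited when the busiest out-of-order resource would need more cycles than the latency-based critical path, without changing the latency between any pair of instructions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record: one instruction occupying one kind of resource.
struct ResourceUse {
  unsigned resource; // index into unitsPerResource
  unsigned cycles;   // cycles the resource is occupied
};

// Toy model, not LLVM code. Returns true when the busiest resource needs
// more cycles than the latency-based critical path of the sequence.
bool isResourceLimited(const std::vector<ResourceUse> &uses,
                       const std::vector<unsigned> &unitsPerResource,
                       unsigned criticalPathCycles) {
  // Sum the occupied cycles per resource kind over the whole sequence.
  std::vector<unsigned> total(unitsPerResource.size(), 0);
  for (const ResourceUse &u : uses) {
    assert(u.resource < total.size() && "unknown resource index");
    total[u.resource] += u.cycles;
  }
  // Estimate the cycles each resource needs, spread over its units.
  unsigned busiest = 0;
  for (std::size_t r = 0; r < total.size(); ++r) {
    assert(unitsPerResource[r] > 0 && "resource needs at least one unit");
    unsigned needed =
        (total[r] + unitsPerResource[r] - 1) / unitsPerResource[r];
    busiest = std::max(busiest, needed);
  }
  return busiest > criticalPathCycles;
}
```

With six single-cycle uses of one resource and a critical path of 4 cycles, this reports a bottleneck when the resource has one unit and none when it has two. A strategy could use such a signal to bias its heuristics rather than to stall.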

IssueWidth is meant to be a hard in-order constraint. (We sometimes call this kind of constraint a "hazard".) In the GenericScheduler strategy, no more than IssueWidth micro-ops can ever be scheduled in a particular cycle. So, if an instruction sequence has enough ILP to exceed IssueWidth, that will immediately increase the current scheduling cycle, and effectively bring dependent instructions into the ready queue earlier.
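
To make the hard-constraint behavior concrete, here is a toy sketch in C++. It is not LLVM code: assignIssueCycles and its variables are invented for this example, it ignores latencies and resources entirely, and the extra "cycle is non-empty" guard is an assumption so that an oversized instruction can still issue alone. The comparison itself follows the shape of the checkHazard() test quoted above.

```cpp
#include <cassert>
#include <vector>

// Toy model, not LLVM code. Takes the micro-op count of each instruction in
// scheduled order and returns the issue cycle assigned to each one, such
// that a cycle never accepts an instruction that would push it past
// issueWidth micro-ops.
std::vector<unsigned> assignIssueCycles(const std::vector<unsigned> &uops,
                                        unsigned issueWidth) {
  assert(issueWidth > 0 && "issue width must be positive");
  std::vector<unsigned> cycles;
  unsigned currCycle = 0;
  unsigned currMOps = 0; // micro-ops already placed in currCycle
  for (unsigned n : uops) {
    // Hazard: this instruction does not fit in the current cycle.
    if (currMOps > 0 && currMOps + n > issueWidth) {
      ++currCycle; // advance the scheduler's notion of time
      currMOps = 0;
    }
    cycles.push_back(currCycle);
    currMOps += n;
  }
  return cycles;
}
```

With an issue width of 4, five single-micro-op instructions land in cycles 0, 0, 0, 0, 1. The fifth instruction is what advances the current cycle, which is also the moment at which instructions waiting on latency move closer to being ready.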

In practice, I think IssueWidth is useful for modeling the bottleneck between the decoder (after micro-op expansion) and the out-of-order reservation stations. If the total number of reservation stations is also a bottleneck, or if any other pipeline stage has a bandwidth limitation, then that can be naturally modeled by adding an out-of-order processor resource.

> I would hope to be right when I think that IssueWidth (1) would relate to the decoding capacity, while (2) would reflect the executional
> capacity per cycle.

I don't think IssueWidth necessarily has anything to do with instruction decoding or the execution capacity of functional units. I will say that we expect the decoding capacity to "keep up with" the issue width. If the IssueWidth property also serves that purpose for you, I think that's fine. In the case of the x86 machine models above, since each instruction is a micro-op, I don't see any useful distinction between decode bandwidth and in-order issue of micro-ops.

Some target maintainers may want to schedule for an OOO machine as if it were in-order. They are welcome to do that (and hopefully have plenty of architectural registers). The scheduling mode can be broadly selected with the infamous MicroOpBufferSize setting, or individual resources can be marked in-order with BufferSize=0. And, as always, I suggest writing your own scheduling strategy if you care that deeply about scheduling for the peculiarities of your machine.

(caveat: there may still be GenericScheduler implementation deficiencies because it is trying to support more scheduling features than we have in-tree targets).

Sorry, I don't have time to draw diagrams and tables. Hopefully you can make sense of my long-form rambling.

Thanks for the question.

-Andy