[llvm-dev] [MachineScheduler] Question about IssueWidth / NumMicroOps

Jonas Paulsson via llvm-dev llvm-dev at lists.llvm.org
Wed May 9 09:43:27 PDT 2018


Hi,

I would like to ask what IssueWidth and NumMicroOps refer to in 
MachineScheduler, just to be 100% sure what the intent is.
Are we modeling the decoder phase or the execution stage?

Background:

First of all, "issue" seems to mean different things depending on which
platform you're on:

https://stackoverflow.com/questions/23219685/what-is-the-meaning-of-instruction-dispatch:
"... "Dispatch in this sense means either the sending of an instruction 
to a queue in preparation to be scheduled in an out-of-order
processor (IBM's use; Intel calls this issue) or sending the instruction 
to the functional unit for execution (Intel's use; IBM calls this issue)..."

So "issue" could mean either of
(1) "the sending of an instruction to a queue in preparation to be 
scheduled in an out-of-order processor"
(2) "sending the instruction to the functional unit for execution"

I hope I am right in thinking that "issue" in sense (1) would relate to
the decoding capacity, while sense (2) would reflect the execution
capacity per cycle.

There is this comment in TargetSchedule.td:

// Use BufferSize = 0 for resources that force "dispatch/issue
// groups". (Different processors define dispatch/issue
// differently. Here we refer to the stage between decoding into
// micro-ops and moving them into a reservation station.) Normally
// NumMicroOps is sufficient to limit dispatch/issue groups. However,
// some processors can form groups with only certain combinations of
// instruction types, e.g. POWER7.

This seems to say that in MachineScheduler, (1) is in effect, right?

Furthermore, I see

def SkylakeServerModel : SchedMachineModel {
  // All x86 instructions are modeled as a single micro-op, and Skylake can
  // decode 6 instructions per cycle.
  let IssueWidth = 6;

def BroadwellModel : SchedMachineModel {
  // All x86 instructions are modeled as a single micro-op, and HW can
  // decode 4 instructions per cycle.
  let IssueWidth = 4;

def SandyBridgeModel : SchedMachineModel {
  // All x86 instructions are modeled as a single micro-op, and SB can
  // decode 4 instructions per cycle.
  // FIXME: Identify instructions that aren't a single fused micro-op.
  let IssueWidth = 4;

, which also seems to indicate (1).

What's more, I see that checkHazard() returns true if
'(CurrMOps + uops > SchedModel->getIssueWidth())'.
This means that the SU will be put in Pending instead of Available based
on the number of micro-ops it uses.
To me this looks like an in-order decoding hazard check, since an OOO
machine will rearrange the micro-ops during execution, so there is not
much point in checking the sum of the execution capacity of the current
SU candidate and of the immediately previously scheduled ones. So again
I would say (1). (Checking for decoder groups pre-RA does, BTW, not make
much sense on SystemZ, but that's another question.)

checkHazard() also returns a hazard if

     (CurrMOps > 0 &&
       ((isTop() && SchedModel->mustBeginGroup(SU->getInstr())) ||
        (!isTop() && SchedModel->mustEndGroup(SU->getInstr()))))

, which, along the same lines, makes me think that this is intended for
instruction stream management, i.e. (1).

There is also the fact that

IsResourceLimited =
    checkResourceLimit(SchedModel->getLatencyFactor(), getCriticalCount(),
                       getScheduledLatency());

, which is admittedly hard for me to grasp, but it seems that the
scheduled latency (std::max(ExpectedLatency, CurrCycle)) affects the
resource heuristic so that it becomes active if the scheduled latency is
low enough. This means that CurrCycle actually affects when resource
balancing goes into action, and CurrCycle in turn is advanced when
NumMicroOps reaches the IssueWidth. So somehow it all depends on
modeling the instructions to fill up the IssueWidth with their
micro-ops. This could actually be either
* Decoder cycles: NumDecoderSlots(SU) => SU->NumMicroOps and
  DecoderCapacity => IssueWidth  (1)
or
* Execution cycles: NumExecutedUOps(SU) => SU->NumMicroOps and
  ApproxMaxExecutedUOpsPerCycle => IssueWidth (2)

They would, at least in this context, be somewhat equivalent in driving
CurrCycle forward.

Please let me know whether (1) or (2) is intended  :-)

thanks

/Jonas
