[LLVMdev] New machine model questions

Tue Jan 28 09:22:46 PST 2014

From: Andrew Trick [mailto:atrick at apple.com]
Sent: 24 January 2014 21:52
To: Daniel Sanders
Cc: LLVM Developers Mailing List (llvmdev at cs.uiuc.edu)
Subject: Re: New machine model questions

On Jan 24, 2014, at 2:21 AM, Daniel Sanders <Daniel.Sanders at imgtec.com> wrote:

Hi Andrew,

I seem to be making good progress on the P5600 scheduler using the new machine model but I've got a few questions about it.

Hi Daniel,

These are really good questions. For future reference, I might provide better examples if you attach what you have so far for the model.

How would you represent an instruction that splits into two micro-ops and is dispatched to two different reservation stations?
For example, I have two reservation stations (AGQ and FPQ). An FPU load instruction is split into a load micro-op which is dispatched to AGQ and a writeback micro-op which is dispatched to FPQ.
The AGQ micro-op is issued to a four-cycle latency pipeline called LDST. Three cycles after issue, the LDST pipeline wakes up the FPQ micro-op, which writes the result of the load back to the register file.

This question illustrates the primary difference between the per-operand machine model and the itinerary. The itinerary directly models the stages of each pipeline independently. Some backend maintainers may still want to use itineraries if that level of precision is critical [1]. Another option is extending the new model. [2]

I will assume that each queue is fully pipelined (4 AGQ ops can be in flight).

Forcing all this information into a single SchedWriteRes def would look like this:

def P5600FLD : SchedWriteRes <[P5600UnitAGQ, P5600UnitFP]> {
  let Latency = 5; // 4 cycle load + 1 cycle FP writeback
  let NumMicroOps = 2;
}

This is bad (for an in-order processor) because it prevents FPLoad + FPx from being scheduled in the same cycle and fails to detect a conflict with FP ops scheduled 5 cycles ahead.

A better way to express it would be:

def P5600LD : SchedWriteRes<[P5600UnitAGQ]> { let Latency = 4; }
def P5600FP : SchedWriteRes<[P5600UnitFP]>;

def P5600FLD : WriteSequence<[P5600LD, P5600FP]>;

Unfortunately, the implementation currently aggregates the processor resources, ignoring the fact that they are used on different cycles. This is totally fixable [2]. However, I don't know why you would care, since an out-of-order processor doing its job will make the stalls unpredictable either way.
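For context, the WriteSequence form could sit in a model file roughly as sketched below. This is only an illustration: P5600Model, its IssueWidth value, and the single-unit ProcResource definitions are assumptions made for the example, not taken from any actual P5600 model. The fragment is meant to live in a target's .td tree where the scheduling classes from TargetSchedule.td are already available.

```tablegen
// Hypothetical model record; the name and IssueWidth = 2 are assumptions.
def P5600Model : SchedMachineModel {
  let IssueWidth = 2;
}

let SchedModel = P5600Model in {
  // Assumed resources: one unit behind each reservation station.
  def P5600UnitAGQ : ProcResource<1>;
  def P5600UnitFP  : ProcResource<1>;

  // Load micro-op issued from AGQ to the four-cycle LDST pipeline.
  def P5600LD : SchedWriteRes<[P5600UnitAGQ]> { let Latency = 4; }

  // Writeback micro-op issued from FPQ (Latency is left at its default).
  def P5600FP : SchedWriteRes<[P5600UnitFP]>;

  // FPU load expressed as the load followed by the FP writeback.
  def P5600FLD : WriteSequence<[P5600LD, P5600FP]>;
}
```

An instruction would then refer to P5600FLD wherever it would otherwise name a single SchedWrite, for example through an InstRW entry in the same model.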

Thanks. I'll start with the WriteSequence method and see if testing shows that I need to go any further or not.

The two reservation stations don't seem to be completely independent of each other for these split instructions. The wakeup signal used to wake up the second micro-op seems to be a demand that the micro-op issues in that cycle rather than permission to issue when it's convenient.

Is it possible to use other instructions already scheduled for the same cycle as part of the evaluation of a SchedPredicate in a SchedVariant?
I've got a class of instructions (mostly simple addition) that can dispatch to two different reservation stations (ALQ and AGQ), both of which have a suitable pipeline with the same latency. The dispatch stage can dispatch two instructions per cycle. When it has one instruction from this class it dispatches it to ALQ (this isn't strictly true but I'll come back to that), and when it has two it dispatches one to ALQ and the other to AGQ.

No. The machine model is used to form a scheduling DAG independent of the original schedule. If it's important to be this precise, then I suggest you plug in a new MachineSchedStrategy where you can model stalls for any special cases during scheduling.

You need a super-resource:

def P5600A : ProcResource<2>;
def P5600AGQ : ProcResource<1> { let Super = P5600A; }
def P5600ALQ : ProcResource<1> { let Super = P5600A; }

I'll take a look at MachineSchedStrategy. I don't know how important that precision is likely to be at the moment but I've generally found that the more accurate the machine description is, the harder it is to find one of the bad cases. That experience comes from a particular in-order scheduler in a proprietary compiler so I don't know if I can expect similar things from LLVM or not. I'm expecting out-of-order to help reduce the amount of precision that's needed for a good result but I don't know how much of a reduction I can expect at the moment.

I'm not sure I fully understand the super-resource suggestion. I've attached my WIP so you can take a look at the code in context but the relevant extracts are below.
def P5600IssueALU : ProcResource<1>;
def P5600IssueAL2 : ProcResource<1>;
def P5600ALQ : ProcResGroup<[P5600IssueALU]> { let BufferSize = 16; }
def P5600AGQ : ProcResGroup<[P5600IssueAL2, ...]> {
  let BufferSize = 16;
}
def P5600WriteALU : SchedWriteRes<[P5600IssueALU]>;
def P5600WriteAL2 : SchedWriteRes<[P5600IssueAL2]>;
def P5600WriteEitherALU : SchedWriteVariant<
  [SchedVar<SchedPredicate<[{1}]>, [P5600WriteALU]>, // FIXME: Predicate
   SchedVar<SchedPredicate<[{0}]>, [P5600WriteAL2]>  // FIXME: Predicate
  ]>;

I believe you are suggesting that I change this to:
def P5600IssueEitherALU : ProcResource<2>;
def P5600IssueALU : ProcResource<1> { let Super = P5600IssueEitherALU; }
def P5600IssueAL2 : ProcResource<1> { let Super = P5600IssueEitherALU; }
def P5600ALQ : ProcResGroup<[P5600IssueALU]> { let BufferSize = 16; }
def P5600AGQ : ProcResGroup<[P5600IssueAL2, ...]> {
  let BufferSize = 16;
}
def P5600WriteALU : SchedWriteRes<[P5600IssueALU]>;
def P5600WriteAL2 : SchedWriteRes<[P5600IssueAL2]>;
def P5600WriteEitherALU : SchedWriteRes<[P5600IssueEitherALU]>;

Instructions can then use P5600WriteEitherALU to pick between the two sub-resources at issue time. One curious consequence of this is that by allowing it to pick which pipeline the instruction is issued to, it effectively allows the instruction to pick which reservation station to be dispatched to at issue time. That is backwards: normally dispatch determines the available subset of pipelines. That might not be a significant issue as far as the scheduler output is concerned but it seemed strange to me and it makes me doubt that I've fully understood it.

One thing about the attached WIP. I'm using ItinRW and InstRW at the moment but I'm planning on migrating the ItinRWs to InstRW. The reason I'm not using the Sched<> class on each instruction is that I'm not confident that there is a common set of SchedReadWrite defs that would make sense on the full range of MIPS processor implementations. I'm going to have another think about this once I'm nearer a complete scheduler for P5600.

Is it possible to use historical scheduling decisions as part of the evaluation of a SchedPredicate in a SchedVariant?
I'm fairly certain the answer to this one is 'no' (because scheduling can be performed in both directions) but I'll ask anyway. In the previous question, I said that when the dispatch stage has one instruction that can be dispatched to either ALQ or AGQ it always picks ALQ. The truth of the matter is that historical decisions are used to guess which one is most likely to stall and the dispatch stage picks the other one. I haven't established exactly what information it's using yet though so I can't give a good example.

SchedVariant is really just for opcodes that can use different resources/latency depending on the value of some immediate.

The kind of micro-architectural special rules/heuristics that you are describing are exactly why we have a pluggable MachineSchedStrategy.

That makes sense.

Is there an easy way to check I've covered every valid instruction? I'm thinking it would be helpful if I could get build warnings from tablegen about valid instructions with no scheduling information. This would also prevent someone adding an instruction later and forgetting to add it to the scheduler.

YES! Very good question.

When implementing a new model, it's important to run tablegen with -debug-only=subtarget-emitter.

You should be able to touch your .td, then grab the command via make TOOL_VERBOSE=1

This is the line from ARM:

llvm-tblgen -I /s/fix/lib/Target/ARM -I /s/fix/include -I  /s/fix/include -I /s/fix/lib/Target -gen-subtarget -o  ARMGenSubtargetInfo.inc /s/fix/lib/Target/ARM/ARM.td -debug-only=subtarget-emitter

It will list all instructions and print "No machine model for <subtarget>"
You will also get an assert in the scheduler, unless you add the following flag to your model:

  let CompleteModel = 0;
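
As a rough sketch of where that flag goes, it is set on the SchedMachineModel record itself. The record name P5600Model is an assumption used only for illustration.

```tablegen
// Hypothetical model record; only CompleteModel is the point here.
def P5600Model : SchedMachineModel {
  // Mark the model as incomplete while it is still being written, so
  // instructions without scheduling information are tolerated.
  let CompleteModel = 0;
}
```

Once every valid instruction has scheduling information, the override can be dropped so that missing entries are caught again.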

That's perfect, thanks.

Thanks

Daniel Sanders
Leading Software Design Engineer, MIPS Processor IP
Imagination Technologies Limited
www.imgtec.com<http://www.imgtec.com/>

[1] I added support for the itineraries into the new MI scheduler because I realized that some out-of-tree backend maintainers may still want that level of precision. I'm not sure yet whether you fall into that category. The new machine model was designed for out-of-order processors, but I also think it is sufficient for most in-order models. I would like to establish the new machine model as the preferred choice because it is simpler and more efficient, it will be easier for most backend developers to bring up a new subtarget, and we will then eventually have more consistency across targets. I also selfishly want more good in-tree examples of the new model so it will effectively be better documented and supported.

I believe it is possible to handle special cases requiring the itinerary's precision without using an itinerary by either plugging custom logic into the MachineSchedStrategy, or extending the new machine model...

[2] To model in-order pipeline resources we could:

- add a field to MCWriteProcResEntry
  + unsigned DelayCycles;

- Modify the table gen code in SubtargetEmitter to record the delay.

  We already do this:
       // If this resource is already used in this sequence, add the current
       // entry's cycles so that the same resource appears to be used
       // serially, rather than multiple parallel uses. This is important for
       // in-order machine where the resource consumption is a hazard.

  But we could also add a delay to the resource cycles when the
  processor resource is unbuffered.

- The code in SchedBoundary::bumpNode and SchedBoundary::checkHazard
  needs to be updated to increment the cycle accounting for DelayCycles.

-Andy

-------------- next part --------------
A non-text attachment was scrubbed...
Name: MipsScheduleP5600.td
Type: application/octet-stream
Size: 12634 bytes
Desc: MipsScheduleP5600.td
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140128/76c69b5b/attachment.obj>