[llvm-commits] [PATCH] 64 functional units

Fri Jun 22 11:23:01 PDT 2012

On Jun 19, 2012, at 3:07 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> On Tue, 19 Jun 2012 12:33:10 -0700
> Andrew Trick <atrick at apple.com> wrote:
> 
>> 
>> On Jun 19, 2012, at 11:48 AM, Evan Cheng <evan.cheng at apple.com> wrote:
>> 
>>> Are you sure this is the right way to go? That's a lot of
>>> functional units and this change is probably increasing LLVM's
>>> memory footprint.
>>> 
>>> The InstrStage data structure is already poorly packed even before
>>> this change:
>>>   unsigned Cycles_;       ///< Length of stage in machine cycles
>>>   unsigned Units_;        ///< Choice of functional units
>>>   int NextCycles_;        ///< Number of machine cycles to next stage
>>>   ReservationKinds Kind_; ///< Kind of the FU reservation
>>> 
>>> We probably want to reduce the size of Cycles_ and NextCycles_ down
>>> to i16. This change is not helping. :-(
>> 
>>> On Jun 18, 2012, at 7:34 PM, Hal Finkel wrote:
>>>>> On Jun 13, 2012, at 7:04 AM, Hal Finkel wrote:
>>>>> 
>>>>>> Please review the attached patch which changes the datatype used
>>>>>> to hold the function-units bitmask from unsigned to uint64_t. In
>>>>>> order to describe some of the recent PowerPC chips (with all of
>>>>>> their relevant multi-stage pipelines), I need more than 32 FUs.
>> 
>> Hi Hal,
>> 
>> To address Evan's concerns I suggest...
>> 
>> 1) Explain why you really want to model more than 32 FUs in these
>> cores. The InstrStage descriptions are only needed for pipeline
>> resources that are guaranteed to generate a stall/pipeline bubble
>> when a conflict is present in the static schedule. Can you show that
>> modeling all of the types of FuncUnits actually improves performance?
> 
> Yes, it seems that way. On the other hand, there may be a more concise
> method.
> 
>> Just doing this for "completeness" is not a great justification,
> 
> I agree.
> 
>> since the ones that aren't included in the bit mask can be commented.
>> Also, sometimes using one funcunit implies another, so they can share
>> an itinerary unit.
> 
> Yes.
> 
> For concreteness, I've attached the preliminary itinerary that I've
> constructed for the POWER7 cores. As currently specified, it requires
> 34 functional units. This seems important for bottom-up scheduling,
> because while many of the pipelines have common dispatch stages, those
> stages forward the instructions into different, sometimes deep,
> pipelines. The point of modeling these is not only to get the relative
> latencies right, but also to avoid hazards from sharing the dispatch
> stages. If I can get the same expressive power with fewer functional
> units, I'll certainly be happy to use an alternate technique.
> 
> This itinerary is preliminary -- not because it needs more functional
> units ;) -- but because there needs to be better modeling of
> instructions which occupy (parts of) multiple pipelines simultaneously
> (such as the load/store with update instructions).
> 
> <PPCSchedulePwr7.td>

Ok. Sorry, I was terribly confused. I thought you were working on an embedded PPC core, not *the* POWER7 server chip.

Don't waste time measuring overhead. Please just revert your change to make Units uint64_t. You can still experiment with this on your own branch. But defining InstrStages is not the way we expect processors like POWER7 to be modeled in the scheduler.

I'll try to explain my understanding of the scheduler's machine model better...

I only see two constraints that I would model as "Hazards" in the MachineScheduler:

1) VMX floating point and simple and complex integer operations can only be dispatched to UQ0;
2) permute (PM), decimal floating point, and 128-bit store operations can only be dispatched to UQ1;

That requires two functional units!
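
To make the two-unit point concrete, here is a minimal C++ sketch of the two dispatch constraints as a bitmask. It is not LLVM's InstrStage code: `FuncUnitMask`, `OpClass`, and `allowedUnits` are hypothetical names, and letting every other operation go to either queue is an assumption made only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical: one bit per modeled functional unit. Two bits are enough
// for the two dispatch constraints, so a 32-bit mask has plenty of room.
using FuncUnitMask = uint32_t;
const FuncUnitMask UQ0 = 1u << 0;
const FuncUnitMask UQ1 = 1u << 1;

// Hypothetical operation classes, named after the two constraints.
enum class OpClass { VMXFloat, Integer, Permute, DecimalFloat, Store128, Other };

// Returns the set of queues an operation class may be dispatched to.
inline FuncUnitMask allowedUnits(OpClass C) {
  switch (C) {
  case OpClass::VMXFloat:
  case OpClass::Integer:
    return UQ0; // constraint 1: only UQ0
  case OpClass::Permute:
  case OpClass::DecimalFloat:
  case OpClass::Store128:
    return UQ1; // constraint 2: only UQ1
  case OpClass::Other:
    return UQ0 | UQ1; // assumption: unconstrained ops may use either queue
  }
  return 0;
}
```

With this encoding, `allowedUnits(OpClass::Permute)` yields the UQ1 bit only, and the widest mask ever produced uses just two bits.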

The following statement refers to instruction sequencing unit resources, not pipeline resources:

  All resources such as the renames and various queue entries must be
  available for the instructions in a group before the group can be
  dispatched.

The scheduler doesn't model those resources. The best you can do is ensure that you have groups of six instructions that can issue in the same cycle.

Reservation tables aren't really useful for a fully pipelined processor, and are simply wrong for an out-of-order processor. Two important points:

1) Pipelines that are fully pipelined don't need a reservation table (no InstrStage needed) because they don't block other instructions from issuing on the same pipe the following cycle.

2) Once the instructions are dispatched to the issue queues, there's no way for a reservation table to statically model what will happen in the highly dynamic out-of-order unit. Even if it were theoretically possible, I've never seen a microarchitecture spec that had enough information to do this. The scheduler would effectively have to simulate the OOO core *and* predict all variable latency operations. Not to mention the SMT issues that you have on this chip!
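
Point 1 can be illustrated with a toy reservation table. This is a sketch under stated assumptions, not the scheduler's hazard recognizer: `ReservationTable` and `blocksNextCycle` are hypothetical, and a unit is described only by how many consecutive cycles an instruction occupies it.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical reservation table: Busy[c] is a bitmask of the units
// reserved in cycle c.
class ReservationTable {
  std::vector<uint32_t> Busy;

public:
  // True if Unit is free for Occupancy consecutive cycles starting at Cycle.
  bool canIssue(unsigned Cycle, uint32_t Unit, unsigned Occupancy) const {
    for (unsigned C = Cycle; C < Cycle + Occupancy; ++C)
      if (C < Busy.size() && (Busy[C] & Unit))
        return false;
    return true;
  }

  // Reserve Unit for Occupancy consecutive cycles starting at Cycle.
  void issue(unsigned Cycle, uint32_t Unit, unsigned Occupancy) {
    if (Busy.size() < Cycle + Occupancy)
      Busy.resize(Cycle + Occupancy, 0);
    for (unsigned C = Cycle; C < Cycle + Occupancy; ++C)
      Busy[C] |= Unit;
  }
};

// Issue one instruction at cycle 0 and report whether a second instruction
// on the same unit is blocked at cycle 1.
inline bool blocksNextCycle(unsigned Occupancy) {
  const uint32_t Unit = 1u << 0;
  ReservationTable T;
  T.issue(0, Unit, Occupancy);
  return !T.canIssue(1, Unit, Occupancy);
}
```

A fully pipelined unit has an occupancy of one cycle, so `blocksNextCycle(1)` is false and the table never constrains the schedule. Only a unit that stays busy for several cycles, such as an unpipelined divider, produces a conflict the table can catch.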

Operand latencies and bypasses *are* important. You can define those directly without defining any InstrStages. That will be more precise and you should see some benefit from it.
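
As a rough sketch of what operand latencies and bypasses give the scheduler, the C++ below computes the earliest cycle at which an instruction can issue from its operands alone, with no reservation table. `OperandDep`, `readyCycle`, and `earliestIssue` are hypothetical names, and modeling a bypass as a fixed number of saved cycles is a simplifying assumption.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical description of one data dependence feeding an instruction.
struct OperandDep {
  unsigned ProducerIssueCycle; // cycle in which the producer issues
  unsigned DefLatency;         // cycles until the result is normally readable
  bool HasBypass;              // a forwarding path applies to this def/use pair
  unsigned BypassSavings;      // cycles saved when the bypass applies
};

// Cycle in which the value carried by this dependence becomes available.
inline unsigned readyCycle(const OperandDep &D) {
  unsigned Latency = D.DefLatency;
  if (D.HasBypass)
    Latency -= std::min(Latency, D.BypassSavings); // never go below zero
  return D.ProducerIssueCycle + Latency;
}

// An instruction can issue once its slowest operand is ready.
inline unsigned earliestIssue(const std::vector<OperandDep> &Deps) {
  unsigned Cycle = 0;
  for (const OperandDep &D : Deps)
    Cycle = std::max(Cycle, readyCycle(D));
  return Cycle;
}
```

For example, an operand produced at cycle 0 with latency 3 and another produced at cycle 1 with latency 2 and a one-cycle bypass are ready at cycles 3 and 2, so the consumer can issue at cycle 3.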

I'll add "ResourceUnits" to the itinerary so you can specify things like the number of load-store units (2). That will prevent the MachineScheduler from jamming too many loads or stores at one end of the schedule. I'll add that heuristic after adding ResourceUnits.
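
ResourceUnits does not exist yet, so the following C++ is only a guess at how a per-resource unit count could be used, not a description of the eventual feature. The struct layout and the helpers `minCycles` and `oversubscribed` are hypothetical, and each unit is assumed to be fully pipelined (one operation accepted per cycle).

```cpp
#include <cassert>
#include <vector>

// Hypothetical: how many identical units of one kind the core has,
// e.g. ResourceUnits{2} for two load-store units.
struct ResourceUnits {
  unsigned Count;
};

// Lower bound on schedule length imposed by this resource alone.
inline unsigned minCycles(unsigned NumOps, ResourceUnits R) {
  assert(R.Count > 0 && "resource must have at least one unit");
  return (NumOps + R.Count - 1) / R.Count; // ceiling division
}

// OpsPerCycle[c] is the number of ops using the resource in cycle c.
// True if some cycle asks for more units than exist.
inline bool oversubscribed(const std::vector<unsigned> &OpsPerCycle,
                           ResourceUnits R) {
  for (unsigned N : OpsPerCycle)
    if (N > R.Count)
      return true;
  return false;
}
```

With two load-store units, five memory operations need at least three cycles, and a schedule that places three of them in a single cycle is flagged as oversubscribed, which is the kind of clustering the heuristic is meant to avoid.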

I know you're still tuning on the existing PostRA scheduler. That's fine but I think you'll get more benefit from the MachineScheduler as it matures. Rather than adding features to PostRA scheduling, we want to simplify it to the point where it's minimal and conservative. With very little spilling and out-of-order execution, you shouldn't really need PostRA scheduling.

-Andy