[llvm-commits] [PATCH] 64 functional units

Hal Finkel hfinkel at anl.gov
Fri Jun 22 12:50:37 PDT 2012


On Fri, 22 Jun 2012 11:23:01 -0700
Andrew Trick <atrick at apple.com> wrote:

> On Jun 19, 2012, at 3:07 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> > On Tue, 19 Jun 2012 12:33:10 -0700
> > Andrew Trick <atrick at apple.com> wrote:
> > 
> >> 
> >> On Jun 19, 2012, at 11:48 AM, Evan Cheng <evan.cheng at apple.com>
> >> wrote:
> >> 
> >>> Are you sure this is the right way to go? That's a lot of
> >>> functional units and this change is probably increasing LLVM's
> >>> memory footprint.
> >>> 
> >>> The InstrStage data structure is already poorly packed even before
> >>> this change:
> >>>
> >>>   unsigned Cycles_;       ///< Length of stage in machine cycles
> >>>   unsigned Units_;        ///< Choice of functional units
> >>>   int NextCycles_;        ///< Number of machine cycles to next stage
> >>>   ReservationKinds Kind_; ///< Kind of the FU reservation
> >>> 
> >>> We probably want to reduce the size of Cycles_ and NextCycles_
> >>> down to i16. This change is not helping. :-(
> >> 
> >>> On Jun 18, 2012, at 7:34 PM, Hal Finkel wrote:
> >>>>> On Jun 13, 2012, at 7:04 AM, Hal Finkel wrote:
> >>>>> 
> >>>>>> Please review the attached patch which changes the datatype
> >>>>>> used to hold the function-units bitmask from unsigned to
> >>>>>> uint64_t. In order to describe some of the recent PowerPC
> >>>>>> chips (with all of their relevant multi-stage pipelines), I
> >>>>>> need more than 32 FUs.
> >> 
> >> Hi Hal,
> >> 
> >> To address Evan's concerns I suggest...
> >> 
> >> 1) Explain why you really want to model more than 32 FUs in these
> >> cores. The InstrStage descriptions are only needed for pipeline
> >> resources that are guaranteed to generate a stall/pipeline bubble
> >> when a conflict is present in the static schedule. Can you show
> >> that modeling all of the types of FuncUnits actually improves
> >> performance?
> > 
> > Yes, it seems that way. On the other hand, there may be a more
> > concise method.
> > 
> >> Just doing this for "completeness" is not a great justification,
> > 
> > I agree.
> > 
> >> since the ones that aren't included in the bit mask can be
> >> commented. Also, sometimes using one funcunit implies another, so
> >> they can share an itinerary unit.
> > 
> > Yes.
> > 
> > For concreteness, I've attached the preliminary itinerary that I've
> > constructed for the POWER7 cores. As currently specified, it
> > requires 34 functional units. This seems important for bottom-up
> > scheduling, because while many of the pipelines have common
> > dispatch stages, those stages forward the instructions into
> > different, sometimes deep, pipelines. The point of modeling these
> > is not only to get the relative latencies right, but also to avoid
> > hazards from sharing the dispatch stages. If I can get the same
> > expressive power with fewer functional units, I'll certainly be
> > happy to use an alternate technique.
> > 
> > This itinerary is preliminary -- not because it needs more
> > functional units ;) -- but because there needs to be better
> > modeling of instructions which occupy (parts of) multiple pipelines
> > simultaneously (such as the load/store with update instructions).
> > 
> > <PPCSchedulePwr7.td>
> 
> Ok. Sorry, I was terribly confused. I thought you were working on an
> embedded PPC core, not *the* POWER7 server chip.

Primarily, I am working with the A2 embedded core. But we also have a
bunch of POWER7 boxes here (that's mostly where I build, run the test
suite, etc.), and I thought it would be nice to have instruction
scheduling for that chip as well. 

> 
> Don't waste time measuring overhead. Please just revert your change
> to make Units uint64_t.

Fair enough; will do.
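
For reference, here is a minimal standalone sketch of the layout
trade-off Evan raised. These are toy structs that only mirror the
InstrStage field names, not LLVM's actual definition, and the sizes
assume a typical 64-bit ABI: widening Units_ to uint64_t pads the
struct from 16 to 24 bytes, while shrinking the cycle fields to i16
and putting Units_ first brings it back to 16.

  #include <cstdint>
  #include <cstdio>

  // Toy reproductions of the InstrStage layout, for sizeof comparison only.
  enum ReservationKinds { Required = 0, Reserved = 1 };

  struct StageCurrent {      // unsigned Units_: 4+4+4+4 = 16 bytes
    unsigned Cycles_;
    unsigned Units_;
    int NextCycles_;
    ReservationKinds Kind_;
  };

  struct StageWideUnits {    // uint64_t Units_: 8-byte alignment pads to 24
    unsigned Cycles_;
    uint64_t Units_;
    int NextCycles_;
    ReservationKinds Kind_;
  };

  struct StageRepacked {     // i16 cycle fields, Units_ first: 16 bytes again
    uint64_t Units_;
    int16_t Cycles_;
    int16_t NextCycles_;
    ReservationKinds Kind_;
  };

  int main() {
    std::printf("current %zu, wide %zu, repacked %zu\n",
                sizeof(StageCurrent), sizeof(StageWideUnits),
                sizeof(StageRepacked));
    return 0;
  }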

> You can still experiment with this on your
> own branch. But defining InstrStages is not the way we expect
> processors like POWER7 to be modeled in the scheduler.
> 
> I'll try to explain my understanding of the scheduler's machine model
> better...
> 
> I only see two constraints that I would model as "Hazards" in the
> MachineScheduler:
> 
> 1) VMX floating point and simple and complex integer operations can
>    only be dispatched to UQ0;
> 2) permute (PM), decimal floating point, and 128-bit store operations
>    can only be dispatched to UQ1;
> 
> That requires two functional units!
> 
> The following statement refers to instruction sequencing unit
> resources, not pipeline resources:
> 
>   All resources such as the renames and various queue entries must be
>   available for the instructions in a group before the group can be
>   dispatched.
> 
> The scheduler doesn't model those resources. The best you can do is
> ensure that you have groups of six instructions that can issue in the
> same cycle.

Top-down, I agree with you. But I think there are other constraints
to take into account, especially from a bottom-up perspective.

When scheduling bottom up, having the extra pipelines naturally tells
the scheduler how many instructions of each kind can complete in any
given cycle, and which instructions would have a resource conflict
because of the shared dispatch stages (see the sketch below). While it
is true that the instructions might not actually issue in the order
predicted, having a "realistic" instruction spacing should,
statistically, give better performance.

Is this the wrong way to look at it?
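
To make the resource-conflict point concrete, here is a toy
cycle-by-cycle scoreboard. This is a hand-rolled sketch, not LLVM's
ScoreboardHazardRecognizer, and it only models the "Required"
reservation kind: each cycle holds a bitmask of busy functional units,
and a stage stalls when every unit it could use is already taken. With
34 units that mask simply no longer fits in 32 bits.

  #include <cstdint>
  #include <vector>

  using UnitMask = uint64_t;  // 34 functional units need more than 32 bits

  struct ToyScoreboard {
    std::vector<UnitMask> Reserved;  // one mask of busy units per cycle

    explicit ToyScoreboard(unsigned Depth) : Reserved(Depth, 0) {}

    // Hazard if every unit this stage could use is already reserved
    // in the given cycle.
    bool hasHazard(unsigned Cycle, UnitMask UnitChoices) const {
      return (UnitChoices & ~Reserved[Cycle]) == 0;
    }

    // Otherwise reserve one free alternative (lowest free bit, for
    // simplicity; which one gets picked doesn't matter for the sketch).
    void reserve(unsigned Cycle, UnitMask UnitChoices) {
      UnitMask Free = UnitChoices & ~Reserved[Cycle];
      if (Free)
        Reserved[Cycle] |= Free & (~Free + 1);  // isolate lowest set bit
    }
  };

Instructions that compete for the same dispatch-stage units collide in
hasHazard above even when their execution pipes differ, which is the
spacing effect I mean.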

> 
> Reservation tables aren't really useful for a fully pipelined
> processor, and are simply wrong for an out-of-order processor. Two
> important points:
> 
> 1) Pipelines that are fully pipelined don't need a reservation table
> (no InstrStage needed) because they don't block other instructions
> from issuing on the same pipe the following cycle.
> 
> 2) Once the instructions are dispatched to the issue queues, there's
> no way for a reservation table to statically model what will happen
> in the highly dynamic out-of-order unit. Even if it were
> theoretically possible, I've never seen a microarchitecture spec that
> had enough information to do this. The scheduler would effectively
> have to simulate the OOO core *and* predict all variable latency
> operations. Not to mention the SMT issues that you have on this chip!

I am not so much worried about static prediction as I am about roughly
matching the internal constraints during bottom-up scheduling. I would
expect that top-down scheduling is all about group formation.

> 
> Operand latencies and bypasses *are* important. You can define those
> directly without defining any InstrStages. That will be more precise
> and you should see some benefit from it.

Okay, sounds good.

> 
> I'll add "ResourceUnits" to the itinerary so you can specify things
> like the number of load-store units (2).  That will prevent the
> MachineScheduler from jamming too many loads or stores at one end of
> the schedule. I'll add that heuristic after adding ResourceUnits.

Cool.

> 
> I know you're still tuning on the existing PostRA scheduler. That's
> fine but I think you'll get more benefit from the MachineScheduler as
> it matures. Rather than adding features to PostRA scheduling, we want
> to simplify it to the point where it's minimal and conservative. With
> very little spilling and out-of-order execution, you shouldn't really
> need PostRA scheduling.

Then I'll turn it off on the server chips once everything is in place.
It might still help for the embedded cores. Do you agree?

Thanks again,
Hal

> 
> -Andy



-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory


