[llvm-commits] [PATCH] 64 functional units

Tue Jun 19 15:07:44 PDT 2012

On Tue, 19 Jun 2012 12:33:10 -0700
Andrew Trick <atrick at apple.com> wrote:

> 
> On Jun 19, 2012, at 11:48 AM, Evan Cheng <evan.cheng at apple.com> wrote:
> 
> > Are you sure this is the right way to go? That's a lot of
> > functional units and this change is probably increasing LLVM's
> > memory footprint.
> > 
> > The InstrStage data structure is already poorly packed even before
> > this change:
> >  unsigned Cycles_;  ///< Length of stage in machine cycles
> >  unsigned Units_;   ///< Choice of functional units
> >  int NextCycles_;   ///< Number of machine cycles to next stage
> >  ReservationKinds Kind_; ///< Kind of the FU reservation
> > 
> > We probably want to reduce the size of Cycles_ and NextCycles_ down
> > to i16. This change is not helping. :-(
> 
> > On Jun 18, 2012, at 7:34 PM, Hal Finkel wrote:
> >>> On Jun 13, 2012, at 7:04 AM, Hal Finkel wrote:
> >>> 
> >>>> Please review the attached patch which changes the datatype used
> >>>> to hold the function-units bitmask from unsigned to uint64_t. In
> >>>> order to describe some of the recent PowerPC chips (with all of
> >>>> their relevant multi-stage pipelines), I need more than 32 FUs.
> 
> Hi Hal,
> 
> To address Evan's concerns I suggest...
> 
> 1) Explain why you really want to model more than 32 FUs in these
> cores. The InstrStage descriptions are only needed for pipeline
> resources that are guaranteed to generate a stall/pipeline bubble
> when a conflict is present in the static schedule. Can you show that
> modeling all of the types of FuncUnits actually improves performance?

Yes, it seems that way. On the other hand, there may be a more concise
method.

> Just doing this for "completeness" is not a great justification,

I agree.

> since the ones that aren't included in the bit mask can be commented.
> Also, sometimes using one funcunit implies another, so they can share
> an itinerary unit.

Yes.

For concreteness, I've attached the preliminary itinerary that I've
constructed for the POWER7 cores. As currently specified, it requires
34 functional units. This seems important for bottom-up scheduling,
because while many of the pipelines have common dispatch stages, those
stages forward the instructions into different, sometimes deep,
pipelines. The point of modeling these is not only to get the relative
latencies right, but also to avoid hazards from sharing the dispatch
stages. If I can get the same expressive power with fewer functional
units, I'll certainly be happy to use an alternate technique.
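
To make the 32-bit limit concrete, here is a minimal sketch of the
one-bit-per-unit mask arithmetic. The helper names (fuBit, fitsInMask,
conflicts) are made up for illustration and are not part of LLVM; the
only assumption is that each functional unit takes one bit in the mask.

```cpp
#include <cstdint>

// Hypothetical helper (not LLVM API): the mask bit for functional unit
// number `index`, assuming one bit per unit and index < 64.
constexpr uint64_t fuBit(unsigned index) {
  return uint64_t(1) << index;
}

// True when `numUnits` one-bit-per-unit entries fit in a mask that is
// `maskBits` bits wide.
constexpr bool fitsInMask(unsigned numUnits, unsigned maskBits) {
  return numUnits <= maskBits;
}

// True when a stage's choice-of-units mask overlaps the set of units
// that are already busy in the same cycle.
constexpr bool conflicts(uint64_t stageUnits, uint64_t busyUnits) {
  return (stageUnits & busyUnits) != 0;
}
```

With 34 units, fitsInMask(34, 32) is false and fitsInMask(34, 64) is
true; truncating fuBit(33) to 32 bits gives zero, so the last two units
would silently drop out of an unsigned mask.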

This itinerary is preliminary -- not because it needs more functional
units ;) -- but because there needs to be better modeling of
instructions which occupy (parts of) multiple pipelines simultaneously
(such as the load/store with update instructions).

> 
> FYI: I'm planning to introduce a new type of Resource in the
> itinerary that the scheduler can try to avoid oversubscribing without
> modeling them with a reservation table. i.e. they don't take a bit in
> the bit mask. But this is really for OOO cores and may not be
> relevant.
> 
> 2) Measure this increase in itinerary size. Can you report the
> increase in the size of the ARM tables?
> 
> 3) Measure scheduling time. Is there any noticeable impact on either
> SD or PostRA scheduling in ARM?
> 
> This shouldn't be too hard since you should be building ARM target
> anyway and you don't need to run the generated code. But if you need
> help let me know.

Good idea. I'll do these and post the results.
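
As a rough first estimate ahead of the real measurements, the per-stage
cost can be sketched with stand-in structs that mirror the fields Evan
listed. All the type names below are hypothetical (this is not the
actual InstrStage definition), the enum is a placeholder, and the i16
variant only illustrates his suggestion. Actual sizes depend on the
target ABI, so the sketch compares sizeof values rather than assuming
specific numbers.

```cpp
#include <cstddef>
#include <cstdint>

// Placeholder for the reservation-kind enum; the values are assumptions.
enum ReservationKind { Required = 0, Reserved = 1 };

// Layout with a 32-bit functional-unit mask (fields in the quoted order).
struct StageNarrow {
  unsigned Cycles;
  unsigned Units;
  int NextCycles;
  ReservationKind Kind;
};

// Same field order, but with the mask widened to 64 bits.
struct StageWide {
  unsigned Cycles;
  uint64_t Units;
  int NextCycles;
  ReservationKind Kind;
};

// Illustrative repacking: widest field first, cycle counts cut to 16 bits.
struct StagePacked {
  uint64_t Units;
  int16_t NextCycles;
  uint16_t Cycles;
  ReservationKind Kind;
};

// Bytes used by a table of `numStages` entries of size `stageSize`.
constexpr std::size_t tableBytes(std::size_t stageSize, std::size_t numStages) {
  return stageSize * numStages;
}
```

Comparing tableBytes(sizeof(StageNarrow), N) with
tableBytes(sizeof(StageWide), N), where N is the number of stages in a
target's itinerary tables, gives an upper bound on the growth; on
common 64-bit targets I would expect padding to make the wide layout
larger than the extra four bytes alone suggest, but that needs
checking against the real tables.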

Thanks again,
Hal

> 
> Thanks,
> Andy

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: PPCSchedulePwr7.td
Type: application/octet-stream
Size: 12798 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20120619/d248d8c6/attachment.obj>