[llvm-commits] [PATCH] 64 functional units

Andrew Trick atrick at apple.com
Fri Jun 22 13:26:49 PDT 2012


On Jun 22, 2012, at 12:50 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> On Fri, 22 Jun 2012 11:23:01 -0700
> Andrew Trick <atrick at apple.com> wrote:
>> 
>> I only see two constraints that I would model as "Hazards" in the
>> MachineScheduler:
>> 
>> 1) VMX floating point and simple and complex integer operations can
>> only be dispatched to UQ0;
>> 2) permute (PM), decimal floating point, and 128-bit store operations
>> can only be dispatched to UQ1;
>> 
>> That requires two functional units!
>> 
>> The following statement refers to instruction sequencing unit
>> resources, not pipeline resources:
>> 
>>  All resources such as the renames and various queue entries must be
>>  available for the instructions in a group before the group can be
>>  dispatched.
>> 
>> The scheduler doesn't model those resources. The best you can do is
>> ensure that you have groups of six instructions that can issue in the
>> same cycle.
> 
> Top down, I agree with you. But I think that there are other constraints
> to take into account, especially from a bottom-up perspective.
> 
> When scheduling bottom up, having the extra pipelines naturally tells
> the scheduler how many of which kind of instructions can complete in
> any given cycle, and which instructions would have a resource conflict
> because of the shared dispatch stages. While it is true that the
> instructions might not actually issue in the order predicted, having a
> "realistic" instruction spacing should, statistically, give better
> performance.
> 
> Is this the wrong way to look at it?

A "realistic" instruction spacing places fewer demands on the out-of-order unit. That's a good thing. To that end, operand latencies should help the scheduler. It may also help to know which resources tend to saturate. For example, you only have 2 Ld/St units, but could sustain 6 instrs/cycle. If we have 20 Ld/St instrs, it may not be a good idea to schedule them contiguously, because the issue queue may fill up (or some other resource may cause a stall). Instead we want to interleave some non-Ld/St or other high-latency operations if we have them.

A reservation table is a bad way to do this because it makes a strong assumption about what issues in each cycle. It either gives you a hard constraint that trumps all other heuristics, or misses the issue entirely. By trumping real constraints with fake ones, the reservation table hurts performance.
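The interleaving idea can be illustrated with a toy sketch (this is not LLVM code; the instruction tags and the 2-of-6 Ld/St-to-issue-width ratio are taken from the example above, and the greedy windowing is purely illustrative):

```python
def interleave(instrs, ldst_units=2, issue_width=6):
    """Toy heuristic: space Ld/St ops so that each issue-width window
    contains no more Ld/St instructions than there are Ld/St units.
    `instrs` is a list of "ldst"/"other" tags; a real scheduler would
    weigh this against operand latency and register pressure."""
    ldst = [i for i in instrs if i == "ldst"]
    other = [i for i in instrs if i != "ldst"]
    out = []
    while ldst:
        # One dispatch window: at most `ldst_units` memory ops ...
        out.extend(ldst[:ldst_units])
        del ldst[:ldst_units]
        # ... padded with non-memory ops, if any remain.
        fill = issue_width - ldst_units
        out.extend(other[:fill])
        del other[:fill]
    out.extend(other)  # leftover non-memory ops
    return out
```

With 6 loads/stores and 8 other instructions, every 6-wide window in the result contains at most 2 Ld/St ops instead of a contiguous run of 6.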

I don't quite understand the top-down vs. bottom-up problem. Regardless of how you get there, I think the goal of the scheduler is, in order of decreasing priority:

1) avoiding dispatch hazards
2) reducing register pressure if it's too high
3) balancing critical path against resource height

So either way, forming dispatch groups is the top priority and those groups get dispatched to pipelines top-down.

Whether top-down or bottom-up is a better way to achieve that comes down to the greedy nature of the scheduler. If you have more constraints at the bottom of the instruction stream, bottom-up is better, and vice versa.
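The priority ordering above can be sketched as a candidate-comparison function (a hypothetical illustration, not the actual MachineScheduler heuristic; the `Cand` fields and `pick` helper are invented for this example):

```python
from typing import NamedTuple

class Cand(NamedTuple):
    """Hypothetical per-candidate summary; fields are illustrative."""
    causes_hazard: bool   # would picking this create a dispatch hazard?
    pressure_delta: int   # change in register pressure if picked
    height: int           # critical-path height (cycles to exit)

def pick(cands, pressure_is_high):
    """Greedy pick in decreasing priority:
    1) avoid dispatch hazards,
    2) reduce register pressure when it is too high,
    3) balance critical path / resource height."""
    return min(cands, key=lambda c: (
        c.causes_hazard,                              # False sorts first
        c.pressure_delta if pressure_is_high else 0,  # only when high
        -c.height,                                    # prefer critical path
    ))
```

Because the key is a lexicographically compared tuple, a lower-priority criterion only breaks ties among candidates equal on all higher-priority ones, which matches the "in order of decreasing priority" framing.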

>> Reservation tables aren't really useful for a fully pipelined
>> processor, and are simply wrong for an out-of-order processor. Two
>> important points:
>> 
>> 1) Pipelines that are fully pipelined don't need a reservation table
>> (no InstrStage needed) because they don't block other instructions
>> from issuing on the same pipe the following cycle.
>> 
>> 2) Once the instructions are dispatched to the issue queues, there's
>> no way for a reservation table to statically model what will happen
>> in the highly dynamic out-of-order unit. Even if it were
>> theoretically possible, I've never seen a microarchitecture spec that
>> had enough information to do this. The scheduler would effectively
>> have to simulate the OOO core *and* predict all variable latency
>> operations. Not to mention the SMT issues that you have on this chip!
> 
> I am not so much worried about static prediction as I am about roughly
> matching the internal constraints during bottom-up scheduling. I would
> expect that top-down scheduling is all about group formation.
> 
>> 
>> Operand latencies and bypasses *are* important. You can define those
>> directly without defining any InstrStages. That will be more precise
>> and you should see some benefit from it.
> 
> Okay, sounds good.
> 
>> 
>> I'll add "ResourceUnits" to the itinerary so you can specify things
>> like the number of load-store units (2).  That will prevent the
>> MachineScheduler from jamming too many loads or stores at one end of
>> the schedule. I'll add that heuristic after adding ResourceUnits.
> 
> Cool.
> 
>> 
>> I know you're still tuning on the existing PostRA scheduler. That's
>> fine but I think you'll get more benefit from the MachineScheduler as
>> it matures. Rather than adding features to PostRA scheduling, we want
>> to simplify it to the point where it's minimal and conservative. With
>> very little spilling and out-of-order execution, you shouldn't really
>> need PostRA scheduling.
> 
> Then I'll turn it off on the server chips once everything is in place.
> It might still help for the embedded cores. Do you agree?


In any case, we need to develop an alternative to spill code placement first. My point was not to drop it yet, but that aggressive tuning here may not be very productive in the long term.

Yes, even long term, you want to rerun postRA scheduling for an in-order CPU. But hopefully it doesn't have to be as aggressive as it is now.

-Andy
