[LLVMdev] Cheap OOO cycle estimation (was [RFC] Iterrative compilation framework for Clang/LLVM)

Wed Jan 1 17:43:21 PST 2014

On Jan 1, 2014, at 4:41 PM, Hal Finkel <hfinkel at anl.gov> wrote:

> ----- Original Message -----
>> From: "Andrew Trick" <atrick at apple.com>
>> To: "Hal Finkel" <hfinkel at anl.gov>
>> Cc: "Radovan Obradovic" <Radovan.Obradovic at imgtec.com>, "llvmdev at cs.uiuc.edu Dev" <llvmdev at cs.uiuc.edu>
>> Sent: Monday, December 23, 2013 4:41:00 PM
>> Subject: Re: [LLVMdev] [RFC] Iterrative compilation framework for Clang/LLVM
>> 
>> 
>> 
>> 
>> On Dec 16, 2013, at 4:26 PM, Hal Finkel < hfinkel at anl.gov > wrote:
>> 
>> 
>> 
>> 
>> At the end of each iteration, quality of generated code is estimated
>> by
>> executing newly introduced target dependent pass. Based on results
>> path for
>> the following iteration is calculated. At the moment, this has been
>> proved
>> for MIPS only and it is based on code size.
>> 
>> I think that, in general, we ought to be able to get an overall
>> latency estimate from the instruction scheduler.
>> 
>> This is somewhat tangential to the thread… I think it would be great
>> to have a static performance estimator. I’m thinking of an MI-level
>> library that simulates a loop using the machine model until it
>> settles on an average number of cycles per iteration. A simple tool
>> could handle the most probable path through the loop. It could be
>> extended to handle averaging across a small number of paths if
>> needed. Estimating in-order cycles is easy enough. I also developed
>> simplified model for estimating OOO core cycles
> 
> Can you describe the model? I'm curious how you did this.

Here is a rough sketch of the simulation, if you can call it that. This is not a pipeline simulator. Instead, we're estimating the impact of saturating OOO resources on the schedule. We can get some idea of how many loop iterations before pipeline stalls occur. If OOO resources are not saturated, then we're left with two cycle counts: first the time to decode and dispatch the micro-ops, second the total latency of scheduled instructions.

(I'm writing this up from my notes, so I won't claim correctness. This is just to give you the idea.)

** Keep track of three points on a timeline, in increasing order:

CurrCycle = (DispatchedMOps / Width) + Stalls
  "Cycles to decode and dispatch scheduled instructions
   into reservation stations, plus guaranteed pipeline stalls."

ExecutionCycles = max ResourceCycles of ready instructions over all resources.
  "Minimum cycles to execute ready instructions."

CriticalCycles = max ResourceCycles of scheduled instructions over all resources.
  "Minimum cycles to execute all scheduled instructions."

** Some state for book-keeping:

  BufferedMOps "queue of buffered micro-ops waiting to retire"

  BufferedResOps[NRes] "queue of buffered micro-ops waiting to execute"

  Executed "count of executed but not retired micro-ops"

  Retired "count of retired micro-ops"

** Very simple modeling of the reorder buffer/register rename pool:

*** After scheduling an instruction, I:
  if ReadyCycle(I) <= CurrCycle:
    ++Retired
  else
    BufferedMOps.push(ReadyCycle(I))

*** After bumping CurrCycle
  for T in BufferedMOps if T <= CurrCycle:
    ++Retired
    BufferedMOps.erase(T)

** Model the reservation stations:

*** After scheduling an instruction, I:
  if ReadyCycle(I) <= ExecutionCycles:
    ++Executed
    ResourceCycles[R] += CyclesPerResource(R)
    ExecutionCycles = max ResourceCycles[R], ExecutionCycles
  else for R in Resources(I):
    BufferedResOps[R].push(ReadyCycle(I))

*** After bumping ExecutionCycles:
  for T in BufferedResOps[R] if T <= ExecutionCycles:
    ++Executed
    ResourceCycles[R] += CyclesPerResource(R)
    ExecutionCycles = max ResourceCycles[R], ExecutionCycles

** Check for stalls

CurrCycles is incremented whenever we detect stalls that result from reaching the limit of the MicroOpBufferSize or the BufferSize of one of the processor resources, as described in the machine model.

ExecutionCycles is updated in a feedback loop. Each time an instruction becomes ready, it adds to the cycles needed to exececute based on the instruction's resources. ExecutionCycles is the maximum cycles to execute instructions based on some saturated resource. When ExecutionCycles increases, more instructions become ready, and so on.

CriticalCycles is just an estimate of the total latency to execute. It's a useful metric but doesn't factor into the simulation.

Lastly, we can factor in the overal impact of stalls by maintaining the invariant at all times:

CurrCycles <= ExecutionCycles <= CriticalCycles

-Andy
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140101/a4aef419/attachment.html>