<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On Oct 25, 2017, at 10:50 AM, Stefan Pintilie <<a href="mailto:stefanp@ca.ibm.com" class="">stefanp@ca.ibm.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><font size="2" face="sans-serif" class="">Hi, </font><br class=""><br class=""><font size="2" face="sans-serif" class="">I've been trying to determine how to
properly set the MicroOpBufferSize parameter. Based on the documentation
in the header file:<br class=""> // "> 1" means the processor is out-of-order. This
is a machine independent</font><br class=""><font size="2" face="sans-serif" class=""> // estimate of highly machine
specific characteristics such as the register</font><br class=""><font size="2" face="sans-serif" class=""> // renaming pool and reorder
buffer.</font><br class=""><font size="2" face="sans-serif" class="">Power 9 is out-of-order and so it makes
sense to use a value greater than 1. However, we don't quite have a reorder
buffer in the theoretical sense where we can pick any instruction in the
buffer to dispatch next. As a result, I'm looking for a better understanding
of how this parameter is used in the scheduler.</font><br class=""><br class=""><font size="2" face="sans-serif" class="">I've noticed that the only place where
it matters whether the value of MicroOpBufferSize is 2 or 200 is in GenericScheduler::checkAcyclicLatency().
In other places in the code we only care if we have values that fall in
one of the three categories: 0, 1, or > 1. </font><br class=""><font size="2" face="sans-serif" class="">Here is my understanding of what that
function does; please correct me if I am wrong. </font><br class=""></div></blockquote><div><br class=""></div><div>MicroOpBufferSize is a confusing machine parameter. Most backends only</div><div>need to tell the generic machine scheduler whether to do strict VLIW-style</div><div>in-order (=0), heuristic in-order (=1), or out-of-order (>1). We</div><div>should have some kind of alias for those modes so that backends don't</div><div>need to explicitly mention the buffer size (patches please).</div><div><br class=""></div><div>As far as picking the buffer size value for OOO hardware, it is meant</div><div>to estimate very roughly the number of instructions in flight before</div><div>the CPU either runs out of physical registers or fills some queue of</div><div>instructions waiting for retirement (the reorder buffer).</div><div><br class=""></div><div>The generic machine scheduler doesn't try to simulate the reorder</div><div>buffer because it is pointless to do so within the scope of a single</div><div>block on any modern hardware. The only time I've seen a case where the</div><div>compiler could determine that OOO resources would be overutilized is</div><div>in a floating-point loop kernel characterized by long-latency operations</div><div>and interesting cyclic dependencies.</div><div><br class=""></div><div>Aside from the one heuristic where it's used, some backends may just</div><div>want to express this level of detail about their microarchitecture</div><div>for documentation purposes and future use. I imagine a separate static</div><div>performance tool that *would* simulate the reservation stations and</div><div>reorder buffer. It would be nice if someone would write such a thing</div><div>based on LLVM.</div><div><br class=""></div><div>Back to the scheduler... 
the generic machine scheduler has two purposes:</div><div><br class=""></div><div>(1) A quick way to bootstrap a backend with at least some reasonable</div><div>(hopefully do-no-harm) scheduling.</div><div><br class=""></div><div>(2) An example of how to assemble the pieces of the scheduling</div><div>infrastructure into a custom scheduler, with in-tree testing of those</div><div>shared pieces.</div><div><br class=""></div><div>You're far beyond the point of using the generic scheduler as a black</div><div>box. I suggest running benchmarks with and without the</div><div>checkAcyclicLatency heuristic to see which loops it helps or hurts, and</div><div>possibly writing your own heuristic here.</div><div><br class=""></div><blockquote type="cite" class=""><div class=""><br class=""><font size="2" face="sans-serif" class="">We can have two critical paths through
a loop. One is cyclic (use-def chains cross loop iterations) and one is acyclic
(everything can be computed in one iteration). If the acyclic path is greater
than the cyclic path by more than the size of the instruction buffer, then
we have to make sure that we don't run out of instructions to dispatch
in the instruction buffer. So, we set a flag Rem.IsAcyclicLatencyLimited
and then tryCandidate checks this flag when making decisions. </font><br class=""><font size="2" face="sans-serif" class="">Is this correct? </font><br class=""><font size="2" face="sans-serif" class="">If that is what we are doing here, then
I think the proper value for this parameter is the full size of the instruction
buffer on the P9 hardware. All we really care about is whether or not we
will run out of instructions to dispatch while waiting for an instruction.
We don't really care about HOW the out-of-order dispatch is done as long
as we don't run out of instructions. </font><br class=""><font size="2" face="sans-serif" class="">Does my logic make sense?</font><br class=""><br class=""><font size="2" face="sans-serif" class="">I want to set the parameter but I want
to make sure I understand what is going on. </font><br class=""><br class=""><font size="2" face="sans-serif" class="">Thank you, </font><br class=""><font size="2" face="sans-serif" class="">Stefan </font><br class=""></div></blockquote></div><div class=""><div><br class=""></div><div>Yes, that's exactly the intention, but I need to clarify one</div><div>point. It's very unlikely that the code within a single iteration</div><div>would exceed the buffer size. Instead, we're looking for cases where</div><div>the OOO engine will be able to speculatively execute *many* loop</div><div>iterations before retiring some instructions from the first</div><div>iteration. These are the cases where I've seen the OOO engine running</div><div>out of resources. The heuristic estimates the number of iterations that</div><div>can be speculatively executed in the "shadow" of the loop's acyclic</div><div>path. Then it basically multiplies the number of in-flight iterations</div><div>by the size of the loop.</div><div><br class=""></div><div>/// CyclesPerIteration = max( CyclicPath, Loop-Resource-Height )</div><div>/// InFlightIterations = AcyclicPath / CyclesPerIteration</div><div>/// InFlightResources = InFlightIterations * LoopResources</div><div><br class=""></div><div>Sorry, the implementation of the math is really hard to understand</div><div>because of all the scaling that happens so that heuristics can compute</div><div>everything using integers (no floating-point allowed in the</div><div>compiler). I realize the comments aren't quite adequate.</div><div><br class=""></div><div>The comment above refers to "resources" but the only resource that's</div><div>currently modeled here is the number of micro-ops being issued. 
You</div><div>could conceivably model the per-functional-unit buffer sizes here, but</div><div>I don't think those values are currently used at all.</div><div><br class=""></div><div>If the number of in-flight micro-ops exceeds the buffer size, then</div><div>it's important to hide latency within the loop. When the buffer runs</div><div>out, it stalls the processor front-end, so you want to have already</div><div>issued as many independent long-latency instructions as possible when</div><div>that happens.</div></div><div class=""><br class=""></div><div class="">-Andy</div></body></html>