[LLVMdev] Introducing a branch optimization and prediction pass
edwintorok at gmail.com
Mon Mar 31 07:01:54 PDT 2008
Evan Cheng wrote:
> On Mar 29, 2008, at 6:02 AM, Török Edwin wrote:
>> I would like to transform unpredictable conditional branches with a
>> small body into cmov instructions to reduce branch miss penalty.
>> LLVM generates cmov/setcc instructions when SelectInst is used. The
>> is to transform regular branches into selects, when that is possible
> LLVM is already aggressively turning branches into selects as far as I
> can see.
Does this happen in Codegen?
I tried some simple examples a while ago, and IIRC llvm missed some
opportunities to use cmov.
>> However this needs branch prediction info, and it seems LLVM doesn't
>> provide that AFAICT.
>> Could gcc's branch predictor be used here, or are there plans for an
>> llvm-specific branch predictor?
> Sure, either the FE or some earlier profiling pass should provide some
> branch predication info.
Ok. At least one of the above should be implemented before this
optimization can have a real use.
However for experimentation we could try to apply this optimization
assuming each branch in unpredictable.
At least we'll get an idea of what is the worst performance regression
that could happen if branch prediction info is wrong ;)
> Ok. You want to speculatively execute the code and only writes back
> the result if the predicate is true, right?
Yes, I speculatively execute the code.
However the result is always written, because I cannot use 'cmov' to
write to memory (just registers).
So the solution proposed in the AMD manual, is to use cmov to select the
address to write to. If predicate is false, it is written to a dummy
address on the stack:
cmp eax, ebx ; a < b ?
cmovge edi, esi ; ptr = (a >= b) ? &dummy : &c[i]
cmovl ecx, edx ; a < b ? i : i + 1
mov [edi], eax ; *ptr = a
IIRC gcc-4.2 didn't generate cmov for this case.
I'll write some simple examples, build with llvm, gcc-4.2, gcc-4.3, and
icc, then measure speeds. Last time I did that on a simple example, and
the result was 200 Mb/s vs. 1Gb/s.
Then apply this optimization (either by hand or using a small prototype)
and see if there are any significant improvements.
> If there are any instructions with side effects then this cannot be
>> * determine if branch is statically/dynamically predictable:
>> * statically: if compiler can predict branch, don't apply this
>> * dynamically: can branch be predicted based on execution history
>> (i.e. can CPU predict it?)
> Based on profiling data from previous runs? Or dynamic profiling in
> JIT situation?
Good idea. However in absence of profiling info there should be some
heuristics, I am not sure what that could be ATM.
>> * how to determine if a branch's body is small?
>> if execution_cost_of(body) +
>> execution_cost_of(cmov)/branch_miss_probability < branch_miss_penalty,
>> or simply: execution_cost_of(body) < branch_miss_penalty
>> If it is bigger, then it is faster if we take the miss penalty (and
>> don't execute the body)
>> But miss-penalties, and execution_costs are target specific, so use
>> TargetData here (and teach it about penalties), or
>> use some heuristic based on number of instruction in basic block?
> Sounds reasonable. But it's potentially more complicated than this for
> a target like x86 that have very few registers. The speculatively
> executed code will clobber registers so it has significant cost if the
> result of computation is not written back.
I experimented on x86-64 a while ago, that doesn't have such a shortage
of registers, and results were promising.
I have some code that processes some data in a loop, but due to branch
misspredictions, it can "only" work at ~200 Mb/s.
Eliminating the branches could speed it up to 1Gb/s or more, since there
are no other penalties in that code (L1 data cache misses are rare, the
only bottleneck is branch misprediction).
I'll redo my tests on x86 too (see above), and maybe we can come up with
an empiric formula to use for x86.
Something like associating a cost with register pressure increase?
>> * if we cannot transform all instructions from the basic block into
>> instr+cmov, then it is not worth applying this optimization (since
>> have a branch anyway).
> Sure. It might be worthwhile to extend the if-converter to do this.
> The first step is to turn it on for x86 to convert blocks which are
> entirely move's.
That would mean to apply this optimization on machine-basic-blocks, right?
I was thinking of a generic llvm IR optimization pass, but maybe
machine-basic-blocks pass is better, since we are doing something very
specific for targets.
>> Does Codegen guarantee to always transform Select into CMOV/SETCC (or
>> equivalent), on targets that support it?
>> * memory ordering/multi-thread safety:
> That's something to worry about later. :-)
>> If we unconditionally read from slow_filter_table it could actually
>> reduce the performance (assume slow_filter_table is huge), and might
>> be legal because we violate
>> the short-circuit evaluation.
> Right, speculative loads would require much more analysis to ensure
> safety. This might be something you don't want to do in step 1.
Makes sense, first implementation shouldn't break too much ;)
>> (BTW, the branch predictor should say it is predictable, and we
>> apply the transformation at all, but we cannot rely on the branch
>> predictor to make
>> short-circuit eval guarantees).
> Adding branch prediction info is obviously useful even if speculation
> is not performed. There are always CFG optimization passes (any
> perhaps isel passes) that can make use of them. I am not sure whether
> speculation can actually have a positive impact on x86. It's something
> that requires some experimentation if you are interested.
I have a prototype of this optimization (without branch prediction), and
I can use that to experiment.
However it only handles some very simple cases, and I should rewrite it
to be more generic.
More information about the llvm-dev