[llvm-dev] Adding support for self-modifying branches to LLVM?

Philip Reames via llvm-dev llvm-dev at lists.llvm.org
Thu Jan 21 14:09:18 PST 2016



On 01/21/2016 01:51 PM, Sean Silva wrote:
>
>
> On Thu, Jan 21, 2016 at 1:33 PM, Philip Reames 
> <listmail at philipreames.com <mailto:listmail at philipreames.com>> wrote:
>
>
>
>     On 01/19/2016 09:04 PM, Sean Silva via llvm-dev wrote:
>>
>>     AFAIK, the cost of a well-predicted, not-taken branch is the same
>>     as a nop on every x86 made in the last many years. See
>>     http://www.agner.org/optimize/instruction_tables.pdf
>>     Generally speaking, a correctly-predicted not-taken branch is
>>     basically identical to a nop, and a correctly-predicted taken
>>     branch has an extra overhead similar to an "add" or other
>>     extremely cheap operation.
>     Specifically on this point only: While absolutely true for most
>     micro-benchmarks, this is less true at large scale.  I've
>     definitely seen removing highly predictable branches (in many,
>     many places, some of which are hot) improve performance in the
>     5-10% range.  For instance, removing highly predictable branches
>     is the primary motivation of implicit null checking.
>     (http://llvm.org/docs/FaultMaps.html).  Where exactly the
>     performance improvement comes from is hard to say, but,
>     empirically, it does matter.
>
>     (Caveat to the above: I have not run an experiment that actually
>     put in the same number of bytes of nops.  It's possible the entire
>     benefit I mentioned is code-size related, but I doubt it given how
>     many ticks a sampling profiler will show on said branches.)
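(For concreteness, the shape of the transform I have in mind is roughly the
userspace sketch below.  It's my own illustration, not what the FaultMaps
machinery actually generates; the real thing recovers through the fault map
rather than a longjmp, and dereferencing null is UB in C, so treat it as
strictly illustrative of which branch disappears.)

/* Rough sketch of an implicit null check (mine, Linux/x86 signal behavior
 * assumed; not the FaultMaps implementation). */
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <stdio.h>

static sigjmp_buf recover;

static void on_segv(int sig) {
    (void)sig;
    siglongjmp(recover, 1);            /* resume at the "was null" path */
}

/* Explicit form: the compare-and-branch we'd like to remove. */
__attribute__((noinline))
static int load_explicit(const int *p) {
    if (p == NULL)                     /* highly predictable, almost never taken */
        return -1;
    return *p;
}

/* Implicit form: no branch at all; the hardware fault *is* the null check. */
__attribute__((noinline))
static int load_implicit(const int *p) {
    if (sigsetjmp(recover, 1))
        return -1;                     /* reached only if the load below faulted */
    return *(const volatile int *)p;   /* volatile: force the actual load */
}

int main(void) {
    signal(SIGSEGV, on_segv);
    int x = 42;
    printf("non-null: %d %d\n", load_explicit(&x), load_implicit(&x));
    printf("null:     %d %d\n", load_explicit(NULL), load_implicit(NULL));
    return 0;
}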
>
>
> Interesting. Another possible explanation is that these extra branches 
> cause contention on branch-prediction resources.
I've heard and proposed this explanation in the past as well, but I've 
never heard of anyone able to categorically answer the question.

The other explanation I've considered is that the processor has a finite 
speculation depth (i.e. a limit on how many predicted branches can be in 
flight), and the extra branches prevent the processor from speculating 
the "interesting" branches because the speculation resources are full of 
uninteresting ones.  However, my hardware friends tell me this is a 
somewhat questionable explanation, since the check branches should be 
easy to satisfy and retire quickly.
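
If anyone wants to poke at this, the kind of experiment I have in mind is
roughly the sketch below.  It's my own and entirely unscientific; the names
and the -DCHECKS knob are just for illustration, and you'd want to inspect
the generated assembly to make sure the compiler kept both branches as
actual branches.

/* Time a loop whose "interesting" branch is data-dependent and hard to
 * predict, with and without a pile of highly predictable, never-taken
 * "check" branches interleaved.  Build with -DCHECKS=1 and -DCHECKS=0
 * and compare. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#ifndef CHECKS
#define CHECKS 1
#endif

#define N (1 << 22)

int main(void) {
    static uint8_t data[N];
    for (int i = 0; i < N; i++)
        data[i] = (uint8_t)rand();

    volatile int guard = 0;            /* volatile: keep the checks from folding away */
    uint64_t sum = 0;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++) {
#if CHECKS
        /* "check" branches: trivially predictable, never taken */
        if (guard) sum += 1;
        if (guard) sum += 2;
        if (guard) sum += 3;
        if (guard) sum += 4;
#endif
        /* "interesting" branch: ~50% taken, hard to predict */
        if (data[i] & 1) {
            sum += data[i];
            __asm__ volatile("" ::: "memory");   /* discourage if-conversion to cmov */
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("sum=%llu  time=%.2f ms\n", (unsigned long long)sum, ms);
    return 0;
}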

> In the past when talking with Dan about WebAssembly sandboxing, IIRC 
> he said that they found about 15% overhead, due primarily to 
> branch-prediction resource contention.
15% seems a bit high to me, but I don't have anything concrete to share 
here unfortunately.
> In fact I think they had a pretty clear idea of wanting a new 
> instruction which is just a "statically predict never taken and don't 
> use any branch-prediction resources" branch (this is on x86 IIRC; some 
> arches actually have such an instruction!).
This has been on my wish list for a while.  It would make many things so 
much easier.

The sickly amusing bit is that x86 has two different forms of this, 
neither of which actually works (a rough sketch of both follows below):
1) There are prefixes for branches which are supposed to control the 
prediction direction.  My understanding is that code which tried using 
them was so often wrong that modern chips treat them as nop padding.  We 
actually use this to produce near arbitrary-length nops.  :)
2) x86 (but not x86-64) has an "into" instruction which triggers an 
interrupt if the overflow flag is set.  (Hey, signal handlers are just 
weird branches, right? :p)  However, this does not work as designed on 
x86-64.  My understanding is that the original AMD implementation had a 
bug in this instruction, and the bug essentially got written into the 
spec for all future chips.  :(
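
For reference, here is a rough sketch of both forms.  It's my own
illustration (GCC/Clang inline asm), not anything LLVM emits in this shape;
the 0x2E/0x3E bytes are the legacy not-taken/taken hint prefixes, and the
"into" part only assembles in 32-bit mode.

#include <stdio.h>

int main(void) {
    int x = 0;

    /* (1) Legacy branch hint prefixes: 0x2E = "not taken", 0x3E = "taken".
     * Modern cores ignore them, so this behaves exactly like a plain je. */
    __asm__ volatile(
        "cmpl $1, %0\n\t"
        ".byte 0x2e\n\t"               /* CS override, historically "predict not taken" */
        "je 1f\n\t"                    /* x != 1, so never taken */
        "addl $1, %0\n"
        "1:"
        : "+r"(x) : : "cc");

#if defined(__i386__)
    /* (2) into: raises the overflow exception (#OF) if OF is set; it is an
     * invalid opcode in 64-bit mode.  Adding zero leaves OF clear, so this
     * falls through harmlessly here. */
    __asm__ volatile(
        "addl $0, %0\n\t"
        "into"
        : "+r"(x) : : "cc");
#endif

    printf("x = %d\n", x);             /* prints 1 */
    return 0;
}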

Philip