<div dir="ltr"><div>Peter,</div><div><br></div>Interesting idea. One small extension that can be done is to allow the optimization to work on a value range. For instance, if the argument type is has know limited range (enum for instance), you can still optimize it when the argument is not a constant via proper bit mapping/shifting. In your example, the following is also devirtualizable.<div><br></div><div><div>bool g(Base *b, Type t) {</div><div>  return b->isOfType(t);</div><div>}</div></div><div><br></div><div>David</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Jan 27, 2016 at 7:57 PM, Peter Collingbourne via llvm-dev <span dir="ltr"><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi all,<br>

<br>

I'd like to make the following proposal to implement an optimization called<br>

virtual constant propagation.<br>

<br>

==Introduction==<br>

After enabling control flow integrity protection in Chromium, we have<br>

observed an unacceptable performance regression in certain critical layout<br>

microbenchmarks. Profiling [0] revealed that the cause of the regression was<br>

a large number of virtual calls, each requiring the overhead of a control<br>

flow integrity check.<br>

<br>

We took a closer look at the hottest virtual call sites, and found that they<br>

were mostly type checking functions of the following form:<br>

<br>

struct Base {<br>

  enum Type { kBase, kDerived };<br>

  virtual bool isDerived() const { return false; } // or:<br>

  virtual bool isOfType(Type t) const { return t == kBase; }<br>

};<br>

struct Derived : Base {<br>

  virtual bool isDerived() const { return true; }<br>

  virtual bool isOfType(Type t) const { return t == kDerived; }<br>

};<br>

<br>

bool f(Base *b) {<br>

  return b->isDerived();<br>

}<br>

bool g(Base *b) {<br>

  return b->isOfType(Base::kDerived);<br>

}<br>

<br>

We can make the observation that if the type hierarchy is known to be closed<br>

and the definition of each virtual function and each virtual call is visible<br>

to the compiler, calls to either of these functions can be implemented more<br>

efficiently by loading directly from the vtable (we refer to this as virtual<br>

constant propagation):<br>

<br>

struct Base_VTable {<br>

  bool isDerived : 1, isOfTypeBase : 1, isOfTypeDerived : 1;<br>

};<br>

Base_VTable base_vtable{false, true, false};<br>

Derived_VTable derived_vtable{true, false, true};<br>

bool f(Base *b) {<br>

  return b->vtable->isDerived;<br>

}<br>

bool g(Base *b) {<br>

  return b->vtable->isOfTypeDerived;<br>

}<br>

<br>

Another advantage of implementing the calls this way is that because there<br>

is no indirect call taking place, no control flow integrity check is necessary.<br>

<br>

I implemented a prototype of this idea that specifically targeted the hottest<br>

virtual function in one of Chromium’s layout benchmarks, and observed a<br>

performance improvement of median 1.62%, min -4.72%, max 204.52% for the<br>

layout benchmark suite. I’d like to emphasise that these numbers are from<br>

a regular LTO build without CFI, and that we expect to see better performance<br>

from a general implementation that targets all virtual functions.<br>

<br>

==User interface==<br>

To instruct the compiler to assume that all type hierarchies are closed, a<br>

user can pass the -fwhole-program-vtables flag. The -fwhole-program-vtables<br>

flag requires the -flto flag to also be specified.<br>

<br>

Of course, there may be some type hierarchies that are not entirely closed, but<br>

the underlying assumption is that most hierarchies will not be. To support open<br>

hierarchies the user can also specify the path to a blacklist that lists the<br>

names of the open class hierarchies using a -fwhole-program-vtables-blacklist=<br>

flag. By default, the compiler will assume that hierarchies in the std<br>

namespace are open.<br>

<br>

==High level design==<br>

Devirtualization will take place at LTO time. Candidates for virtual constant<br>

propagation will be found by searching for call sites with all-constant<br>

arguments to virtual functions which in all derived classes are readnone,<br>

which means that it is a pure function of its arguments. For each (virtual<br>

function, arguments) candidate pair, the compiler will compute a offset in<br>

bits (positive or negative) from the class’s vtable address point in which<br>

to store the function result, extend the size of any vtable objects to cover<br>

any such offsets, evaluate the function to compute a result and store it<br>

at the correct offsets. To avoid breaking ABI, the vtable layout will not<br>

be changed in any way. This includes pointers to optimized functions in the<br>

vtable, and (for Itanium ABI) the distance between address points within a<br>

single vtable symbol.<br>

<br>

For example, the vtable for Base and Derived will look like this:<br>

<br>

struct Base_VTable {<br>

  bool isDerived : 1, isOfTypeBase : 1, isOfTypeDerived : 1;<br>

  size_t padding : 61;<br>

  size_t offset_to_top;<br>

  size_t rtti;<br>

  bool (*isDerived)(const Base *this);<br>

  bool (*isOfType)(const Base *this, Type t);<br>

};<br>

Base_VTable base_vtable{<br>

  false, true, false, 0, 0, 0, &Base::isDerived, &Base::IsOfType<br>

};<br>

Base_VTable derived_vtable{<br>

  True, false, true, 0, 0, 0, &Derived::isDerived, &Derived::IsOfType<br>

};<br>

<br>

The symbol _ZTV4Base, representing the start of Base’s vtable set, will<br>

be defined as base_vtable+8; similarly for _ZTV7Derived and derived_vtable+8.<br>

<br>

To avoid bloating virtual tables by too much in degenerate cases, there will<br>

be a limit for each virtual function on the number of bits that the virtual<br>

function may contribute to the size of the vtable, which we may fix at (say)<br>

2 machine words.<br>

<br>

To give an idea of how the generated code would look, here is an asm snippet<br>

from the hottest virtual call from Chromium’s layout benchmarks before<br>

the transformation:<br>

<br>

 255c9ae:       48 8b 03                mov    (%rbx),%rax<br>

 255c9b1:       be 30 00 00 00          mov    $0x30,%esi<br>

 255c9b6:       48 89 df                mov    %rbx,%rdi<br>

 255c9b9:       ff 90 e0 02 00 00       callq  *0x2e0(%rax)<br>

 255c9bf:       84 c0                   test   %al,%al<br>

 255c9c1:       74 04                   je     255c9c7<br>

 255c9c3:       31 c0                   xor    %eax,%eax<br>

 255c9c5:       5b                      pop    %rbx<br>

 255c9c6:       c3                      retq<br>

<br>

And after:<br>

<br>

 255a8ee:       48 8b 03                mov    (%rbx),%rax<br>

 255a8f1:       f6 40 e6 01             testb  $0x1,-0x1a(%rax)<br>

 255a8f5:       74 04                   je     255a8fb<br>

 255a8f7:       31 c0                   xor    %eax,%eax<br>

 255a8f9:       5b                      pop    %rbx<br>

 255a8fa:       c3                      retq<br>

<br>

==LLVM IR-level design==<br>

Given a class name, how can we determine the closed set of derived classes and<br>

possible function pointers for a call site? As it turns out the IR already<br>

has a way of expressing this information: bitsets [1], which are already being<br>

used to implement CFI. We can encode the information for devirtualization<br>

by combining the @llvm.bitset.test and @llvm.assume intrinsics.<br>

<br>

%vtable = load i8**, i8*** %obj<br>

%p = call i1 @llvm.bitset.test(%vtable, “Base”)<br>

call void @llvm.assume(i1 %p) ; %vtable is assumed to point to<br>

                              ; _ZTV4Base+16 or _ZTV7Derived+16.<br>

%fptrptr = getelementptr i8* %vtable, 1<br>

%fptr = load i8*, i8** %fptrptr ; %fptr must point to Base::isOfType<br>

                                ; or Derived::isOfType.<br>

%fptr_casted = bitcast i8% %fptr to i1 (i8***, i32)<br>

%result = call i1 %fptr_casted(i8*** %obj, i32 0)<br>

<br>

This gives the optimizer all the information it needs to implement the<br>

optimization: the addresses of the virtual tables, the addresses of the<br>

function pointers (within the virtual table initializers) and the constant<br>

arguments. Note that the compiler does not need to do any of the work<br>

described in the above link to lay out globals or create bitsets if all<br>

calls to llvm.bitset.test are passed to llvm.assume.<br>

<br>

If virtual constant propagation succeeds, the transformed IR will look<br>

like this:<br>

<br>

%vtable = load i8*, i8** %obj<br>

%vtable_bit_addr = getelementptr i8* %vtable, i64 -17<br>

%vtable_bit = load i8 %vtable_bit_addr<br>

%vtable_bit_and = and i8 %vtable_bit, 1<br>

%result = icmp ne %vtable_bit_and, 0<br>

<br>

The pass that applies this transformation will be placed early in the LTO<br>

pipeline, before most of the regular optimization passes. The pass could later<br>

be extended to do general devirtualization based on the bitset information:<br>

<br>

 - The pass can implement single-deriver (i.e. only one derived class with a<br>

   non-pure virtual function definition) and single-implementation<br>

   (i.e. multiple derived classes all sharing a virtual function definition<br>

   from a base class) devirtualization by checking that each of the possible<br>

   values loaded from the vtables in the bitset are either identical or<br>

   __cxa_pure_virtual (calls to pure virtuals are UB, so we’d be fine<br>

   disregarding them), and propagating the identical address into the<br>

   function callee.<br>

 - The pass could also potentially implement speculative devirtualization<br>

   (i.e. testing the vtable entry and guarding direct calls on it) by<br>

   generating a comparison against the known address point. The pass will<br>

   know exactly how many possible classes there will be for each call site,<br>

   so we could have logic to speculatively devirtualize if that number is<br>

   sufficiently small.<br>

<br>

The design when CFI is enabled is slightly different, as we need to give<br>

the compiler the ability to eliminate the unnecessary llvm.bitset.test<br>

call. Recall that the IR looks like this:<br>

<br>

%vtable = load i8**, i8*** %obj<br>

%p = call i1 @llvm.bitset.test(%vtable, “Base”)<br>

br i1 %p, label %call, label %abort<br>

<br>

call:<br>

%fptr = load i8*, i8** %vtable<br>

%fptr_casted = bitcast i8% %fptr to i1 (i8***, i32)<br>

call i1 %fptr_casted(i8*** %obj, i32 0)<br>

<br>

abort:<br>

call void @llvm.trap()<br>

<br>

We cannot simply have our optimization pass transform this IR to eliminate the<br>

llvm.bitset.test call, as the transformation would not be semantics-preserving<br>

(we already have a CFI setting that emits diagnostics in the abort branch,<br>

which would break under such a transformation). Instead, we can introduce<br>

a new intrinsic, @llvm.bitset.check, to be used only in production mode<br>

(i.e. with diagnostics disabled):<br>

<br>

{i8*, i1} @llvm.bitset.check(i8* %ptr, metadata %name)<br>

<br>

If %ptr is in the bitset identified by %name, the function returns {%ptr,<br>

true}. But if %ptr is not in %name, the behaviour is undefined, modulo that<br>

if the second element of the result is non-zero, the program may load and<br>

call a function pointer from an address derived from the first element of the<br>

result without causing an indirect function call to any function other than<br>

one potentially loaded from one of the constant globals of which %name is<br>

a member (i.e. one of the valid vtables for %name). A CFI-protected virtual<br>

call would then look like this (eliding bitcasts for brevity):<br>

<br>

%vtable = load i8**, i8*** %obj<br>

%pair = call {i8*, i1} @llvm.bitset.check(i8* %vtable, “Base”)<br>

%checked_vtable = extractelement %pair, 0<br>

%p = extractelement %pair, 1<br>

br i1 %p, label %call, label %abort<br>

<br>

call:<br>

%fptr = load i8*, i8** %checked_vtable<br>

%result = call i1 %fptr(i8*** %obj, i32 0)<br>

<br>

abort:<br>

call void @llvm.trap()<br>

<br>

Applying virtual constant propagation to @llvm.bitset.check would be very<br>

similar to applying it to @llvm.bitset.test, but if the constant propagation<br>

succeeds, the function will return true as the second element:<br>

<br>

%vtable = load i8**, i8*** %obj<br>

br i1 true, label %call, label %abort<br>

<br>

call:<br>

%vtable_bit_addr = getelementptr i8* %vtable, i64 -17<br>

%vtable_bit = load i8 %vtable_bit_addr<br>

%vtable_bit_and = and i8 %vtable_bit, 1<br>

%result = icmp ne %vtable_bit_and, 0<br>

<br>

abort:<br>

call void @llvm.trap()<br>

<br>

Now the existing SimplifyCFG pass can easily eliminate the unnecessary branching.<br>

<br>

In order to allow for regular constant folding through the llvm.bitset.check<br>

intrinsic, the code in lib/Analysis/ConstantFolding.cpp would be taught to<br>

look through the intrinsic’s argument.<br>

<span class="HOEnZb"><font color="#888888"><br>

--<br>

Peter<br>

<br>

[0] <a href="https://code.google.com/p/chromium/issues/detail?id=580389" rel="noreferrer" target="_blank">https://code.google.com/p/chromium/issues/detail?id=580389</a><br>

[1] <a href="http://llvm.org/docs/BitSets.html" rel="noreferrer" target="_blank">http://llvm.org/docs/BitSets.html</a><br>

_______________________________________________<br>

LLVM Developers mailing list<br>

<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>

<a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>

</font></span></blockquote></div><br></div>