[llvm-dev] Proposal: virtual constant propagation

Thu Jan 28 09:48:05 PST 2016

Peter,

Interesting idea. One small extension that can be done is to allow the
optimization to work on a value range. For instance, if the argument type
is has know limited range (enum for instance), you can still optimize it
when the argument is not a constant via proper bit mapping/shifting. In
your example, the following is also devirtualizable.

bool g(Base *b, Type t) {
  return b->isOfType(t);
}

David

On Wed, Jan 27, 2016 at 7:57 PM, Peter Collingbourne via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> Hi all,
>
> I'd like to make the following proposal to implement an optimization called
> virtual constant propagation.
>
> ==Introduction==
> After enabling control flow integrity protection in Chromium, we have
> observed an unacceptable performance regression in certain critical layout
> microbenchmarks. Profiling [0] revealed that the cause of the regression
> was
> a large number of virtual calls, each requiring the overhead of a control
> flow integrity check.
>
> We took a closer look at the hottest virtual call sites, and found that
> they
> were mostly type checking functions of the following form:
>
> struct Base {
>   enum Type { kBase, kDerived };
>   virtual bool isDerived() const { return false; } // or:
>   virtual bool isOfType(Type t) const { return t == kBase; }
> };
> struct Derived : Base {
>   virtual bool isDerived() const { return true; }
>   virtual bool isOfType(Type t) const { return t == kDerived; }
> };
>
> bool f(Base *b) {
>   return b->isDerived();
> }
> bool g(Base *b) {
>   return b->isOfType(Base::kDerived);
> }
>
> We can make the observation that if the type hierarchy is known to be
> closed
> and the definition of each virtual function and each virtual call is
> visible
> to the compiler, calls to either of these functions can be implemented more
> efficiently by loading directly from the vtable (we refer to this as
> virtual
> constant propagation):
>
> struct Base_VTable {
>   bool isDerived : 1, isOfTypeBase : 1, isOfTypeDerived : 1;
> };
> Base_VTable base_vtable{false, true, false};
> Derived_VTable derived_vtable{true, false, true};
> bool f(Base *b) {
>   return b->vtable->isDerived;
> }
> bool g(Base *b) {
>   return b->vtable->isOfTypeDerived;
> }
>
> Another advantage of implementing the calls this way is that because there
> is no indirect call taking place, no control flow integrity check is
> necessary.
>
> I implemented a prototype of this idea that specifically targeted the
> hottest
> virtual function in one of Chromium’s layout benchmarks, and observed a
> performance improvement of median 1.62%, min -4.72%, max 204.52% for the
> layout benchmark suite. I’d like to emphasise that these numbers are from
> a regular LTO build without CFI, and that we expect to see better
> performance
> from a general implementation that targets all virtual functions.
>
> ==User interface==
> To instruct the compiler to assume that all type hierarchies are closed, a
> user can pass the -fwhole-program-vtables flag. The -fwhole-program-vtables
> flag requires the -flto flag to also be specified.
>
> Of course, there may be some type hierarchies that are not entirely
> closed, but
> the underlying assumption is that most hierarchies will not be. To support
> open
> hierarchies the user can also specify the path to a blacklist that lists
> the
> names of the open class hierarchies using a
> -fwhole-program-vtables-blacklist=
> flag. By default, the compiler will assume that hierarchies in the std
> namespace are open.
>
> ==High level design==
> Devirtualization will take place at LTO time. Candidates for virtual
> constant
> propagation will be found by searching for call sites with all-constant
> arguments to virtual functions which in all derived classes are readnone,
> which means that it is a pure function of its arguments. For each (virtual
> function, arguments) candidate pair, the compiler will compute a offset in
> bits (positive or negative) from the class’s vtable address point in which
> to store the function result, extend the size of any vtable objects to
> cover
> any such offsets, evaluate the function to compute a result and store it
> at the correct offsets. To avoid breaking ABI, the vtable layout will not
> be changed in any way. This includes pointers to optimized functions in the
> vtable, and (for Itanium ABI) the distance between address points within a
> single vtable symbol.
>
> For example, the vtable for Base and Derived will look like this:
>
> struct Base_VTable {
>   bool isDerived : 1, isOfTypeBase : 1, isOfTypeDerived : 1;
>   size_t padding : 61;
>   size_t offset_to_top;
>   size_t rtti;
>   bool (*isDerived)(const Base *this);
>   bool (*isOfType)(const Base *this, Type t);
> };
> Base_VTable base_vtable{
>   false, true, false, 0, 0, 0, &Base::isDerived, &Base::IsOfType
> };
> Base_VTable derived_vtable{
>   True, false, true, 0, 0, 0, &Derived::isDerived, &Derived::IsOfType
> };
>
> The symbol _ZTV4Base, representing the start of Base’s vtable set, will
> be defined as base_vtable+8; similarly for _ZTV7Derived and
> derived_vtable+8.
>
> To avoid bloating virtual tables by too much in degenerate cases, there
> will
> be a limit for each virtual function on the number of bits that the virtual
> function may contribute to the size of the vtable, which we may fix at
> (say)
> 2 machine words.
>
> To give an idea of how the generated code would look, here is an asm
> snippet
> from the hottest virtual call from Chromium’s layout benchmarks before
> the transformation:
>
>  255c9ae:       48 8b 03                mov    (%rbx),%rax
>  255c9b1:       be 30 00 00 00          mov    $0x30,%esi
>  255c9b6:       48 89 df                mov    %rbx,%rdi
>  255c9b9:       ff 90 e0 02 00 00       callq  *0x2e0(%rax)
>  255c9bf:       84 c0                   test   %al,%al
>  255c9c1:       74 04                   je     255c9c7
>  255c9c3:       31 c0                   xor    %eax,%eax
>  255c9c5:       5b                      pop    %rbx
>  255c9c6:       c3                      retq
>
> And after:
>
>  255a8ee:       48 8b 03                mov    (%rbx),%rax
>  255a8f1:       f6 40 e6 01             testb  $0x1,-0x1a(%rax)
>  255a8f5:       74 04                   je     255a8fb
>  255a8f7:       31 c0                   xor    %eax,%eax
>  255a8f9:       5b                      pop    %rbx
>  255a8fa:       c3                      retq
>
> ==LLVM IR-level design==
> Given a class name, how can we determine the closed set of derived classes
> and
> possible function pointers for a call site? As it turns out the IR already
> has a way of expressing this information: bitsets [1], which are already
> being
> used to implement CFI. We can encode the information for devirtualization
> by combining the @llvm.bitset.test and @llvm.assume intrinsics.
>
> %vtable = load i8**, i8*** %obj
> %p = call i1 @llvm.bitset.test(%vtable, “Base”)
> call void @llvm.assume(i1 %p) ; %vtable is assumed to point to
>                               ; _ZTV4Base+16 or _ZTV7Derived+16.
> %fptrptr = getelementptr i8* %vtable, 1
> %fptr = load i8*, i8** %fptrptr ; %fptr must point to Base::isOfType
>                                 ; or Derived::isOfType.
> %fptr_casted = bitcast i8% %fptr to i1 (i8***, i32)
> %result = call i1 %fptr_casted(i8*** %obj, i32 0)
>
> This gives the optimizer all the information it needs to implement the
> optimization: the addresses of the virtual tables, the addresses of the
> function pointers (within the virtual table initializers) and the constant
> arguments. Note that the compiler does not need to do any of the work
> described in the above link to lay out globals or create bitsets if all
> calls to llvm.bitset.test are passed to llvm.assume.
>
> If virtual constant propagation succeeds, the transformed IR will look
> like this:
>
> %vtable = load i8*, i8** %obj
> %vtable_bit_addr = getelementptr i8* %vtable, i64 -17
> %vtable_bit = load i8 %vtable_bit_addr
> %vtable_bit_and = and i8 %vtable_bit, 1
> %result = icmp ne %vtable_bit_and, 0
>
> The pass that applies this transformation will be placed early in the LTO
> pipeline, before most of the regular optimization passes. The pass could
> later
> be extended to do general devirtualization based on the bitset information:
>
>  - The pass can implement single-deriver (i.e. only one derived class with
> a
>    non-pure virtual function definition) and single-implementation
>    (i.e. multiple derived classes all sharing a virtual function definition
>    from a base class) devirtualization by checking that each of the
> possible
>    values loaded from the vtables in the bitset are either identical or
>    __cxa_pure_virtual (calls to pure virtuals are UB, so we’d be fine
>    disregarding them), and propagating the identical address into the
>    function callee.
>  - The pass could also potentially implement speculative devirtualization
>    (i.e. testing the vtable entry and guarding direct calls on it) by
>    generating a comparison against the known address point. The pass will
>    know exactly how many possible classes there will be for each call site,
>    so we could have logic to speculatively devirtualize if that number is
>    sufficiently small.
>
> The design when CFI is enabled is slightly different, as we need to give
> the compiler the ability to eliminate the unnecessary llvm.bitset.test
> call. Recall that the IR looks like this:
>
> %vtable = load i8**, i8*** %obj
> %p = call i1 @llvm.bitset.test(%vtable, “Base”)
> br i1 %p, label %call, label %abort
>
> call:
> %fptr = load i8*, i8** %vtable
> %fptr_casted = bitcast i8% %fptr to i1 (i8***, i32)
> call i1 %fptr_casted(i8*** %obj, i32 0)
>
> abort:
> call void @llvm.trap()
>
> We cannot simply have our optimization pass transform this IR to eliminate
> the
> llvm.bitset.test call, as the transformation would not be
> semantics-preserving
> (we already have a CFI setting that emits diagnostics in the abort branch,
> which would break under such a transformation). Instead, we can introduce
> a new intrinsic, @llvm.bitset.check, to be used only in production mode
> (i.e. with diagnostics disabled):
>
> {i8*, i1} @llvm.bitset.check(i8* %ptr, metadata %name)
>
> If %ptr is in the bitset identified by %name, the function returns {%ptr,
> true}. But if %ptr is not in %name, the behaviour is undefined, modulo that
> if the second element of the result is non-zero, the program may load and
> call a function pointer from an address derived from the first element of
> the
> result without causing an indirect function call to any function other than
> one potentially loaded from one of the constant globals of which %name is
> a member (i.e. one of the valid vtables for %name). A CFI-protected virtual
> call would then look like this (eliding bitcasts for brevity):
>
> %vtable = load i8**, i8*** %obj
> %pair = call {i8*, i1} @llvm.bitset.check(i8* %vtable, “Base”)
> %checked_vtable = extractelement %pair, 0
> %p = extractelement %pair, 1
> br i1 %p, label %call, label %abort
>
> call:
> %fptr = load i8*, i8** %checked_vtable
> %result = call i1 %fptr(i8*** %obj, i32 0)
>
> abort:
> call void @llvm.trap()
>
> Applying virtual constant propagation to @llvm.bitset.check would be very
> similar to applying it to @llvm.bitset.test, but if the constant
> propagation
> succeeds, the function will return true as the second element:
>
> %vtable = load i8**, i8*** %obj
> br i1 true, label %call, label %abort
>
> call:
> %vtable_bit_addr = getelementptr i8* %vtable, i64 -17
> %vtable_bit = load i8 %vtable_bit_addr
> %vtable_bit_and = and i8 %vtable_bit, 1
> %result = icmp ne %vtable_bit_and, 0
>
> abort:
> call void @llvm.trap()
>
> Now the existing SimplifyCFG pass can easily eliminate the unnecessary
> branching.
>
> In order to allow for regular constant folding through the
> llvm.bitset.check
> intrinsic, the code in lib/Analysis/ConstantFolding.cpp would be taught to
> look through the intrinsic’s argument.
>
> --
> Peter
>
> [0] https://code.google.com/p/chromium/issues/detail?id=580389
> [1] http://llvm.org/docs/BitSets.html
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160128/52bbf1b9/attachment.html>