[cfe-dev] RFC: A new ABI for virtual calls, and a change to the virtual call representation in the IR

Fri Mar 4 09:32:41 PST 2016

> On Feb 29, 2016, at 1:53 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:
> 
> Hi all,
> 
> I'd like to make a proposal to implement the new vtable ABI described in
> PR26723, which I'll call the relative ABI. That bug gives more details and
> justification for that ABI.
> 
> The user interface for the new ABI would be that -fwhole-program-vtables
> would take an optional value indicating which aspects of the program have
> whole-program scope. For example, the existing implementation of whole-program
> vcall optimization allows external code to call into translation units
> compiled with -fwhole-program-vtables, but does not allow external code to
> derive from classes defined in such translation units, so you could request
> the current behaviour with "-fwhole-program-vtables=derive", which means
> that derived classes are not allowed from outside the program. To request
> the new ABI, you can specify "-fwhole-program-vtables=call,derive",
> which means that calls and derived classes are both not allowed from
> outside the program. "-fwhole-program-vtables" would be short for
> "-fwhole-program-vtables=call,derive,anythingelseweaddinfuture".
> 
> I'll also make the observation that the new ABI does not require LTO or
> whole-program visibility at compile time; to decide whether to use the new
> ABI for a class, we just need to check that it and its bases are not in the
> whole-program-vtables blacklist.
> 
> At the same time, I'd like to change how virtual calls are represented in
> the IR. This is for a few reasons:
> 
> 1) Would allow whole-program virtual call optimization to work well with the
>   relative ABI. With the current representation, this ABI would complicate the
>   IR at call sites and make matching and rewriting harder.
> 
> 2) Simplifies the whole-program virtual call optimization pass. Currently we
>   need to walk uses in the IR in order to determine the slot and callees for
>   each call site. This can all be avoided with a simpler representation.
> 
> 3) Would make it easier to implement dead virtual function stripping. This would
>   involve reshaping any vtable initializers and rewriting call
>   sites. Implementing this correctly is harder than it needs to be because
>   of the current representation.
> 
> My proposal is to add the following new intrinsics:

Thanks, I'm really glad you're moving so quickly to improve the IR representation after our previous discussion. These intrinsics look a lot friendlier to me! :)
(Even if I still can't make sense of the "bitset" terminology used to represent the hierarchy in the metadata part.)

> 
> i32 @llvm.vtable.slot.offset(metadata, i32)
> 
> This intrinsic takes a bitset name B and a slot index I. It returns the byte
> offset of the I'th virtual function pointer in each of the vtables in B.
> 
> i8* @llvm.vtable.load(i8*, i32)

Why does vtable.load take a byte offset instead of a slot index directly? (The IR could be simpler if it did not require a call to @llvm.vtable.slot.offset() for every @llvm.vtable.load().)

-- 
Mehdi

> This intrinsic takes a virtual table pointer and a byte offset, and loads
> a virtual function pointer from the virtual table at the given offset.
> 
> i8* @llvm.vtable.load.relative(i8*, i32)
> 
> This intrinsic is the same as above, but it uses the relative ABI.
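
A minimal C++ sketch of what the relative load computes, assuming (as the example initializers later in this message suggest) that each entry is a signed 32-bit offset measured from the vtable's address point. The names vtable_load_relative, ToyImage and make_toy_image are hypothetical, and the "function bodies" are plain bytes standing in for real code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Read the signed 32-bit entry at `byte_offset` from the address point, then
// add it back to the address point to recover the target address.
inline const unsigned char *vtable_load_relative(const unsigned char *address_point,
                                                 std::int32_t byte_offset) {
    std::int32_t rel;
    std::memcpy(&rel, address_point + byte_offset, sizeof rel);
    return address_point + rel;
}

// Toy image: two relative entries followed by two stand-in "function bodies".
// The address point is assumed to be the start of the struct.
struct ToyImage {
    std::int32_t slots[2];  // relative entries for f and g
    unsigned char f_body;   // stand-in for A::f
    unsigned char g_body;   // stand-in for A::g
};

inline ToyImage make_toy_image() {
    ToyImage img{};
    // Each entry is (target - address point), so the image can be copied
    // or relocated without fixing up the entries.
    img.slots[0] = static_cast<std::int32_t>(offsetof(ToyImage, f_body));
    img.slots[1] = static_cast<std::int32_t>(offsetof(ToyImage, g_body));
    return img;
}
```

In this toy layout, byte offsets 0 and 4 select the entries for f and g. Because the entries are relative, the lookup still resolves correctly after the image is returned by value.
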
> 
> {i8*, i1} @llvm.vtable.checked.load(metadata %name, i8*, i32)
> {i8*, i1} @llvm.vtable.checked.load.relative(metadata %name, i8*, i32)
> 
> These intrinsics would be used to implement CFI. They are similar to the
> unchecked intrinsics, but if the second element of the result is non-zero,
> the program may call the first element of the result as a function pointer
> without causing an indirect function call to any function other than one
> potentially loaded from one of the constant globals of which %name is a member.
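
A rough C++ sketch of the contract described for the checked variant, under loud assumptions: bitset membership is modelled as a std::set of address points (a real implementation would use a cheaper check), the failure case returns a null pointer (the proposal does not say what the first element holds on failure), and all names here are hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <set>
#include <utility>

using AddressPoint = const unsigned char *;

// Returns {target, true} only when `vtable` is a member of the bitset;
// otherwise returns {nullptr, false} (an assumption made for this sketch).
inline std::pair<const unsigned char *, bool>
vtable_checked_load_relative(const std::set<AddressPoint> &bitset_members,
                             AddressPoint vtable, std::int32_t byte_offset) {
    if (bitset_members.count(vtable) == 0)
        return {nullptr, false};
    std::int32_t rel;
    std::memcpy(&rel, vtable + byte_offset, sizeof rel);
    return {vtable + rel, true};
}

// Toy vtable with one relative entry and a stand-in "function body".
struct CheckedToy {
    std::int32_t slot;
    unsigned char body;
};
```

A caller would test the second element before calling through the first, which mirrors the {i8*, i1} result of the proposed intrinsics.
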
> 
> To minimize the impact on existing passes, the intrinsics would be lowered
> early during the regular pipeline when LTO is disabled, or early in the LTO
> pipeline when LTO is enabled. Clang would not use the llvm.vtable.slot.offset
> intrinsic when LTO is disabled, as bitset information would be unavailable.
> 
> To give the optimizer permission to reshape vtable initializers for a
> particular class, the vtable would be added to a special named metadata node
> named 'llvm.vtable.slots'. The presence of this metadata would guarantee
> that all loads beyond a given byte offset (this range would not include the
> RTTI pointer for example) are done using the above intrinsics.
> 
> We will also take advantage of the ABI break to split the class's virtual
> table group at virtual table boundaries into separate globals instead of
> emitting all virtual tables in the group into a single global. This will
> not only simplify the implementation of dead virtual function stripping,
> but also reduce code size overhead for CFI. (CFI works best if vtables for
> a base class can be laid out near vtables for derived classes; the current
> ABI makes this harder to achieve.)
> 
> Example (using the relative ABI):
> 
> struct A {
>  virtual void f();
>  virtual void g();
> };
> 
> struct B {
>  virtual void h();
> };
> 
> struct C : A, B {
>  virtual void f();
>  virtual void g();
>  virtual void h();
> };
> 
> void fcall(A *a) {
>  a->f();
> }
> 
> void gcall(A *a) {
>  a->g();
> }
> 
> typedef void (A::*mfp)();
> 
> mfp getmfp() {
>  return &A::g;
> }
> 
> void callmfp(A *a, mfp m) {
>  (a->*m)();
> }
> 
> In IR:
> 
> @A_vtable = {i8*, i8*, i32, i32} {0, @A::rtti, @A::f - (@A_vtable + 16), @A::g - (@A_vtable + 16)}
> @B_vtable = {i8*, i8*, i32} {0, @B::rtti, @B::h - (@B_vtable + 16)}
> @C_vtable0 = {i8*, i8*, i32, i32, i32} {0, @C::rtti, @C::f - (@C_vtable0 + 16), @C::g - (@C_vtable0 + 16), @C::h - (@C_vtable0 + 16)}
> @C_vtable1 = {i8*, i8*, i32} {-8, @C::rtti, @C::h - (@C_vtable1 + 16)}
> 
> define void @fcall(%A* %a) {
>  %slot = call i32 @llvm.vtable.slot.offset(!"A", i32 0)
>  %vtable = load i8* %a
>  %fp = call i8* @llvm.vtable.load.relative(%vtable, %slot)
>  %casted_fp = bitcast i8* %fp to void (%A*)*
>  call void %casted_fp(%a)
>  ret void
> }
> 
> define void @gcall(%A* %a) {
>  %slot = call i32 @llvm.vtable.slot.offset(!"A", i32 1)
>  %vtable = load i8* %a
>  %fp = call i8* @llvm.vtable.load.relative(%vtable, %slot)
>  %casted_fp = bitcast i8* %fp to void (%A*)*
>  call void %casted_fp(%a)
>  ret void
> }
> 
> define {i8*, i8*} @getmfp() {
>  %slot = call i32 @llvm.vtable.slot.offset(!"A", i32 1)
>  %slotp1 = add %slot, 1
>  %result = insertvalue {i8*, i8*} {i8* 0, i8* 0}, 0, %slotp1
>  ret {i8*, i8*} %result
> }
> 
> define void @callmfp(%A* %a, {i8*, i8*} %m) {
>  ; assuming the call is virtual and no this adjustment
>  %slot = extractvalue {i8*, i8*} %m, 0
>  %slotm1 = sub %slot, 1
>  %vtable = load i8* %a
>  %fp = call i8* @llvm.vtable.load.relative(%vtable, %slotm1)
>  %casted_fp = bitcast i8* %fp to void (%A*)*
>  call void %casted_fp(%a)
>  ret void
> }
> 
> !0 = {!"A", @A_vtable, 16}
> !1 = {!"B", @B_vtable, 16}
> !2 = {!"A", @C_vtable0, 16}
> !3 = {!"B", @C_vtable1, 16}
> !4 = {!"C", @C_vtable0, 16}
> !llvm.bitsets = {!0, !1, !2, !3, !4}
> 
> !5 = {@A_vtable, 16}
> !6 = {@B_vtable, 16}
> !7 = {@C_vtable0, 16}
> !8 = {@C_vtable1, 16}
> !llvm.vtable.slots = {!5, !6, !7, !8}
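
The +1/-1 arithmetic in getmfp and callmfp above matches the Itanium C++ ABI convention for pointers to virtual member functions, where the first field holds 1 plus the byte offset of the slot. A small C++ sketch of that encoding, with hypothetical names (MemFnPtr, encode_virtual, is_virtual, decode_slot_offset) and ignoring the this-adjustment field, as the example does:

```cpp
#include <cassert>
#include <cstdint>

// Two-word member function pointer, modelled after the {i8*, i8*} pair in
// the IR above. `adj` (the this adjustment) is left at zero in this sketch.
struct MemFnPtr {
    std::intptr_t ptr;
    std::intptr_t adj;
};

// getmfp: store the slot's byte offset plus one.
inline MemFnPtr encode_virtual(std::int32_t slot_byte_offset) {
    return {static_cast<std::intptr_t>(slot_byte_offset) + 1, 0};
}

// Slot offsets are multiples of the entry size, so an odd value marks a
// virtual entry (assumes non-virtual function addresses are even).
inline bool is_virtual(const MemFnPtr &m) {
    return (m.ptr & 1) != 0;
}

// callmfp: subtract one to recover the byte offset passed to the load.
inline std::int32_t decode_slot_offset(const MemFnPtr &m) {
    return static_cast<std::int32_t>(m.ptr - 1);
}
```

For instance, if the slot for A::g ends up at byte offset 4 (an assumption: 4-byte relative entries, offsets measured from the address point), the encoded first field would be 5.
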
> 
> Thanks,
> -- 
> Peter