[llvm-commits] Trampoline changes

Sun Sep 9 01:46:21 PDT 2007

Hi Nick,

> > It is easy to construct testcases for which this works, but sadly I was unable to get
> > the optimization to occur even once in a trampoline heavy real-world program without
> > jacking-up the inlining limit hugely (to 100000; 10000 wasn't enough).  Still, I'm
> > hoping that it may sometimes be useful.
> 
> Why does changing the inliner threshold help? Are you talking about 
> inlining the trampoline itself or the llvm.init.trampoline intrinsic? 
> Could we adjust the weights in the inliner so that it does it more 
> often, seeing as a trampoline call is much more expensive than a regular 
> call?
> 
> Would you mind sending a small before and after example for the 
> transformation implemented by your instcombine patch?

consider this example:

int process(int (*func)(int)) {
  return func(1);
}

int nest(int n)
{
  int f(int m) {
    return m == n;
  }
  return process(&f);
}

A pointer to the nested function "f" is passed to "process".  Note that f accesses "n", an
argument to the parent function "nest".  When process calls f, somehow f has to be able to
get the value of n, even though n wasn't passed to process.  This is done using black magic.
In fact process is not passed a pointer to f.  Instead a small piece of code is contructed
on the stack of nest, and a pointer to this code (the trampoline) is passed to process instead.
The trampoline code calls f, passing it any extra info it needs, like n.

To see how this works exactly, consider the corresponding bitcode:

	%struct.FRAME.nest = type { i32, i32 (i32)* }
	%struct.__builtin_trampoline = type { [10 x i8] }

declare i8* @llvm.init.trampoline(i8*, i8*, i8*)

define i32 @process(i32 (i32)* %func) {
entry:
	%tmp2 = tail call i32 %func( i32 1 )		; <i32> [#uses=1]
	ret i32 %tmp2
}

define internal i32 @f.934(%struct.FRAME.nest* nest  %CHAIN.1, i32 %m) {
entry:
	%tmp3 = getelementptr %struct.FRAME.nest* %CHAIN.1, i32 0, i32 0		; <i32*> [#uses=1]
	%tmp4 = load i32* %tmp3, align 4		; <i32> [#uses=1]
	%tmp7 = icmp eq i32 %tmp4, %m		; <i1> [#uses=1]
	%tmp78 = zext i1 %tmp7 to i32		; <i32> [#uses=1]
	ret i32 %tmp78
}

define i32 @nest(i32 %n) {
entry:
	%FRAME.0 = alloca %struct.FRAME.nest, align 8		; <%struct.FRAME.nest*> [#uses=3]
	%TRAMP.216 = alloca [10 x i8], align 16		; <[10 x i8]*> [#uses=1]
	%TRAMP.216.sub = getelementptr [10 x i8]* %TRAMP.216, i32 0, i32 0		; <i8*> [#uses=1]
	%tmp3 = getelementptr %struct.FRAME.nest* %FRAME.0, i32 0, i32 0		; <i32*> [#uses=1]
	store i32 %n, i32* %tmp3, align 8
	%FRAME.06 = bitcast %struct.FRAME.nest* %FRAME.0 to i8*		; <i8*> [#uses=1]
	%tramp = call i8* @llvm.init.trampoline( i8* %TRAMP.216.sub, i8* bitcast (i32 (%struct.FRAME.nest* nest , i32)* @f.934 to i8*), i8* %FRAME.06 )		; <i8*> [#uses=1]
	%tmp7 = getelementptr %struct.FRAME.nest* %FRAME.0, i32 0, i32 1		; <i32 (i32)**> [#uses=1]
	%tmp89 = bitcast i8* %tramp to i32 (i32)*		; <i32 (i32)*> [#uses=2]
	store i32 (i32)* %tmp89, i32 (i32)** %tmp7, align 8
	%tmp13 = call i32 @process( i32 (i32)* %tmp89 )		; <i32> [#uses=1]
	ret i32 %tmp13
}

First note that the nested function f has been turned into @f.934 and all of a sudden has an extra
parameter %CHAIN.1  This parameter is used to pass in the arguments and variables (such as n) of the
parent function nest.  If you look closely you will see that it is a pointer to a struct that holds
both n and a function pointer suitable for calling f (in more complicated situations this function
pointer needs to live in this frame struct, but here it would have been OK to simply be another
variable on nest's stack).  This function pointer is %tramp, the result of the @llvm.init.trampoline
call.  It is %tramp that is passed to process, after being bitcast to the right type.

Now take a look at this part of nest after inlining @process:

        %tramp = call i8* @llvm.init.trampoline( i8* %TRAMP.216.sub, i8* bitcast (i32 (%struct.FRAME.nest* nest , i32)* @f.934 to i8*), i8* %FRAME.06 )         ; <i8*> [#uses=1]
        %tmp7 = getelementptr %struct.FRAME.nest* %FRAME.0, i32 0, i32 1                ; <i32 (i32)**> [#uses=1]
        %tmp89 = bitcast i8* %tramp to i32 (i32)*               ; <i32 (i32)*> [#uses=2]
        store i32 (i32)* %tmp89, i32 (i32)** %tmp7, align 8
        %tmp2.i = call i32 %tmp89( i32 1 )              ; <i32> [#uses=1]

The use of a trampoline is now pointless: since we are inside nest the extra argument to f
is directly available.  So we could rewrite the "call i32 %tmp89( i32 1 )" with a direct call
to f.  This is what the new optimization does.  Here is the result:

        %tramp = call i8* @llvm.init.trampoline( i8* %TRAMP.216.sub, i8* bitcast (i32 (%struct.FRAME.nest* nest , i32)* @f.934 to i8*), i8* %FRAME.06 )         ; <i8*> [#uses=1]
        %tmp7 = getelementptr %struct.FRAME.nest* %FRAME.0, i32 0, i32 1                ; <i32 (i32)**> [#uses=1]
        %tmp89 = bitcast i8* %tramp to i32 (i32)*               ; <i32 (i32)*> [#uses=1]
        store i32 (i32)* %tmp89, i32 (i32)** %tmp7, align 8
        %nest = bitcast i8* %FRAME.06 to %struct.FRAME.nest*            ; <%struct.FRAME.nest*> [#uses=1]
        %tmp2.i1 = call i32 @f.934( %struct.FRAME.nest* %nest nest , i32 1 )            ; <i32> [#uses=1]

The call to the trampoline has been replaced by a direct call to @f.934, with the extra argument it requires
spliced in.  Note that the call to init_trampoline has not been removed even though it is now useless.  I hope
to do something about this later.

Let me now explain the inlining problem.  First note that in practice the only reason to take a pointer to
a nested function is to pass them to some outside routine like process.  Since otherwise you could just
call the nested function directly.  Second, in order to simplify a trampoline call the extra arguments ("n")
to the nested function ("f") have to be available at the point of the call.  This is only going to happen
if process (or whatever) has been inlined into the parent function ("nest").  But inlining is only going
to happen if the outside function is small.

The Ada containers library expects users to pass in function pointers and in many situations you are
obliged to pass in a pointer to a nested function.  Since the gcc Ada front-end implements these pointers
using trampolines this means that using the container library results in a lot of trampoline calls.
(The front-end could have made use of "fat pointers" instead, avoiding trampolines; I hear that they plan
to do this at some point exactly because of the cost of trampolines, but for the moment it's trampolines).
My hope was that some of these trampoline calls could be eliminated using this optimization.  But for
that to happen, the routines in the container library have to be small enough to be inlined.  They are
not.  In fact they're rather small in terms of lines of code, but exception handling etc really bulks
up the LLVM bitcode making the inliner pass them over.

Ciao,

Duncan.