[LLVMdev] Tail calls (TCO) in PNaCL | PNaCl Bitcode reference manual

Fri Aug 2 00:26:55 PDT 2013

Hi Mark,

On 2013-08-02 04:11, Mark Seaborn wrote:
> That note in the documentation seems to be incorrect, because LLVM
> will do tail call optimisations on at least x86 when using the "ccc"
> calling convention.  For example:
> 
> $ cat tail_call1.c
> void foo(int arg);
> void bar(int arg) {
>   foo(arg);
> }
> 
> $ clang tail_call1.c -S -o - -O2
> ...
> bar:                                    # @bar
> ...
>     jmp    foo                     # TAILCALL
> ...

That's actually a sibling call, which is a much less general form.

See:

  http://llvm.org/releases/3.3/docs/CodeGenerator.html#sibling-call-optimization

The above C roughly translates into:

  declare i32 @callee(i32)
  define i32 @caller(i32 %a1) {
    %1 = tail call i32 @callee(i32 %a1)
    ret i32 %1
  }

Which, as you note, produces a jmp.  However, simply extending the
argument list of the callee breaks this on x86-32:

  declare i32 @callee(i32, i32)
  define i32 @caller(i32 %a1) {
    %1 = tail call i32 @callee(i32 %a1, i32 0)
    ret i32 %1
  }

That results in a call rather than a jmp.  You won't get a jmp back
unless you write:

  declare fastcc i32 @callee(i32, i32)
  define fastcc i32 @caller(i32 %a1) {
    %1 = tail call fastcc i32 @callee(i32 %a1, i32 0)
    ret i32 %1
  }

(Or use cc10 rather than fastcc.)

You can even break the tail call elimination by simply passing a value
not in the caller's incoming argument stack (or one in the wrong
position), again on x86-32:

  declare i32 @callee(i32)
  define i32 @caller(i32 %a1) {
    %a = add i32 %a1, %a1
    %1 = tail call i32 @callee(i32 %a)
    ret i32 %1
  }

x86-64 will treat more things as sibling calls because you need 7
arguments before anything reaches the stack.  But the same thing will
happen on X86-64 at that point.  This results in call rather than jmp:

declare i32 @callee(i32, i32, i32, i32, i32, i32, i32)
define i32 @caller(i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7) {
  %1 = tail call i32 @callee(i32 %a7, i32 %a6, i32 %a5, i32 %a4, i32 %a3, i32 %a2, i32 %a1)
  ret i32 %1
}

> However, LLVM doesn't emit a tail call at -O0.
> 
> Maybe what the documentation means to say is that tail call
> optimisations are only guaranteed to be done when using fastcc etc.?

It's a bit ambiguous because the author presumably intended to
distinguish tail calls from sibling calls as is the convention in the
C world.

>     Further, I notice that the document includes
>     "call" but not "tail call" in the list of supported instructions.
> 
> 
> That's an omission in the document.  We do actually allow the "tail
> call" instruction in PNaCl.

Great.

> However, we haven't specified any guarantees that a tail call
> optimisation will happen.  I suppose we could specify some
> guarantees if people want this.

I found a thread where Evan Cheng proposed removing -tailcallopt from
LLVM.  It resulted in a number of language implementers who require
guaranteed tail calls coming out of the woodwork to express their
displeasure at the idea.  It also resulted in some good discussion
about why this is necessary and desirable.

  http://lists.cs.uiuc.edu/pipermail/llvmdev/2010-February/029191.html

The ability to guarantee tail call elimination is required for
implementing functional languages.  The Scheme standard requires all
conforming implementations to guarantee that tail calls are
eliminated.

Microsoft would probably have never tried building F# on the CLR if
the CLR could not promise this.

Functional languages see tail call elimination as a matter of language
semantics rather than as an optimization.

> Maybe we could say that a "tail call" instruction is guaranteed to
> be turned into a real tail call if the callee function has the same
> number of arguments as the caller function or fewer?  I think that
> would work on all the architectures PNaCl targets.

Well, it would certainly be better than not doing it, and perhaps it
could be a step toward broader support, but it doesn't really solve
the issue as functional languages rely on the semantics of tail calls
being eliminated in all possible cases.

What you're describing here is more general than sibling calls, as
sibling calls have additional constraints such as requiring all
arguments passed to the callee on the stack come from the caller's
incoming argument stack.

We really should find a plan for achieving support for full tail call
elimination.  Otherwise we'll end up with a rather embarrassing
regression as compared to the CLR (and why should Microsoft have all
the fun?).

> Then we would have to change -O0 translation so that the guarantee
> is provided consistently at all optimisation levels.

That'd be ideal, yes.  When the client runs llc, is it going to run it
at a fixed or client determined optimization level or one specified by
metadata that follows the code?

Best,

-- TC