[LLVMdev] Small problem with the tail call elimination pass

Thu May 8 08:57:40 PDT 2014

Hello everybody,

On the documentation page for the tailcallelim pass you can read:

"This pass transforms functions that are prevented from being tail recursive by an 
associative expression to use an accumulator variable, thus compiling the typical 
naive factorial or fib implementation into efficient code”

However, I don’t see this behavior when trying to compile this variant of
the mentioned “typical naive fib implementation”:

unsigned int fib(unsigned int n) {
    return n < 2 ? n : fib(n-1) + fib(n-2);
}

The IR with clang -O3 (version 3.4) is this:
define i32 @_Z9fibj(i32 %n) #0 {
  %1 = icmp ult i32 %n, 2
  br i1 %1, label %8, label %2

; <label>:2                                       ; preds = %0
  %3 = add i32 %n, -1
  %4 = tail call i32 @_Z9fibj(i32 %3)
  %5 = add i32 %n, -2
  %6 = tail call i32 @_Z9fibj(i32 %5)
  %7 = add i32 %6, %4
  ret i32 %7

; <label>:8                                       ; preds = %0
  ret i32 %n
}

You see that the tail call is not eliminated. However, it does get eliminated
if I change to this code:

unsigned int fib(unsigned int n) {
    return n <= 2 ? 1 : fib(n-1) + fib(n-2);
}

IR:
define i32 @_Z9fibj(i32 %n) #0 {
  %1 = icmp ult i32 %n, 3
  br i1 %1, label %tailrecurse._crit_edge, label %tailrecurse

tailrecurse:                                      ; preds = %0, %tailrecurse
  %n.tr2 = phi i32 [ %4, %tailrecurse ], [ %n, %0 ]
  %accumulator.tr1 = phi i32 [ %5, %tailrecurse ], [ 1, %0 ]
  %2 = add i32 %n.tr2, -1
  %3 = tail call i32 @_Z9fibj(i32 %2)
  %4 = add i32 %n.tr2, -2
  %5 = add i32 %3, %accumulator.tr1
  %6 = icmp ult i32 %4, 3
  br i1 %6, label %tailrecurse._crit_edge, label %tailrecurse

tailrecurse._crit_edge:                           ; preds = %tailrecurse, %0
  %accumulator.tr.lcssa = phi i32 [ 1, %0 ], [ %5, %tailrecurse ]
  ret i32 %accumulator.tr.lcssa
}

Now, note that gcc 4.8 compiles well both the examples, and the
resulting loop gets also unrolled a dozen times.

I’m compiling with -O3 with both compilers.

So what is stopping the tail call elimination pass to doing the same
thing?

Note that I don’t write this email because I need a performing
fibonacci function… What I’m trying to do is to track down the 
reason why fortran code is performing better than C on the
source code used as benchmark in the main page of the 
Julia language in [1]. I wasn’t expecting this huge difference
between fortran and C code on an example as simple as this, so I
tried to investigate, even if I don’t know anything about fortran.
Their source code use int instead of unsigned as in my example.
I don’t know quite why, but this slows down everything both in gcc
and clang, because of something related to how the two different
languages handle overflow in integer arithmetic, I think. When I use
unsigned int, gcc performance matches gfortran’s, and the generated
code is nearly identical, while clang stays way behind because of the
missing tail call elimination.

Note that even with the modified example where the tail call does
get eliminated, a simple benchmark shows that clang’s code
is still 2-3x times slower than gcc's, I suppose because of the missing
unrolling.

Note that in that benchmark, Julia seems winning too, and they use
LLVM as the backend so something’s missing...

Can you help me understand what’s happening?

Thank you,
Nicola

[1] http://julialang.org