<div dir="ltr">I don't know who wrote that description of TRE, but your fib implementation is obviously not tail recursive, so I don't know what we could do with it.  Maybe the author had this version in mind:<div>

<br></div><div><div>int fib_help(int a, int b, int n) {</div><div>  if (n > 0)</div><div>    return fib_help(b, a+b, n-1);</div><div>  return a;</div><div>}</div><div>int fib(int n) {</div><div>  return fib_help(0, 1, n);</div>

<div>}</div><div><br></div></div><div>TRE can turn the helper into a loop here.</div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, May 8, 2014 at 8:57 AM, Nicola Gigante <span dir="ltr"><<a href="mailto:nicola.gigante@gmail.com" target="_blank">nicola.gigante@gmail.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hello everybody,<br>

<br>

On the documentation page for the tailcallelim pass you can read:<br>

<br>

"This pass transforms functions that are prevented from being tail recursive by an<br>

associative expression to use an accumulator variable, thus compiling the typical<br>

naive factorial or fib implementation into efficient code”<br>

<br>

However, I don’t see this behavior when trying to compile this variant of<br>

the mentioned “typical naive fib implementation”:<br>

<br>

unsigned int fib(unsigned int n) {<br>

    return n < 2 ? n : fib(n-1) + fib(n-2);<br>

}<br>

<br>

The IR with clang -O3 (version 3.4) is this:<br>

define i32 @_Z9fibj(i32 %n) #0 {<br>

  %1 = icmp ult i32 %n, 2<br>

  br i1 %1, label %8, label %2<br>

<br>

; <label>:2                                       ; preds = %0<br>

  %3 = add i32 %n, -1<br>

  %4 = tail call i32 @_Z9fibj(i32 %3)<br>

  %5 = add i32 %n, -2<br>

  %6 = tail call i32 @_Z9fibj(i32 %5)<br>

  %7 = add i32 %6, %4<br>

  ret i32 %7<br>

<br>

; <label>:8                                       ; preds = %0<br>

  ret i32 %n<br>

}<br>

<br>

You see that the tail call is not eliminated. However, it does get eliminated<br>

if I change to this code:<br>

<br>

unsigned int fib(unsigned int n) {<br>

    return n <= 2 ? 1 : fib(n-1) + fib(n-2);<br>

}<br>

<br>

IR:<br>

define i32 @_Z9fibj(i32 %n) #0 {<br>

  %1 = icmp ult i32 %n, 3<br>

  br i1 %1, label %tailrecurse._crit_edge, label %tailrecurse<br>

<br>

tailrecurse:                                      ; preds = %0, %tailrecurse<br>

  %n.tr2 = phi i32 [ %4, %tailrecurse ], [ %n, %0 ]<br>

  %accumulator.tr1 = phi i32 [ %5, %tailrecurse ], [ 1, %0 ]<br>

  %2 = add i32 %n.tr2, -1<br>

  %3 = tail call i32 @_Z9fibj(i32 %2)<br>

  %4 = add i32 %n.tr2, -2<br>

  %5 = add i32 %3, %accumulator.tr1<br>

  %6 = icmp ult i32 %4, 3<br>

  br i1 %6, label %tailrecurse._crit_edge, label %tailrecurse<br>

<br>

tailrecurse._crit_edge:                           ; preds = %tailrecurse, %0<br>

  %accumulator.tr.lcssa = phi i32 [ 1, %0 ], [ %5, %tailrecurse ]<br>

  ret i32 %accumulator.tr.lcssa<br>

}<br>

<br>

Now, note that gcc 4.8 compiles well both the examples, and the<br>

resulting loop gets also unrolled a dozen times.<br>

<br>

I’m compiling with -O3 with both compilers.<br>

<br>

So what is stopping the tail call elimination pass to doing the same<br>

thing?<br>

<br>

Note that I don’t write this email because I need a performing<br>

fibonacci function… What I’m trying to do is to track down the<br>

reason why fortran code is performing better than C on the<br>

source code used as benchmark in the main page of the<br>

Julia language in [1]. I wasn’t expecting this huge difference<br>

between fortran and C code on an example as simple as this, so I<br>

tried to investigate, even if I don’t know anything about fortran.<br>

Their source code use int instead of unsigned as in my example.<br>

I don’t know quite why, but this slows down everything both in gcc<br>

and clang, because of something related to how the two different<br>

languages handle overflow in integer arithmetic, I think. When I use<br>

unsigned int, gcc performance matches gfortran’s, and the generated<br>

code is nearly identical, while clang stays way behind because of the<br>

missing tail call elimination.<br>

<br>

Note that even with the modified example where the tail call does<br>

get eliminated, a simple benchmark shows that clang’s code<br>

is still 2-3x times slower than gcc's, I suppose because of the missing<br>

unrolling.<br>

<br>

Note that in that benchmark, Julia seems winning too, and they use<br>

LLVM as the backend so something’s missing...<br>

<br>

Can you help me understand what’s happening?<br>

<br>

Thank you,<br>

Nicola<br>

<br>

[1] <a href="http://julialang.org" target="_blank">http://julialang.org</a><br>

<br>

<br>

<br>

<br>

<br>

<br>

<br>

_______________________________________________<br>

LLVM Developers mailing list<br>

<a href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a>         <a href="http://llvm.cs.uiuc.edu" target="_blank">http://llvm.cs.uiuc.edu</a><br>

<a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br>

</blockquote></div><br></div>