<div dir="ltr">Sorry for jumping late into this thread, I'm quite late in my mail reading after CGO. Here's my experience with this topic (I presented this on a binary translation workshop of CGO):<div><br></div><div>I sort of stumbled upon this transformation by accident in the last year. I do research with binary translation and LLVM -- I have a prototype that reads MIPS binary code and mimics the behavior of a MIPS guest machine with LLVM IR code, and then use an LLVM backend to emulate the guest code on other host machines (x86 and ARM).</div><div><br></div><div>I guess you already see where this is going after all this state machine discussion. One day I decided to experiment with MIPS function calls being modelled in the IR as jumps to basic blocks instead of calls to LLVM functions. In this way, the whole MIPS program would be translated into a single LLVM function, removing any recursion, if present.</div><div><br></div><div>After this, two of my very simple benchmarks (Fibonacci and Ackermann, two recursive ones) began to show better performance *after* being compiled to MIPS and then retranslated to x86, which didn't make sense, because binary translation always imposes some overhead that comes from emulating the behavior of another machine. Upon further investigation, the culprit was that the recursive functions were transformed into a loop in a way that is similar to what you described: the stack you refer to is my MIPS guest stack that I need to emulate on top of x86. Nothing special, I'm just emulating a processor in software, so, upon hitting a return instruction, I check the return address to decide to which basic block to jump to (whether it's time to exit the loop or not), similar to your control.</div><div><br></div><div>The performance gain was so expressive that it was mitigating the binary translator overhead and yet beating native x86 performance by 2x in the first implementation (LLVM 3.3 running in an old Intel Core2quad). Now in trunk and running on a recent processor, it is a much more modest speedup of 20%. But I only observed this for simple benchmarks, such as Fibonacci and Ackermann. If you didn't implement this transformation yet and you are curious about its impact, I can test this framework with other simple workloads, as long as it doesn't break my prototype :-) It's not ideal though, since it has a considerable overhead from emulating a guest MIPS platform.</div><div><br></div><div>However, when translating to ARM, I could not observe the same gains seen on x86 because the way I performed the binary translation imposed a higher overhead on ARM (Cortex A9) platforms than on x86 (Haswell) ones.</div><div><br></div><div>Here are the results for LLVM trunk of Jan 19 (my last rebase). I'm using the full -O3 opt pipeline after I translate the code:</div><div><br></div><div>Fibonacci: </div><div>Native x86: 1.56s</div><div>Translated MIPS to x86: 1.26s (-20.75%)</div><div><br></div><div>Native ARM: 6.42s</div><div>Translated MIPS to ARM: 6.49s (overhead of 9%)</div><div><br></div><div>Ackermann:</div><div>Native x86: 0.94s</div><div>Translated MIPS to x86: 0.57s (-38.30%)</div><div><br></div><div>Regards,</div><div>Rafael Auler</div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Feb 6, 2015 at 12:34 AM, Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">----- Original Message -----<br>
> From: "James Molloy" <<a href="mailto:james@jamesmolloy.co.uk">james@jamesmolloy.co.uk</a>><br>
</span><span class="">> To: "Hal Finkel" <<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>><br>
> Cc: <a href="mailto:llvmdev@cs.uiuc.edu">llvmdev@cs.uiuc.edu</a>, "Chandler Carruth" <<a href="mailto:chandlerc@google.com">chandlerc@google.com</a>>, "pablo barrio" <<a href="mailto:pablo.barrio@arm.com">pablo.barrio@arm.com</a>><br>
> Sent: Thursday, February 5, 2015 9:18:35 AM<br>
> Subject: Re: RFC: Recursive inlining<br>
><br>
><br>
</span><span class="">> Hi Hal,<br>
><br>
> > As we had briefly mentioned on IRC, one way of forming this<br>
> > 'stack', and its associated 'cnt' variable, is to use dynamic<br>
> > stack allocation.<br>
><br>
><br>
><br>
> I hadn't really reached a decision on the mechanics yet. However,<br>
> your suggestion while it can work within the current abilities of<br>
> the IR, has the disadvantage that it is using an extra stack slot<br>
> for the link pointer. I think if we are going to apply this<br>
> optimization, we must make sure we aren't going to blow the stack.<br>
><br>
><br>
> As users may already be just close enough to the max-stack boundary<br>
> that their applications don't crash, I think we should aim as much<br>
> as possible to keep the stack usage no higher than the recursive<br>
> equivalent. Holding the "state" is the equivalent of a return<br>
> address, so adding a stack link pointer would be the equivalent of a<br>
> frame address I suppose... That could be acceptable?<br>
><br>
<br>
</span>I agree. We should aim not to increase the amount of stack space used. This will require some target knowledge (because dynamic stack allocation has some target-dependent overhead because of alignment and other ABI requirements).<br>
<span class="HOEnZb"><font color="#888888"><br>
-Hal<br>
</font></span><span class="im HOEnZb"><br>
><br>
> As I say, I haven't thought about the deep mechanics just yet.<br>
><br>
><br>
> James<br>
><br>
> On Thu Feb 05 2015 at 2:48:48 PM Hal Finkel < <a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a> ><br>
> wrote:<br>
><br>
><br>
> ----- Original Message -----<br>
> > From: "James Molloy" < <a href="mailto:james@jamesmolloy.co.uk">james@jamesmolloy.co.uk</a> ><br>
</span><div class="HOEnZb"><div class="h5">> > To: <a href="mailto:llvmdev@cs.uiuc.edu">llvmdev@cs.uiuc.edu</a> , "Chandler Carruth" < <a href="mailto:chandlerc@google.com">chandlerc@google.com</a><br>
> > >, "Hal Finkel" < <a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a> >, "pablo barrio"<br>
> > < <a href="mailto:pablo.barrio@arm.com">pablo.barrio@arm.com</a> ><br>
> > Sent: Thursday, February 5, 2015 4:36:16 AM<br>
> > Subject: RFC: Recursive inlining<br>
> ><br>
> ><br>
> > Hi,<br>
> ><br>
> ><br>
> > The other day on IRC myself, Chandler, Hal and Pablo discussed<br>
> > recursive inlining. I've pondered a bit since then, and thought I'd<br>
> > get some background on the list with the aim of stimulating<br>
> > discussion.<br>
> ><br>
> ><br>
> > Our goal is to inline recursive functions. We have several<br>
> > workloads<br>
> > with recursive calls, and inlining them to a certain degree can<br>
> > improve performance by up to 30%.<br>
> ><br>
> ><br>
> > Inlining recursive functions is a slightly different problem to<br>
> > inlining non-recursive functions in that the inlining can never<br>
> > terminate, so you need to choose a factor (number of times) to<br>
> > inline. In this respect it is similar to loop unrolling, and<br>
> > Chandler suggested we should model it as such.<br>
> ><br>
> ><br>
> > We originally thought he meant by this that the mechanics of<br>
> > inlining<br>
> > would stay the same, but the heuristics would have to be changed.<br>
> > This is the approach taken by Rugina & Rinard [1], which is almost<br>
> > the sum of the literature on this topic I could find, which<br>
> > surprised me. However, the actual suggestion was to remove the<br>
> > recursion entirely, and replace it with a loop.<br>
> ><br>
> ><br>
> > Before I get into the mechanics of this, I'd like to explain what<br>
> > the<br>
> > possible optimization opportunities are for recursive functions<br>
> > turned into a loop.<br>
> ><br>
> ><br>
> > 1. Unrolling<br>
> > - This leads to bigger basic blocks (if they can be merged) and<br>
> > exposes more pipeline-level parallelism.<br>
> > - CSE between iterations.<br>
> > - Reduces backbranch overhead (generally negligible impact).<br>
> > 2. Commoning / LICM<br>
> > - Recursion doesn't have a vehicle for moving common code out into<br>
> > a<br>
> > dominating block (compare with LICM for loops).<br>
> ><br>
> ><br>
> > OK, first example:<br>
> ><br>
> ><br>
> > void ex1(i) {<br>
> > if (i == 0) return;<br>
> > a[i] = i+1;<br>
> > ex1(i-1);<br>
> > }<br>
> ><br>
> ><br>
> > gets converted to:<br>
> ><br>
> ><br>
> > while (1) {<br>
> > if (i == 0) break;<br>
> > a[i] = i+1;<br>
> > i = i-1;<br>
> > }<br>
> ><br>
> ><br>
> > Now, the above is the simplest case and is pure tail recursion. We<br>
> > can eliminate this currently with the TailRecursionElimination<br>
> > pass.<br>
> > The more complex case is where we have non-tail recusion and live<br>
> > variables.<br>
> ><br>
> ><br>
> > void ex2(i) {<br>
> > if (i == 0) return;<br>
> > j = a[i];<br>
> > ex2(i-1)<br>
> > b[j] = i;<br>
> > }<br>
> ><br>
> ><br>
> > This requires a stack being modelled and could be converted to:<br>
> ><br>
> ><br>
> > cnt = 0;<br>
> > while (1) {<br>
> > if (i == 0) break;<br>
> > j = a[i];<br>
> > stack[cnt++] = j;<br>
> > stack[cnt++] = i;<br>
> > ++cnt;<br>
> > --i;<br>
> > }<br>
> > while (cnt) {<br>
> > i = stack[--cnt];<br>
> > j = stack[--cnt];<br>
> > b[j] = i;<br>
> > }<br>
> ><br>
> ><br>
> > The above is as far as the discussion got on IRC. The following are<br>
> > my half-baked thoughts.<br>
> ><br>
> ><br>
> > The above transform works because the access pattern to the stack<br>
> > is<br>
> > strictly linear. Pushes are followed by pops. But I don't think<br>
> > this<br>
> > is the most interesting case for recursion in real programs. I see<br>
> > (qualitatively, poorly) 3 reasons for recursion in code that should<br>
> > go fast:<br>
> ><br>
> ><br>
> > 1. A fibonnacci function in a microbenchmark or compiler shootout.<br>
> > I<br>
> > think we can safely ignore this usecase...<br>
> > 2. Divide and conquer algorithms. Quicksort, FFT butterflies, etc.<br>
> > 3. Traversing a data structure, where the recursion is simply to<br>
> > get<br>
> > to the leaves which is where all the fun stuff happens.<br>
> ><br>
> ><br>
> > Points 2 and 3 both share a trait that they're dividing/fanning out<br>
> > at each step:<br>
> ><br>
> ><br>
> > void ex3(i) {<br>
> > if (is_base(i)) {<br>
> > f();<br>
> > return;<br>
> > }<br>
> > g(i);<br>
> > ex3(i / 2);<br>
> > ex3(i / 2 + 1);<br>
> > }<br>
> ><br>
> ><br>
> > The access pattern of such a function is not linear. It is a<br>
> > pre-order walk of a (binary, in this case) tree. As such we can't<br>
> > use the two-loop transform above, but would have to form one loop<br>
> > and just manually implement the stack operations:<br>
> ><br>
> ><br>
> > cnt = 0<br>
> > stack[cnt++] = i;<br>
> > while (1) {<br>
> > i = stack[--cnt];<br>
> > if (is_base(i)) {<br>
> > f(); continue;<br>
> > }<br>
> > g(i);<br>
> > stack[cnt++] = i / 2 + 1;<br>
> > stack[cnt++] = i / 2;<br>
> > }<br>
> ><br>
> ><br>
> > OK, it's still a well-formed loop, we can still do useful<br>
> > operations<br>
> > on it like LICM or, if it's profitable, unrolling.<br>
> ><br>
> ><br>
> > Now what about a more complex case, like in the FFT butterfly<br>
> > algorithm:<br>
> ><br>
> ><br>
> > void ex4(i) {<br>
> > if (is_base(i)) {<br>
> > f(i); return;<br>
> > }<br>
> > g(i);<br>
> > ex4(i / 2);<br>
> > ex4(i / 2 + 1);<br>
> > h(i);<br>
> > }<br>
> ><br>
> ><br>
> > Here, we have to do work after the last recursive call. This<br>
> > presents<br>
> > a problem, because we don't have any of the caller's context when<br>
> > dealing with a node. Is a node N the first recursive call? or the<br>
> > second? Call-stack recursion can save two things: the state of<br>
> > variables and the return address which serves as another kind of<br>
> > state. The obvious answer is to turn this into a sort of state<br>
> > machine:<br>
> ><br>
> ><br>
> > cnt = 0<br>
> > stack[cnt++] = i;<br>
> > stack[cnt++] = STATE_1;<br>
> > while (1) {<br>
> > state = stack[--cnt];<br>
> > i = stack[--cnt];<br>
> ><br>
> ><br>
> > if (state == STATE_FINISH) {<br>
> > h(i);<br>
> > continue;<br>
> > }<br>
> ><br>
> ><br>
> > if (is_base(i)) {<br>
> > f(); continue;<br>
> > }<br>
> ><br>
> ><br>
> > g(i);<br>
> > stack[cnt++] = i; // Push ourself again, in the final state this<br>
> > time<br>
> > stack[cnt++] = STATE_FINISH;<br>
> > stack[cnt++] = i / 2 + 1;<br>
> > stack[cnt++] = STATE_1;<br>
> > stack[cnt++] = i / 2;<br>
> > stack[cnt++] = STATE_1;<br>
> > }<br>
><br>
> As we had briefly mentioned on IRC, one way of forming this 'stack',<br>
> and its associated 'cnt' variable, is to use dynamic stack<br>
> allocation. You can do this in general at the IR level (using<br>
> allocas that just happen not to be in the entry block). If each<br>
> stack block holds a pointer to the previous one (as a linked list),<br>
> then you just need to maintain a pointer to the current block, and<br>
> you don't need a separate 'cnt' variable (you might also keep a<br>
> pointer to the next block for allocation reuse). Is that what you<br>
> had in mind?<br>
><br>
> -Hal<br>
><br>
> ><br>
> ><br>
> > This solution can be generalised for if there is code in between<br>
> > the<br>
> > recursive calls (just add more states). It's getting more complex<br>
> > in<br>
> > terms of control flow though - the question is, is it worth it?<br>
> ><br>
> ><br>
> > Unrolling this loop is probably not going to usefully create larger<br>
> > basic blocks due to how complex the control flow is. I suppose we<br>
> > have the opportunity, if we unroll, of CSE and jump threading<br>
> > values<br>
> > through equivalent blocks that may gain us time. However, we can<br>
> > still do LICM which is probably a major improvement, and there are<br>
> > no calls/returns to speculate.<br>
> ><br>
> ><br>
> > This approach would have to be extended if the recursive calls<br>
> > didn't<br>
> > return void (for example it was a recursive reduction), but it's<br>
> > not<br>
> > impossible.<br>
> ><br>
> ><br>
> > Sorry for the long post. What do people think?<br>
> ><br>
> ><br>
> > Cheers,<br>
> ><br>
> ><br>
> > James<br>
> ><br>
> ><br>
> > [1]: <a href="http://link.springer.com/" target="_blank">http://link.springer.com/</a> chapter/10.1007/3-540-45574-4_ 3<br>
><br>
> --<br>
> Hal Finkel<br>
> Assistant Computational Scientist<br>
> Leadership Computing Facility<br>
> Argonne National Laboratory<br>
><br>
<br>
--<br>
Hal Finkel<br>
Assistant Computational Scientist<br>
Leadership Computing Facility<br>
Argonne National Laboratory<br>
</div></div><div class="HOEnZb"><div class="h5">_______________________________________________<br>
LLVM Developers mailing list<br>
<a href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a> <a href="http://llvm.cs.uiuc.edu" target="_blank">http://llvm.cs.uiuc.edu</a><br>
<a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br>
</div></div></blockquote></div><br></div>