[llvm-dev] [RFC] Adding CPS call support

Kavon Farvardin via llvm-dev llvm-dev at lists.llvm.org
Wed Apr 19 09:49:14 PDT 2017


To best address your concerns Eli, it's helpful for me to explain things in the context of a short example of the transformation I have in mind. The example I've created can be found here:

https://gist.github.com/kavon/b80db874cf168bd0625cdd8ef7464f56

I'll note that cc18 is just a custom calling convention that passes struct fields in particular registers.

Okay, so now onto your points:

> If LLVM can prove through interprocedural optimization that a function always returns the value of an argument, it will replace uses of the return value of the call with uses of the argument.

Every function return in CPS looks exactly as written in lines 19-22 of 1_cps_call.ll in the example link. Unless the interprocedural analysis can reason about a tail call to a block address loaded from memory, the replacement you describe will not occur.

Even if %retAddr1 on line 21 were instead, say, the known function @returnFunc, the replacement still shouldn't occur. The analysis would inspect 'ret' instructions to find out what is eventually returned by all of these tail calls, but the chain of tail calls is endless: we _never_ execute a 'ret' instruction. The program terminates when we effectively tail call the _exit function.
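To make the shape concrete, here is a rough sketch of what such a CPS-style "return" looks like (the function and value names are hypothetical; the real code is in 1_cps_call.ll from the gist above):

```llvm
; Hypothetical sketch of a CPS "return". The continuation was pushed
; onto the GHC stack by our caller; we load it and tail call it
; instead of ever executing a 'ret'.
define ghccc void @someCont(i64* %ghcSP, i64 %val) {
entry:
  ; load the return address (a block/function address) from the GHC stack
  %slot = getelementptr inbounds i64, i64* %ghcSP, i64 0
  %retAddrBits = load i64, i64* %slot
  %retAddr1 = inttoptr i64 %retAddrBits to void (i64*, i64)*
  ; "return" by tail calling the continuation; control never comes back
  tail call ghccc void %retAddr1(i64* %ghcSP, i64 %val)
  ret void  ; required by the verifier, but never executed
}
```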


> If your code uses a large integer constant (like 0x123456789) multiple times in the same function, LLVM will load it into a register and reuse it across calls.  Same for floating-point constants.

This seems reasonable, but that choice is made by the register allocator, right?

The proposed lowering during isel is in the file 2_isel_idea.ll in the example link. The important part is that the 'cps' call is non-tail in the IR, to preserve the control flow for IR passes, but it becomes a tail call when we initially build the SelectionDAG.

Thus, the register allocator should not have a chance to keep a constant in a register... BB#1 doesn't even have a predecessor! Instead, it now has a fixed register convention (the physical register constraints from the cc18 convention), and its block address was taken, so the block won't be deleted.
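In other words (a rough sketch with made-up names; the precise lowering is in 2_isel_idea.ll), the single non-tail 'cps' call that the IR passes see would effectively be expanded during isel as if the function had been written:

```llvm
; Hypothetical sketch; see 2_isel_idea.ll for the real lowering. The IR
; passes see one non-tail 'cps' call; isel lowers it as if it were:
define cc18 void @foo(i64* %ghcSP, i64 %x) {
entry:
  ; write the return point's block address into the GHC stack frame...
  store i64 ptrtoint (i8* blockaddress(@foo, %retpt) to i64), i64* %ghcSP
  ; ...and make the call a genuine tail call
  tail call cc18 void @bar(i64* %ghcSP, i64 %x)
  unreachable

retpt:                                ; no predecessors; address taken
  ; the values "returned" by @bar arrive here in the fixed physical
  ; registers dictated by the cc18 convention
  ret void
}

declare cc18 void @bar(i64*, i64)
```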

I have not modified the isel code before, but I imagine this lowering is not too difficult to achieve?


> And even if you could force LLVM not to keep any values live across your calls, there would be no point: LLVM optimizations wouldn't be able to tell which values are equivalent before and after the call. The end result is exactly equivalent to generating multiple functions.


The point of this proposal is that we do not have to split the whole program into hundreds of tiny functions. Here's an example of how ugly that transformation is:

By the time one of the modules of a Haskell benchmark program reaches the backend, it contains 184 functions. To generate LLVM IR as things stand now, we had to break the program into 715 functions using a complicated analysis and transformation.

In an attempt to piece the program back together, LLVM's inliner inlines 666 of those 715 functions, but only 51 functions are actually deleted after inlining (their addresses are taken, so most cannot be deleted). On average, program size increases by 25% (geomean) when using the LLVM inliner, yielding a 4.5% reduction in program run-time.

Thus, in terms of improving the code generated by LLVM, I think the proposal will be profitable. As I mentioned in another reply, instruction cache locality is an obvious advantage of the proposal. I don't have a prototype written yet, so I can't tell you what else improves.
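For a sense of what that splitting entails, here is a minimal sketch (hypothetical names) of how a single non-tail call forces a function to be broken in two today:

```llvm
; Hypothetical sketch of the splitting GHC must do today: one function
; with a non-tail call is broken into two LLVM functions, with the
; second one's address stored as the return continuation.
define ghccc void @f(i64* %sp, i64 %x) {
entry:
  ; save the continuation's address on the GHC stack...
  store i64 ptrtoint (void (i64*, i64)* @f_ret to i64), i64* %sp
  ; ...then tail call the callee; @f's frame is gone after this
  tail call ghccc void @g(i64* %sp, i64 %x)
  ret void
}

; the rest of the original function; its address is taken above, so the
; inliner can rarely delete it even after inlining its body elsewhere
define ghccc void @f_ret(i64* %sp, i64 %y) {
entry:
  %z = add i64 %y, 1
  tail call ghccc void @h(i64* %sp, i64 %z)
  ret void
}

declare ghccc void @g(i64*, i64)
declare ghccc void @h(i64*, i64)
```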

~kavon


> On Apr 18, 2017, at 6:59 PM, Friedman, Eli <efriedma at codeaurora.org> wrote:
> 
> On 4/18/2017 3:47 AM, Kavon Farvardin wrote:
>>> Most architectures have a call instruction which does not push anything onto the stack; e.g. on ARM, the "BL" instruction saves the return address in LR.  On other architectures, you can emulate this (for example, you could lower an IR "call" to LEA+JMP on x86-64).
>> This is similar to what I was originally thinking, since the end goal of all of this is the following machine code for a call (I'll use an x86-64 example):
>> 
>>     leaq  _retpt(%rip), %scratchReg
>>     movq  %scratchReg, (%ghcSPReg)
>>     jmp   _bar
>> 
>> But, if we want to end up with the above machine code using a custom lowering of just the following IR instruction,
>> 
>>     %retVals = cps call ghccc @bar (... args ...)
>> 
>> _without_ explicitly taking the block address and writing it to the GHC stack pointer beforehand, there are some things to consider:
>> 
>> 1. How do we prevent IR optimizations from eliminating the %retpt block?
>> 
>>   If we do not explicitly take the address of the %retpt block, a pass such as simplifycfg has no reason to preserve the block, and may merge it with its only predecessor.
> If the return address is implicit, it's just the instruction immediately after the call; it doesn't need to be explicitly represented in the CFG.
> 
> If you really do need to represent the address explicitly somehow, you could use something like an "invoke", which has branch destinations integrated into the call.
>>> The return address for the current function is fundamentally live across any non-tail call.
>> I'm not quite sure what you mean by this.
>> 
>> While there may be a return address handed to the main function (and thus passed to every other function) it's either unused or not known to LLVM.
> Any function which has a "ret" IR instruction must save the address to return to somewhere.  I guess you could generate code without any "ret" instructions (tail-call with call+unreachable), but that's awkward because you can't return to the non-CPS code which called into your code.
>>> And LLVM will hoist other operations across non-tail calls, and in the process introduce values which are live across calls.  You need to save those values somewhere.  The key question is where.  Your proposal tries to address that by explicitly saving/restoring the return address onto the GHC stack, but you're working at the wrong level; you can't get a complete list of which values are live across calls until register allocation.
>> The part I had omitted from the proposal is the initialization of the rest of the GHC stack frame, which is laid out in GHC by finding all of the values that are live across the call before CPS conversion.
>> 
>> While the full list of values preserved by LLVM is not known until register allocation, there are some properties of the LLVM IR we will initially generate that are important to consider:
>> 
>> 1. There are no IR values live across the non-tail 'cps' IR call, since they are all passed to the callee.
>> 2. All IR values used after the 'cps' IR call come from the struct returned by that call.
>> 
>> Thus, it should not be possible to hoist instructions above this particular non-tail call, as they are all derived from the values returned in the struct. Any register spills allocated to the LLVM stack before the non-tail call are dead, as they have been copied to the GHC stack frame if they were live.
>> 
>> If you have a particular example in mind that would be great.
> 
> If your code uses a large integer constant (like 0x123456789) multiple times in the same function, LLVM will load it into a register and reuse it across calls.  Same for floating-point constants.
> 
> If you reference a global multiple times, LLVM will load the address of that global into a register, and reuse it.  Same for the address of a function (if you're not directly calling it).  And if LLVM can reason about the value of a global (e.g. it's a constant), it could save that value into a register.
> 
> If LLVM can prove through interprocedural optimization that a function always returns the value of an argument, it will replace uses of the return value of the call with uses of the argument.
> 
> And even if you could force LLVM not to keep any values live across your calls, there would be no point: LLVM optimizations wouldn't be able to tell which values are equivalent before and after the call. The end result is exactly equivalent to generating multiple functions.
> 
> -----
> 
> A potential alternative to what you're proposing here is to perform the transform which splits a function into multiple functions on LLVM IR, as opposed to running it before generating IR.  You might want to look at the implementation of C++ coroutines (http://llvm.org/docs/Coroutines.html) for inspiration.
> 
> -Eli
> 
> -- 
> Employee of Qualcomm Innovation Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
> 


