[llvm-dev] [RFC] Adding CPS call support

Tue Apr 18 13:08:03 PDT 2017

Before I try to respond to all of these points, let's consider limiting the scope of the proposed changes:

Instead of adding a 'cps' call marker, what if I were to add a custom lowering (during isel) for calls marked with the 'ghccc' calling convention? There is already language specific lowering in isel for Swift, so I imagine this would more acceptable?

~kavon

> On Apr 18, 2017, at 7:27 PM, Philip Reames <listmail at philipreames.com> wrote:
> 
> 
> 
> On 04/17/2017 04:04 PM, Kavon Farvardin via llvm-dev wrote:
>>> Is there a reason you can't use the algorithm from the paper "A Correspondence between Continuation Passing Style and Static Single Assignment Form" to convert your IR to LLVM's SSA IR?
>> 
>> Yes, there are a few reasons.
>> 
>> Undoing the CPS transformation earlier in the pipeline would mean that we are using LLVM's built-in stack. The special layout and usage of the stack in GHC is achieved through CPS, so it is baked the compiler and garbage-collected runtime system.
> Can you give a bit more detail here?  LLVM does provide support for describing GC frame maps.
> 
> p.s. You're going to have to justify the design of the runtime a bit here. Extending the IR to workaround a buggy or poorly structured runtime is not going to be sufficient justification.  *Why* does the runtime need the specific runtime stack structure used?  What alternatives exist and why should those be rejected?
>> 
>> ~kavon
>> 
>>> On Apr 17, 2017, at 8:56 PM, Manuel Jacob <me at manueljacob.de> wrote:
>>> 
>>> Hi Kavon,
>>> 
>>> Is there a reason you can't use the algorithm from the paper "A Correspondence between Continuation Passing Style and Static Single Assignment Form" to convert your IR to LLVM's SSA IR?
>>> 
>>> -Manuel
>>> 
>>> On 2017-04-17 17:30, Kavon Farvardin via llvm-dev wrote:
>>>> Summary
>>>> =======
>>>> There is a need for dedicated continuation-passing style (CPS) calls in LLVM to
>>>> support functional languages. Herein I describe the problem and propose a
>>>> solution. Feedback and/or tips are greatly appreciated, as our goal is to
>>>> implement these changes so they can be merged into LLVM trunk.
>>>> Problem
>>>> =======
>>>> Implementations of functional languages like Haskell and ML (e.g., GHC and
>>>> Manticore) use a continuation-passing style (CPS) transformation in order to
>>>> manage the call stack explicitly. This is done prior to generating LLVM IR, so
>>>> the implicit call stack within LLVM is not used for call and return.
>>>> When making a non-tail call while in CPS, we initialize a stack frame for the
>>>> return through our own stack pointer, and then pass that stack pointer to the
>>>> callee when we jump to it. It is here when we run into a problem in LLVM.
>>>> Consider the following CPS call to @bar and how it will return:
>>>> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>>>> define void @foo (i8** %sp, ...) {
>>>> someBlk:
>>>>    ; ...
>>>>    ; finish stack frame by writing return address
>>>>  %retAddr = blockaddress(@foo, %retpt)
>>>>  store i8* %retAddr, i8** %sp
>>>>    ; jump to @bar
>>>>  tail call void @bar(i8** %sp, ...)
>>>> retpt: 				; <- how can @bar "call" %retpt?
>>>>   %sp2 = ???
>>>>   %val = ???
>>>>   ; ...
>>>> }
>>>> define void @bar (i8** %sp, ...) {
>>>> 	  ; perform a return
>>>> 	%retAddr0 = load i8*, i8** %sp
>>>> 	%retAddr1 = bitcast i8* %retAddr0 to void (i8**, i64)*
>>>> 	%val = bitcast i64 1 to i64
>>>> 	  ; jump back to %retpt in @foo, passing %sp and %val
>>>> 	tail call void %retAddr1(i8** %sp, i64 %val)
>>>> }
>>>> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>>>> There is currently no way to jump back to %retpt from another function, as block
>>>> addresses have restricted usage in LLVM [1]. Our main difficulty is that we
>>>> cannot jump to a block address without knowing its calling convention, i.e., the
>>>> particular machine registers (or memory locations) that the block expects
>>>> incoming values to be passed in.
>>>> The workaround we have been using in GHC for LLVM is to break apart every
>>>> function, placing the code for the continuation of each call into a new
>>>> function. We do this only so that we can store a function pointer instead of a
>>>> block address to our stack. This particularly gross transformation inhibits
>>>> optimizations in both GHC and LLVM, and we would like to remove the need for it.
>>>> Proposal
>>>> ========
>>>> I believe the lowest-impact method of fixing this problem with LLVM is the
>>>> following:
>>>> First, we add a special 'cps' call instruction marker to be used on non-tail
>>>> calls. Then, we use a specialized calling convention for these non-tail calls,
>>>> which fix the returned values to specific locations in the machine code [2].
>>>> To help illustrate what's going on, let's rewrite the above example using the
>>>> proposed 'cps' call:
>>>> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>>>> define { ... } @foo (i8** %sp, ...) {
>>>> someBlk:
>>>>    ; ...
>>>>    ; finish stack frame by writing return address
>>>>  %retAddr = blockaddress(@foo, %retpt)
>>>>  store i8* %retAddr, i8** %sp
>>>>    ; jump to @bar
>>>>  %retVals = cps call ghccc {i8**, i64} @bar (i8** %sp, ...)
>>>>  br label %retpt
>>>> retpt:
>>>>   %sp2 = extractvalue {i8**, i64} %retVals, 0
>>>>   %val = extractvalue {i8**, i64} %retVals, 1
>>>>   ; ...
>>>> }
>>>> define {i8**, i64} @bar (i8** %sp, ...) {
>>>> 	  ; perform a return
>>>> 	%retAddr0 = load i8*, i8** %sp
>>>> 	%retAddr1 = bitcast i8* %retAddr0 to {i8**, i64} (i8**, i64)*
>>>> 	%val = bitcast i64 1 to i64
>>>> 	  ; jump back to %retpt in @foo, passing %sp and %val
>>>> 	tail call ghccc void %retAddr1(i8** %sp, i64 %val)
>>>> 	unreachable   ; <- ideally this would be our terminator,
>>>> 	              ; but 'unreachable' breaks TCO, so we will
>>>> 	              ; emit a ret of the struct "returned" by the call.
>>>> }
>>>> ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
>>>> The important point here is that the 'cps' marked call will lower to a jump. The
>>>> 'cps' call marker means that the callee knows how to return using the arguments
>>>> explicitly passed to it, i.e., the stack pointer %sp. The callee cannot use a
>>>> 'ret' instruction if it is 'cps' called.
>>>> Either before or during 'cps' call lowering, any instructions following the
>>>> 'cps' call to @bar are sunk into the the block %retpt, and the unconditional
>>>> branch to %retpt is deleted/ignored. We include that branch to preserve
>>>> control-flow information for LLVM IR optimization passes.
>>>> The 'extractvalue' instructions are what ensure the calling convention of
>>>> %retpt, since the fields of the struct %retVals are returned in physical
>>>> registers dictated by the (modified) ghccc convention. Those same physical
>>>> registers are used by the ghccc tail call in @bar when it jumps back to %retpt.
>>>> So, the call & return convention of ghccc ensures that everything matches up.
>>>> Interaction with LLVM
>>>> =====================
>>>> (1) Caller-saved Values
>>>> One may wonder how this would work if there are caller-saved values of the 'cps'
>>>> call. But, in our first example, which closely matches what CPS code looks like,
>>>> the call to @bar was in tail position. Thus, in the second example, there are no
>>>> caller-saved values for the 'cps' call to @bar, as all live values were passed
>>>> as arguments in the call.
>>>> This caller-saved part is a bit subtle. It works fine in my experience [2] when
>>>> @bar is a function not visible to LLVM. My impression is that even if @bar is
>>>> visible to LLVM, there is still no issue, but if you can think of any corner
>>>> cases that would be great!
>>>> (2) Inlining
>>>> My gut feeling is that we cannot inline a 'cps' marked call-site without more
>>>> effort. This is because we might end up with something odd like this once the
>>>> dust settles:
>>>>    %retAddr = blockaddress(@foo, %retpt)
>>>>    %retAddr1 = bitcast i8* %retAddr to {i8**, i64} (i8**, i64)*
>>>>    tail call ghccc %retAddr1 ( %sp, ... )
>>>> We could add a pass that turns the above sequence into just an unconditional
>>>> branch to %retpt, using a phi-node to replace each 'extractvalue' instruction in
>>>> that block.
>>>> I'm not sure whether inlining in LLVM is important for us yet, as we tend to
>>>> inline quite a lot before generating LLVM IR. I don't think this additional fix-
>>>> up pass would be too difficult to implement if it's desired.
>>>> Implementation Sketch and Conclusion
>>>> ====================================
>>>> My current plan is to add this special lowering of 'cps' calls during the
>>>> translation from LLVM IR to the SelectionDAG. I welcome any suggestions or tips
>>>> on the best way to approach this. An important goal for us is to merge this into
>>>> trunk since we do not want to bundle a special version of LLVM with GHC.
>>>> Please let me know soon if you have any objections to this feature.
>>>> Thanks for reading,
>>>> Kavon
>>>> References
>>>> ==========
>>>> [1] http://llvm.org/docs/LangRef.html#blockaddress
>>>> [2] http://kavon.farvard.in/papers/ml16-cwc-llvm.pdf
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev