[cfe-dev] [llvm-dev] [RFC] ASM Goto With Output Constraints

Mon Jul 1 22:42:46 PDT 2019

On Mon, Jul 1, 2019 at 6:25 PM Finkel, Hal J. <hfinkel at anl.gov> wrote:

> On 7/1/19 1:38 PM, Bill Wendling via llvm-dev wrote:
>
> On Fri, Jun 28, 2019 at 3:35 PM James Y Knight <jyknight at google.com>
> wrote:
>
>> On Fri, Jun 28, 2019 at 5:53 PM Bill Wendling <isanbard at gmail.com> wrote:
>>
>>> On Fri, Jun 28, 2019 at 1:48 PM James Y Knight <jyknight at google.com>
>>> wrote:
>>>
>>>> On Fri, Jun 28, 2019 at 3:00 PM Bill Wendling <isanbard at gmail.com>
>>>> wrote:
>>>>
>>>>> On Thu, Jun 27, 2019 at 1:44 PM Bill Wendling <isanbard at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> On Thu, Jun 27, 2019 at 1:29 PM James Y Knight <jyknight at google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I think this is fine, except that it stops at the point where things
>>>>>>> actually start to get interesting and tricky.
>>>>>>>
>>>>>>> How will you actually handle the flow of values from the callbr into
>>>>>>> the error blocks? A callbr can specify requirements on where its outputs
>>>>>>> live. So, what if two callbr, in different branches of code, specify
>>>>>>> _different_ constraints for the same output, and list the same block as a
>>>>>>> possible error successor? How can the resulting phi be codegened?
>>>>>>>
>>>>>>> This is where I fall back on the statement about how "the programmer
>>>>>> knows what they're doing". Perhaps I'm being too cavalier here? My concern,
>>>>>> if you want to call it that, is that we don't be too restrictive on the new
>>>>>> behavior. For example, the "asm goto" may set a register to an error value
>>>>>> (made up on the spot; may not be a common use). But, if there's no real
>>>>>> reason to have the value be valid on the abnormal path, then sure we can
>>>>>> declare that it's not valid on the abnormal path.
>>>>>>
>>>>>> I think I should explain my "programmer knows what they're doing"
>>>>> statement a bit better. I'm specifically referring to inline asm here. The
>>>>> more general "callbr" case may still need to be considered (see Reid's
>>>>> reply).
>>>>>
>>>>> When a programmer uses inline asm, they're implicitly telling the
>>>>> compiler that they *do* know what they're doing  (I know this is common
>>>>> knowledge, but I wanted to reiterate it.). In particular, either they need
>>>>> to reference an instruction not readily available from the compiler (e.g.
>>>>> "cpuid") or the compiler isn't able to give them the needed performance in
>>>>> a critical section. I'm extending this sentiment to callbr with output
>>>>> constraints. Let's take your example below and write it as "normal" asm
>>>>> statements one on each branch of an if-then-else (please ignore any syntax
>>>>> errors):
>>>>>
>>>>> if:
>>>>>   br i1 %cmp, label %true, label %false
>>>>>
>>>>> true:
>>>>>   %0 = call { i32, i32 } asm sideeffect "poetry $0, $1", "={r8},={r9}"
>>>>> ()
>>>>>   br label %end
>>>>>
>>>>> false:
>>>>>   %1 = call { i32, i32 } asm sideeffect "poetry2 $0, $1",
>>>>> "={r10},={r11}" ()
>>>>>   br label %end
>>>>>
>>>>> end:
>>>>>   %vals = phi { i32, i32 } [ %0, %true ], [ %1, %false ]
>>>>>
>>>>> How is this handled in codegen? Is it an error or does the back-end
>>>>> handle it? Whatever's done today for "normal" inline asm is what I *think*
>>>>> should be the behavior for the inline asm callbr variant. If this doesn't
>>>>> seem sensible (and I realize that I may be thinking of an "in a perfect
>>>>> world" scenario), then we'll need to come up with a more sensible solution
>>>>> which may be to disallow the values on the error block until we can think
>>>>> of a better way to handle them.
>>>>>
>>>>
>>>> This example is no problem, because instructions can be emitted between
>>>> what's emitted by "call asm" and the end of the block (be it a fallthrough,
>>>> or a jump instruction. What gets emitted there is a move of the output
>>>> register to another location -- either a register or to the stack. And
>>>> therefore at the beginning of the "end" block, "%vals" is always in a
>>>> consistent location, no matter how you got to that block.
>>>>
>>>> But in the callbr case, there is not a location at which those moves
>>>> can be emitted, after the callbr, before the jump to "error".
>>>>
>>>
>>> I see what you mean. Let's say we create a pseudo-instruction (similar
>>> to landingpad, et al) that needs to be lowered by the backend in a
>>> reasonable manner. The EH stuff has an external process/library that
>>> performs the actual unwinding and which sets the values accordingly. We
>>> won't have this.
>>>
>>
>>
>>
>>> What we could do instead is split the edges and insert the
>>> copy-to-<where ever> statements there.
>>>
>>
>> Exactly -- except that doing that is potentially an invalid transform,
>> because the address is being used as a value, not simply a jump target. The
>> label list is just a list of _possible_ jump targets, changing those won't
>> actually affect anything. You'd instead need to change the blockaddress
>> constant, but in the general case you don't know where that address came
>> from -- (and it may therefore be required that you have the same address
>> for two separate callbr instructions).
>>
>> I guess this kinda touches on some of the same issues as in the other
>> discussion about the handling of the blockaddress in callbr and inlining,
>> etc...
>>
>> I wonder if we could put some validity restrictions on the IR structure,
>> rather than trying to fix things up after the fact by attempting to split
>> blocks. E.g., we could state that it's invalid to have a phi which uses the
>> value defined by a callbr, if it's conditioned on that same block as
>> predecessor.  That is: it's valid to use _other_ values defined in the
>> block ending in callbr, because they can be moved prior to the callbr. It's
>> also valid to use the value defined by the callbr in a phi conditioned on
>> some other intermediate block as predecessor, because then any required
>> moves can happen in the intermediate block.
>>
>> I believe such an IR restriction should be sufficient to make it possible
>> to emit valid code from the IR in all cases, but I'm a bit afraid of how
>> badly adding such odd edge-cases might screw up the rest of the compiler
>> and optimizer.
>>
>
> I want to revisit this. Here are the situations we're confronted with:
>
>    1. The goto-target can be jumped to by 1 callbr instruction,
>    2. The goto-target can be jumped to by N callbr instructions, which
>    don't need a PHI node, and
>    3. The goto-target can be jumped to by N callbr instructions, which
>    *do* need a PHI node.
>
> I'm going to plug the instruction I created out of thin air a few emails
> back, but better explain (I'm using an instruction instead of an intrinsic
> because we want that instruction to be right after all non-PHI instructions
> in the goto-target block). I'm _not_ suggesting we need this instruction.
> It's just for demonstration purposes.
>
>
> Situations (1) and (2) don't encounter an problem. Any value used in the
> goto-target can be handled by inserting the code to extract that value in
> the goto-target block:
>
>
> bb1:
>
>   ...
>
>   %x.bb1 = callbr i32 asm sideeffect "...", "=r,X"(i32 *%x*, i8*
> blockaddress(@bar, %goto.target))
>
>           to label %fallthrough1 [label %goto.target]
>
>
> fallthrough1:
>
>   ...
>
>
> bb2:
>
>   ...
>
>   %y.bb2 = callbr i32 asm sideeffect "...", "=r,X"(i32 *%y*, i8*
> blockaddress(@bar, %goto.target))
>
>           to label %fallthrough2 [label %goto.target]
>
>
> fallthrough2:
>
>   ...
>
>
> goto.target:
>
>   %x.goto = <extract value from %x.bb1>
>
>   %y.goto = <extract value from %y.bb2>
>
>   ... <uses of %x.goto and %y.goto> ...
>
>
> This leaves situation (3), which is far more complex as we've seen. To
> reiterate, the issue here is that we need to extract the values returned by
> callbr. This would typically be done by using a PHI node, but llvm may want
> to split critical edges or push the calculation back to the predecessor
> block, which won't work with the callbr asm, because it could branch out of
> the asm at any point thus skipping the extraction. So we can't use PHI
> nodes for these values. There are three classes of solutions to this:
>
>    1. Don't allow the values to be used in goto-targets, or
>    2. Allow them, but with significant restrictions, or
>    3. Allow them without using PHI nodes.
>
> Each has its benefits and drawbacks. As I've stated before, I think that
> (1) is too restrictive, but if we can't come up with a good solution, it
> may be our only option. Solution (2) could be a good compromise. However, I
> want to propose a potential solution to (3).
>
> The core of my proposal is to replace the PHI node with code that will
> replicate its behavior without code lowering trying to modify the CFG (at
> least not in ways that may invalidate the asm). Here is example code:
>
>
> bb1:
>
>   store i8* blockaddress(@bar, %bb1), i8** %src
>
>   %x.bb1 = callbr i32 asm sideeffect "...", "=r,X"(i32 %x, i8*
> blockaddress(@bar, %goto.target))
>
>           to label %fallthrough1 [label %goto.target]
>
>
> fallthrough1:
>
>   ...
>
>
> bb2:
>
>   store i8* blockaddress(@bar, %bb2), i8** %src
>
>   %x.bb2 = callbr i32 asm sideeffect "...", "=r,X"(i32 %x, i8*
> blockaddress(@bar, %goto.target))
>
>           to label %fallthrough2 [label %goto.target]
>
>
> fallthrough2:
>
>   ...
>
>
> goto.target:
>
>   %x1 = indirectval i8** %src, i32 [%x.bb1, %bb1], [%x.bb2, %bb2]
>
>   <extract values from %x1>
>
>   ...
>
>
> This can be lowered to this:
>
> bb1:
>
>   store i8* blockaddress(@bar, %bb1), i8** %src
>
>   %x.bb1 = callbr i32 asm sideeffect "...", "=r,X"(i32 %x, i8*
> blockaddress(@bar, %error))
>
>           to label %fallthrough1 [label %goto.target]
>
>
> fallthrough1:
>
>   ...
>
>
> bb2:
>
>   store i8* blockaddress(@bar, %bb2), i8** %src
>
>   %x.bb2 = callbr i32 asm sideeffect "...", "=r,X"(i32 %x, i8*
> blockaddress(@bar, %error))
>
>           to label %fallthrough2 [label %goto.target]
>
>
> fallthrough2:
>
>   ...
>
>
> goto.target:
>
>   %src1 = load i8**, i8*** @src
>
>   %src.bb = load i8*, i8** %src1
>
>   switch i64 %src.bb, label %goto.target.body [ ; or if-then-else blocks
>
>       i64 ptrtoint i8* blockaddress(@bar, %bb1) to i64, label
> %goto.target.bb1
>
>       i64 ptrtoint i8* blockaddress(@bar, %bb2) to i64, label
> %goto.target.bb2
>
>   ]
>
>
> goto.target.bb1:
>
>   %x1 = <extract value from %x.bb1>
>
>   br label %goto.target.body
>
>
> goto.target.bb2:
>
>   %x2 = <extract value from %x.bb2>
>
>   br label %goto.target.body
>
>
> goto.target.body:
>
>   %x.merge = phi i64 [%x1, label %goto.target.bb1], [%x1, label
> %goto.target.bb2]
>
>   ...
>
>
> With this, we don't change any values used by the callbr instructions, and
> the return values are extracted correctly. This has the unsavory issue of
> using stores and loads, but this may be the price we need to pay.
>
>
> Thoughts?
>
>
> The non-fallthrough blocks can have other predecessors, right? If so, I
> imagine that you need to also do the following:
>
Good point!

>  1. Store zero (or -1 or some other distinguishable value) into the %src
> alloca in the entry block.
>
> It should be a null value, as that's not a valid block address. Then
again, if we use the "switch" instruction the default branch should
suffice. We will probably want to reset the value after the callbr values
are extracted.

>  2. Store this same distinguishable value into the %src alloca after the
> "value extraction" is performed.
>
>  3. Include this distinguishable value in the switch statement.
>
> While Clang does not normally insert phi nodes, in this case perhaps the
> problem is self-contained enough for this to be reasonable. However, I'm
> not sure that this is worthwhile. This is a performance feature generally,
> and if the user really wants to use these outputs, are they going to want
> the extra expense of the branches and jump tables and all of the rest of
> it? Maybe in the common case the extraction blocks will be trivial and get
> merged, but the default case will still be problematic?
>
There are ways to avoid the branches, et al, mostly by writing the code in
the form of situations (1) and (2) by using lead-in blocks:

true_branch:
  goto body;

false_branch:
  goto body;

body:

  <use of common values here>

-bw
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20190701/3d7a4d27/attachment.html>