[llvm-dev] [RFC] Adding thread group semantics to LangRef (motivated by GPUs)

Sat Feb 9 12:01:50 PST 2019

On 09.02.19 18:06, Connor Abbott wrote:
> On Sat, Feb 9, 2019 at 4:44 PM Jan Sjodin <jan_sjodin at yahoo.com 
> <mailto:jan_sjodin at yahoo.com>> wrote:
> 
>     > The reason I'm looking for solutions that can work without "scanning the 
>     > code" or "spooky action at a distance" is that we should have a solution 
>     > that's easily digestible by folks who are not aware of GPU execution models.
>     >
>     > The fallback model for such folks should be: if your pass doesn't touch 
>     > specially-marked calls (whether that marker is "convergent" or something 
>     > else), then you don't have to worry. Which, as a corollary, means that 
>     > transforms can be conservatively correct by not touching such calls.
>     >
>     > I don't see how to achieve this goal while forcing passes to do specific 
>     > code scans.
> 
>     I'm wondering if using instructions/intrinsics is the right thing, or
>     is it a property of a basic block that control flow converges? Can the
>     intrinsics exist anywhere in a basic block? Is there a difference
>     between instructions in the same BB before and after a convergence
>     instrinsic?
> 
> 
> The way I was thinking, we'd allow the merge instruction anywhere in a 
> basic block. Otherwise, we'd have to allow for cases where there's a 
> trivial edge from a basic block with one sucessor to a block with one 
> predecessor that can't be eliminated. For example, in the single-break 
> case mentioned earlier:
> 
> while (true) {
>      if (...) {
>          ballot(); // should be executed before re-converging.
>          break;
>      }
> }
> ballot(); // should be executed after re-converging
> 
> With a merge instruction, the two ballots could be in the same basic 
> block, with a merge instruction separating them, versus two basic blocks 
> with the second marked as a merge block. Of course, this does have the 
> disadvantage that the backend would have to duplicate any convergent 
> operations before the merge instruction/intrinsic if the block has 
> multiple predecessors. I think someone will have to actually implement 
> it and see which is better.

Agreed.

>     If the IR will be multi-threaded by default (otherwise a flag for a
>     module would be needed), then I'm thinking that we would need a more
>     fundamental change to the IR to represent this, especially since
>     convergence can be specified without the presence of cross-lane
>     functions.
> 
> 
> I think that whatever we do, we definitely shouldn't allow any 
> convergent calls in a function unless the function definition itself is 
> marked convergent. That is, no one has to worry about cross-thread 
> operations in a function, and passes can even delete merge instructions 
> (although not recommended for performance!) unless it's marked 
> convergent. I believe this is already the case now. And since 
> convergence has to be specified somehow as soon as there are any 
> convergent calls, I don't think it's that much more radical to be able 
> to specify convergence without them. We need a fundamental change to the 
> IR to be able to correctly handle ballot() and friends anyways, and once 
> you accept that, then some kind of ability to explicitly specify merge 
> points isn't really all that much more of a change, and makes it a lot 
> easier to adapt optimization passes.
> 
> 
>     >>> It's certainly interesting to think about how to maintain correctness
>     >>> in the face of ballots() with such a pass, but a) it's certainly no
>     >>> harder with merge intrinsics than merges being implicit and b) I doubt
>     >>> that's useful for anything you'd want to do with a GPU.
>     >> 
>     >> Irreducible control flow has to be handled somehow, and linearization
>     >> is the only transform I know of, that will handle everything. I'm not
>     >> sure what the execution model says about irreducible control flow.
>     >
>     > For what it's worth, the reconverging CFG approach can also handle 
>     > arbitrary irreducible control flow.
>     >
>     > Whether (or how) it can do this while being compatible with whatever 
>     > semantics we come up with for cross-lane operations remains to be seen.
> 
>     I am wondering about execution order and if that is part of the
>     specification?
>     For example:
> 
>     if (A) {
>     L1:
>        ballot();
>        if (B)
>          goto L2;
>        else
>          goto END:
>     } else {
>     L2:
>        ballot();
>        if (C)
>          goto L1;
>        else
>          goto END;
>     }
>     END:
> 
>     Should the L1 ballot be executed first, or the L2, or does it not
>     matter?  I'm not sure what the ballots should return. Would swapping
>     the branches from A (use !A instead of A) be a legal transform?
> 
> 
> The proposal doesn't, and I don't think we should, specify which order 
> the diverging threads should execute in.

Agreed.

Basically if your high-level source code contains goto's, you should 
just be on your own - within reason, to ensure composability of code.

For example, a reasonable rule would be that ballot & friends are 
undefined after a goto instruction, until execution leaves the innermost 
scope which is SESE (cannot be entered or left by goto jumps).

Cheers,
Nicolai

> That's going to be highly 
> architecture-specific, and might even be determined at runtime. That is, 
> threads that have diverged should be treated just like completely 
> separate threads. (There's a subtle difference wrt atomics and forward 
> progress guarantees on some architectures, since execution of divergent 
> threads is usually serialized, though... but close enough). So yes, 
> swapping the branch condition should be allowed.
> 
> For your example, I think that under Nicolai's proposal only a merge 
> instruction after END would do anything (since none of the other basic 
> blocks post-dominate anything). Then control flow would only be 
> guaranteed to re-converge at the very end, and each unique path through 
> the loop could eventually split into its own thread group. We might want 
> to allow control to converge earlier, to make it easier to map this to 
> structured control flow. Of course, if you're writing GPU-oriented code 
> with divergent *and* irreducible control flow like that, then... well, 
> there's not much hope anyways :)
> 
> 
>     - Jan
> 
> 
> 
>     On Friday, February 1, 2019, 4:57:54 PM EST, Nicolai Hähnle
>     <nhaehnle at gmail.com <mailto:nhaehnle at gmail.com>> wrote:
> 
> 
>     On 31.01.19 15:59, Jan Sjodin wrote:
>      >> >  Any transform that re-arranges control flow would potentially
>     have to
>      >> >  know about the properties of ballot(), and the rules with
>     respect to
>      >> >  the CFG (and maybe consider the target) to know where to
>     insert the
>      >> >  intrinsics.
>      >
>      >> But the same is true for basically any approach to handling this. In
>      >> fact, adding the merge intrinsics makes this much easier. Beyond the
>      >> usual problems with hoisting ballots, which passes currently
>     avoid via
>      >> the current convergent + speculatable attributes, we'd only have to
>      >> add awareness to passes that they can't duplicate/combine merge
>      >> intrinsics or move them past convergent intrinsics, which is a local
>      >> property and hence easily checkable. In example I explained, without
>      >> some kind of merge intrinsic, tail duplication has to look at the
>      >> entire loop to know whether the transform is legal. Of course, you
>      >> could have some kind of "no convergent calls" flag on a
>     function, but
>      >> that doesn't eliminate the nastyness when there are convergent
>     calls.
>      >
>      > We will have to determine if the intrinsics are worth more
>     compared to
>      > scanning the code.
> 
>     The reason I'm looking for solutions that can work without "scanning
>     the
>     code" or "spooky action at a distance" is that we should have a
>     solution
>     that's easily digestible by folks who are not aware of GPU execution
>     models.
> 
>     The fallback model for such folks should be: if your pass doesn't touch
>     specially-marked calls (whether that marker is "convergent" or
>     something
>     else), then you don't have to worry. Which, as a corollary, means that
>     transforms can be conservatively correct by not touching such calls.
> 
>     I don't see how to achieve this goal while forcing passes to do
>     specific
>     code scans.
> 
> 
>      >> >   I had the impression that the control flow convergence was
>      >> >   in part specified by what the target architecture can handle.
>      >
>      >> This is true, although hopefully we can agree on something that
>      >> everyone can implement.
>      >
>      > Yes, hopefully there is an execution model that is works for
>     everyone.
>      >
>      >> >  One of
>      >> >  the more complicated cases would be linearization where the
>     control
>      >> >  flow is completely rewritten, and is encoded in a variable
>     that says
>      >> >  which basic block is the next one to execute.
>      >
>      >> It's certainly interesting to think about how to maintain
>     correctness
>      >> in the face of ballots() with such a pass, but a) it's certainly no
>      >> harder with merge intrinsics than merges being implicit and b) I
>     doubt
>      >> that's useful for anything you'd want to do with a GPU.
>      >
>      > Irreducible control flow has to be handled somehow, and linearization
>      > is the only transform I know of, that will handle everything. I'm not
>      > sure what the execution model says about irreducible control flow.
> 
>     For what it's worth, the reconverging CFG approach can also handle
>     arbitrary irreducible control flow.
> 
>     Whether (or how) it can do this while being compatible with whatever
>     semantics we come up with for cross-lane operations remains to be seen.
> 
> 
>      >> >  Another case is DCE,
>      >> >  where a ballot() could be eliminated, and it would
>     potentially have to
>      >> >  remove a number of intrinsics to enable later optimizations
>     (unless it
>      >> >  would affect performance?), so the intrinsics will require some
>      >> >  non-local updates.
>      >
>      >> Removing merge intrinsics if it's profitable and allowed is a
>      >> separate optimization, that can be done relatively quickly in a
>     single
>      >> pass without impacting the performance of other optimizations. It's
>      >> requiring expensive non-local checks in optimizations which modify
>      >> control flow that's a problem.
>      >
>      > There are certainly a lot of tradeoffs. My point was simply that the
>      > intrinsics are not strictly local.
> 
>     Right. I have some ideas for how to fix this, but wasn't able to
>     work on
>     them for the last weeks.
> 
>     Cheers,
>     Nicolai
> 
> 
>      >> >  So we might have them without a ballot(), which would seem to
>     make it
>      >> >  more difficult for structurizers or other transforms to
>     maintain the
>      >> >  semantics and insert intrinsics.
>      >
>      >> It's not any more difficult code-wise, since the case where there is
>      >> a ballot needs to be handled anyways. And while it might take longer
>      >> to process stuff when this hypothetical pass encounters a merge
>      >> intrinsic (I can't think of a real-world case where it would
>     matter),
>      >> it should result in better code generated.
>      >
>      > I was thinking of linearization or something similar, where an
>      > instrinsic by itself might be hard to preserve, compared to
>     looking at
>      > a ballot() and derive where intrinsics should be inserted.
>      >
>      >> > > > How are uniform branches handled? Do they affect the
>     convergence model?
>      >> > > >
>      >> > > We may be able to remove convergence points if branches are
>     uniform.
>      >> > > In Nicolai's proposal, I believe we'd want to remove a merge
>     intrinsic
>      >> > > when all the branches that it post-dominates that aren't also
>      >> > > post-dominated by some other merge intrinsic are uniform.
>      >> >
>      >> > I couldn't quite understand the last sentence, but I assume the
>      >> > conditions would prevent removing convergence points that help
>      >> > performance. Post-domination might not be adequate if there
>     are loops
>      >> > btw.
>      >>
>      >> It should be -- Nicolai's proposal is that a merge intrinsic merges
>      >> all the divergences caused by branches post dominated by the
>      >> intrinsic, so if all the divergent branches are merged by some other
>      >> intrinsic earlier, then there's no divergence and the merge
>     intrinsic
>      >> is a no-op.
>      >
>      > Makes sense.
>      >
>      > - Jan
>      > On Wednesday, January 30, 2019, 11:41:29 AM EST, Connor Abbott
>      > <cwabbott0 at gmail.com <mailto:cwabbott0 at gmail.com>> wrote:
>      >
>      >
>      > On Wed, Jan 30, 2019 at 4:20 PM Jan Sjodin <jan_sjodin at yahoo.com
>     <mailto:jan_sjodin at yahoo.com>
>      > <mailto:jan_sjodin at yahoo.com <mailto:jan_sjodin at yahoo.com>>> wrote:
>      >
>      >
>      >    >
>      >    > > > for (int i = 0; i < 2; i++) {
>      >    > > >  foo = ballot(true); // ballot 1
>      >    > > >
>      >    > > >    if (threadID /* ID of the thread within a
>     wavefront/warp */ % 2 == 0) continue;
>      >    > > >
>      >    > > >    bar = ballot(true); // ballot 2
>      >    > > > }
>      >    > > >
>      >    > > > versus:
>      >    > > >
>      >    > > > int i = 0;
>      >    > > > while (true) {
>      >    > > >    do {
>      >    > > >        if (i == 2) break;
>      >    > > >        foo = ballot(true); // ballot 1
>      >    > > >        i++;
>      >    > > >    } while (threadID % 2 == 0);
>      >    > > >
>      >    > > >    if (i == 2) break;
>      >    > > >    bar = ballot(true); // ballot 2
>      >    > > >    i++;
>      >    > > > }
>      >    > >
>      >    > > I think you can remove the second "i++", otherwise we can
>     increment "i" twice
>      >    > > if threadID % 2 != 0, but I see the issue. Using the
>     equivalence classes would
>      >    > > prevent this transform, since we have re-arranged the
>     control flow in that way,
>      >    >
>      >    > What do you mean? Note that the transforms that lead from
>     the first
>      >    > example to the second are actually desirable if there aren't any
>      >    > ballots or other convergent operations, so banning them
>     entirely is a
>      >    > no-go. Otherwise, note that the ballot here can be nested
>     arbitrarily
>      >    > deep, which means that jump forwarding/tail duplication has
>     to be
>      >    > aware of the entire loop unless we have some kind of merge
>     intrinsic.
>      >
>      >    Yes, if there weren't any calls to a ballot, then the
>     transform would
>      >    be legal. What I was saying in the beginning was that ballot()
>     would
>      >    have some special rules attached to it. It is of course
>     undesirable to
>      >    have flags to enforce correctness, but that is the point we are at
>      >    right now.
>      >
>      >    > Also, I should say that while interesting, control
>     equivalence classes
>      >    > can be part of the *implementation*, but not the
>     *specification*. We
>      >    > need
>      >    > to define the semantics of the IR -- that is, how a program
>      >    > compiled from any given IR is allowed to do when executed
>     (and *not*
>      >    > what it looks like when compiled) -- *first*, and then which
>      >    > transforms are allowed/not allowed will fall out of that. We
>     can't
>      >    > start by listing disallowed transforms, because then when
>     someone
>      >    > comes along and writes a new optimization, it might be
>     technically
>      >    > "allowed" even though it breaks something in practice.
>      >
>      >    I think you misunderstand me if you think I was listing disallowed
>      >    transforms. My question was if it is possible to have
>     multi-threaded
>      >    semantics in the source language, but in the compiler have a
>      >    single-threaded view, where some properties of the CFG would
>     determine
>      >    what is legal and not for some functions with a special flag.
>      >
>      >
>      > But that's practically the same as listing allowed and disallowed
>      > transforms -- you're defining what the final IR is allowed to
>     look like,
>      > not how it's allowed to execute. If at all possible, the former
>     should
>      > be derived from the latter.
>      >
>      >    I agree
>      >    it is more desirable to have the the semantics specified in the
>      >    IR. However, I am exploring this from a practical point of
>     view, since
>      >    ballot() is very rare compared to all code that is being
>     compiled for
>      >    all targets. These intrinsics would always have to be
>     considered when
>      >    writing a pass. They seem to me harder to think about, and
>      >    test, for someone who is working on a single-threaded target,
>     compared
>      >    to looking at a flag. If we allow ourselves to look at which
>      >    transforms that might violate these properties, we could would
>     free up
>      >    the rest of the compiler (and developers) to not have to worry
>     about
>      >    these things. Intrinsics would have to be maintained
>     throughout the
>      >    entire compilation process in every pass.
>      >
>      >    > The only way to
>      >    > conclusively prove that transforms will never break code that's
>      >    > supposed to work (except for bugs in the transforms) is to
>     define the
>      >    > semantics of the IR, and then to make sure that it refines the
>      >    > semantics of the source language. This is why the LangRef
>     almost never
>      >    > talks about allowed/disallowed transforms except as examples to
>      >    > explain some given semantics, and if you don't follow that
>     rule, your
>      >    > patch will probably be rejected.
>      >
>      >    I'm trying to figure out if we are in the "almost never" territory
>      >    here or not.
>      >
>      >
>      > I don't think so at all. In addition to being much, much harder to
>      > reason about and prove that the approach is sound, I think it'll
>     be more
>      > intrusive.
>      >
>      >
>      >    > Now, to define a semantics for ballot(), we need to define
>     what it's
>      >    > allowed to return for any given execution of any given
>     program, and to
>      >    > do that, we need to define which threads must be active
>     together when
>      >    > it's reached, which in turns means we need to define when
>     control flow
>      >    > can re-converge somehow. Any proposal for how to handle
>     ballot() must
>      >    > start here.
>      >
>      >    > > I'm not sure if using these rules will be easier or harder
>     than dealing with
>      >    > > intrinsics. One problem is that the rules for
>     single-threaded code might seem
>      >    > > arbitrary, and it would be hard to reason about them in a
>     larger context.
>      >
>      >    > > What happens to new control flow created by transforms,
>     and what will guide
>      >    > > the insertion of intrinsics in the new code? Code
>     structurization is one example
>      >    > > were this could happen.
>      >
>      >    > Hopefully, it's clear from the above how this all falls out. The
>      >    > frontend for e.g. GLSL or SPIR-V would have to insert the merge
>      >    > intrinsics to preserve the semantics of the source language. Any
>      >    > transforms must refine the semantics of the IR, although I
>     can't think
>      >    > of a scenario where that would involve emitting any new merge
>      >    > intrinsics. Usually, structurized IR's have their own
>     semantics about
>      >    > when control flow converges, so a code structurizer should
>     respect the
>      >    > original semantics. AMDGPU has its own code structurizer
>     that runs
>      >    > very late in the pipeline (although there are plans to
>     remove it), so
>      >    > we might have to change that to make it respect the merge
>     intrinsic
>      >    > intrinsics, and then we'll correctly implement them "for free".
>      >
>      >    Any transform that re-arranges control flow would potentially
>     have to
>      >    know about the properties of ballot(), and the rules with
>     respect to
>      >    the CFG (and maybe consider the target) to know where to
>     insert the
>      >    intrinsics.
>      >
>      >
>      > But the same is true for basically any approach to handling this. In
>      > fact, adding the merge intrinsics makes this much easier. Beyond the
>      > usual problems with hoisting ballots, which passes currently
>     avoid via
>      > the current convergent + speculatable attributes, we'd only have
>     to add
>      > awareness to passes that they can't duplicate/combine merge
>     intrinsics
>      > or move them past convergent intrinsics, which is a local
>     property and
>      > hence easily checkable. In example I explained, without some kind of
>      > merge intrinsic, tail duplication has to look at the entire loop
>     to know
>      > whether the transform is legal. Of course, you could have some
>     kind of
>      > "no convergent calls" flag on a function, but that doesn't
>     eliminate the
>      > nastyness when there are convergent calls.
>      >
>      >    I had the impression that the control flow convergence was
>      >    in part specified by what the target architecture can handle.
>      >
>      >
>      > This is true, although hopefully we can agree on something that
>     everyone
>      > can implement.
>      >
>      >    One of
>      >    the more complicated cases would be linearization where the
>     control
>      >    flow is completely rewritten, and is encoded in a variable
>     that says
>      >    which basic block is the next one to execute.
>      >
>      >
>      > It's certainly interesting to think about how to maintain
>     correctness in
>      > the face of ballots() with such a pass, but a) it's certainly no
>     harder
>      > with merge intrinsics than merges being implicit and b) I doubt
>     that's
>      > useful for anything you'd want to do with a GPU.
>      >
>      >    Another case is DCE,
>      >    where a ballot() could be eliminated, and it would potentially
>     have to
>      >    remove a number of intrinsics to enable later optimizations
>     (unless it
>      >    would affect performance?), so the intrinsics will require some
>      >    non-local updates.
>      >
>      >
>      > Removing merge intrinsics if it's profitable and allowed is a
>     separate
>      > optimization, that can be done relatively quickly in a single pass
>      > without impacting the performance of other optimizations. It's
>     requiring
>      > expensive non-local checks in optimizations which modify control
>     flow
>      > that's a problem.
>      >
>      >
>      >    > > > Would they only be present if ballot and similar
>     functions are used, or do they
>      >    > > > have to be present everywhere?
>      >    > >
>      >    > > They'd only have to be present when ballot or other convergent
>      >    > > functions are called, since otherwise it doesn't matter
>     when control
>      >    > flow re-converges. However, we may want to keep them around for
>      >    > performance reasons (usually destroying convergence points
>     is bad for
>      >    > performance).
>      >
>      >    So we might have them without a ballot(), which would seem to
>     make it
>      >    more difficult for structurizers or other transforms to
>     maintain the
>      >    semantics and insert intrinsics.
>      >
>      >
>      > It's not any more difficult code-wise, since the case where there
>     is a
>      > ballot needs to be handled anyways. And while it might take
>     longer to
>      > process stuff when this hypothetical pass encounters a merge
>     intrinsic
>      > (I can't think of a real-world case where it would matter), it
>     should
>      > result in better code generated.
>      >
>      >
>      >    > > How are uniform branches handled? Do they affect the
>     convergence model?
>      >    > >
>      >    > We may be able to remove convergence points if branches are
>     uniform.
>      >    > In Nicolai's proposal, I believe we'd want to remove a merge
>     intrinsic
>      >    > when all the branches that it post-dominates that aren't also
>      >    > post-dominated by some other merge intrinsic are uniform.
>      >
>      >    I couldn't quite understand the last sentence, but I assume the
>      >    conditions would prevent removing convergence points that help
>      >    performance. Post-domination might not be adequate if there
>     are loops
>      >    btw.
>      >
>      >
>      > It should be -- Nicolai's proposal is that a merge intrinsic
>     merges all
>      > the divergences caused by branches post dominated by the
>     intrinsic, so
>      > if all the divergent branches are merged by some other intrinsic
>      > earlier, then there's no divergence and the merge intrinsic is a
>     no-op.
>      >
>      >
>      >    - Jan
>      >
>      >
>      >
>      >    On Wednesday, January 30, 2019, 6:29:52 AM EST, Connor Abbott
>      >    <cwabbott0 at gmail.com <mailto:cwabbott0 at gmail.com>
>     <mailto:cwabbott0 at gmail.com <mailto:cwabbott0 at gmail.com>>> wrote:
>      >
>      >
>      >    On Mon, Jan 28, 2019 at 9:09 PM Jan Sjodin
>     <jan_sjodin at yahoo.com <mailto:jan_sjodin at yahoo.com>
>      >    <mailto:jan_sjodin at yahoo.com <mailto:jan_sjodin at yahoo.com>>>
>     wrote:
>      >      >
>      >      > > for (int i = 0; i < 2; i++) {
>      >      > >  foo = ballot(true); // ballot 1
>      >      > >
>      >      > >    if (threadID /* ID of the thread within a
>     wavefront/warp */
>      >    % 2 == 0) continue;
>      >      > >
>      >      > >    bar = ballot(true); // ballot 2
>      >      > > }
>      >      > >
>      >      > > versus:
>      >      > >
>      >      > > int i = 0;
>      >      > > while (true) {
>      >      > >    do {
>      >      > >        if (i == 2) break;
>      >      > >        foo = ballot(true); // ballot 1
>      >      > >        i++;
>      >      > >    } while (threadID % 2 == 0);
>      >      > >
>      >      > >    if (i == 2) break;
>      >      > >    bar = ballot(true); // ballot 2
>      >      > >    i++;
>      >      > > }
>      >      >
>      >      > I think you can remove the second "i++", otherwise we can
>      >    increment "i" twice
>      >      > if threadID % 2 != 0, but I see the issue. Using the
>     equivalence
>      >    classes would
>      >      > prevent this transform, since we have re-arranged the control
>      >    flow in that way,
>      >
>      >    What do you mean? Note that the transforms that lead from the
>     first
>      >    example to the second are actually desirable if there aren't any
>      >    ballots or other convergent operations, so banning them
>     entirely is a
>      >    no-go. Otherwise, note that the ballot here can be nested
>     arbitrarily
>      >    deep, which means that jump forwarding/tail duplication has to be
>      >    aware of the entire loop unless we have some kind of merge
>     intrinsic.
>      >
>      >    Also, I should say that while interesting, control equivalence
>     classes
>      >    can be part of the *implementation*, but not the
>     *specification*. We
>      >    need to define the semantics of the IR -- that is, how a program
>      >    compiled from any given IR is allowed to do when executed (and
>     *not*
>      >    what it looks like when compiled) -- *first*, and then which
>      >    transforms are allowed/not allowed will fall out of that. We can't
>      >    start by listing disallowed transforms, because then when someone
>      >    comes along and writes a new optimization, it might be technically
>      >    "allowed" even though it breaks something in practice. The
>     only way to
>      >    conclusively prove that transforms will never break code that's
>      >    supposed to work (except for bugs in the transforms) is to
>     define the
>      >    semantics of the IR, and then to make sure that it refines the
>      >    semantics of the source language. This is why the LangRef
>     almost never
>      >    talks about allowed/disallowed transforms except as examples to
>      >    explain some given semantics, and if you don't follow that
>     rule, your
>      >    patch will probably be rejected.
>      >
>      >    Now, to define a semantics for ballot(), we need to define
>     what it's
>      >    allowed to return for any given execution of any given
>     program, and to
>      >    do that, we need to define which threads must be active
>     together when
>      >    it's reached, which in turns means we need to define when
>     control flow
>      >    can re-converge somehow. Any proposal for how to handle
>     ballot() must
>      >    start here.
>      >
>      >      > I'm not sure if using these rules will be easier or harder
>     than
>      >    dealing with
>      >      > intrinsics. One problem is that the rules for single-threaded
>      >    code might seem
>      >      > arbitrary, and it would be hard to reason about them in a
>     larger
>      >    context.
>      >      >
>      >      > > Nicolai's proposal solves this by having the frontend emit a
>      >    merge intrinsic
>      >      > > before the i++ is emitted. This prevents the jump forwarding
>      >    from occurring.
>      >      >
>      >      > I was thinking about getting through the single-thread
>     view and
>      >    the issues with
>      >      > that first, then I will think more about the multi-thread and
>      >    explicit convergence.
>      >      >
>      >      > If we are done with the single-thread stuff for now these
>     are the
>      >    question that I
>      >      > have been thinking about with the multi-threaded view:
>      >      >
>      >      > What happens to new control flow created by transforms,
>     and what
>      >    will guide
>      >      > the insertion of intrinsics in the new code? Code
>     structurization
>      >    is one example
>      >      > were this could happen.
>      >
>      >    Hopefully, it's clear from the above how this all falls out. The
>      >    frontend for e.g. GLSL or SPIR-V would have to insert the merge
>      >    intrinsics to preserve the semantics of the source language. Any
>      >    transforms must refine the semantics of the IR, although I
>     can't think
>      >    of a scenario where that would involve emitting any new merge
>      >    intrinsics. Usually, structurized IR's have their own
>     semantics about
>      >    when control flow converges, so a code structurizer should
>     respect the
>      >    original semantics. AMDGPU has its own code structurizer that runs
>      >    very late in the pipeline (although there are plans to remove
>     it), so
>      >    we might have to change that to make it respect the merge
>     intrinsic
>      >    intrinsics, and then we'll correctly implement them "for free".
>      >
>      >      >
>      >      > Would they only be present if ballot and similar functions are
>      >    used, or do they
>      >      > have to be present everywhere?
>      >
>      >    They'd only have to be present when ballot or other convergent
>      >    functions are called, since otherwise it doesn't matter when
>     control
>      >    flow re-converges. However, we may want to keep them around for
>      >    performance reasons (usually destroying convergence points is
>     bad for
>      >    performance).
>      >
>      >      >
>      >      > How are uniform branches handled? Do they affect the
>     convergence
>      >    model?
>      >
>      >    We may be able to remove convergence points if branches are
>     uniform.
>      >    In Nicolai's proposal, I believe we'd want to remove a merge
>     intrinsic
>      >    when all the branches that it post-dominates that aren't also
>      >    post-dominated by some other merge intrinsic are uniform.
>      >
>      >
>      >
>      >      >
>      >      >
>      >      > - Jan
>      >      >
>      >      >
>      >      > On Monday, January 28, 2019, 11:16:36 AM EST, Connor Abbott
>      >    <cwabbott0 at gmail.com <mailto:cwabbott0 at gmail.com>
>     <mailto:cwabbott0 at gmail.com <mailto:cwabbott0 at gmail.com>>> wrote:
>      >      >
>      >      >
>      >      >
>      >      > On Fri, Jan 25, 2019 at 3:05 AM Jan Sjodin
>     <jan_sjodin at yahoo.com <mailto:jan_sjodin at yahoo.com>
>      >    <mailto:jan_sjodin at yahoo.com <mailto:jan_sjodin at yahoo.com>>>
>     wrote:
>      >      > >
>      >      > > > for (...) {
>      >      > > >    ballot();
>      >      > > >    if (... /* non-uniform */) continue;
>      >      > > > }
>      >      > > >
>      >      > > > into
>      >      > > >
>      >      > > > for (...) {
>      >      > > >    do {
>      >      > > >        ballot();
>      >      > > >    } while (... /* non-uniform */);
>      >      > > > }
>      >      > >
>      >      > > I'm not sure if I follow this example, could you and
>     explain a
>      >    bit more?
>      >      > > It looks to me that the condition in the "if" must be
>     false (if the
>      >      > > same condition is used in the while), or we would
>      >      > > call ballot the wrong number of times.
>      >      >
>      >      > Yes, the idea is that the same condition is used in the if and
>      >    the do-while. I think I messed up the example a little... in the
>      >    second snippet, we're supposed to break out of the inner loop
>     if the
>      >    outer loop's exit condition is true. Here's a more concrete
>     example:
>      >      >
>      >      > for (int i = 0; i < 2; i++) {
>      >      >    foo = ballot(true); // ballot 1
>      >      >
>      >      >    if (threadID /* ID of the thread within a
>     wavefront/warp */ %
>      >    2 == 0) continue;
>      >      >
>      >      >    bar = ballot(true); // ballot 2
>      >      > }
>      >      >
>      >      > versus:
>      >      >
>      >      > int i = 0;
>      >      > while (true) {
>      >      >    do {
>      >      >        if (i == 2) break;
>      >      >        foo = ballot(true); // ballot 1
>      >      >        i++;
>      >      >    } while (threadID % 2 == 0);
>      >      >
>      >      >    if (i == 2) break;
>      >      >    bar = ballot(true); // ballot 2
>      >      >    i++;
>      >      > }
>      >      >
>      >      > From a single-threaded perspective, these two snippets are
>      >    identical, even if ballot() writes to arbitrary memory. The first
>      >    can easily get transformed to something like the second when LLVM
>      >    decides to duplicate the final i++ through jump forwarding,
>     and then
>      >    re-interprets the loop as two nested loops and splits the loop
>      >    header in two. This is what currently happens with DOOM when
>     we try
>      >    to enable subgroup operations with it. Let's say there are two
>      >    threads in a wavefront. Then the execution trace mandated by
>     SPIR-V
>      >    for the first looks like:
>      >      >
>      >      > thread 0        | thread 1
>      >      > ballot 1 = 0b11 | ballot 1 = 0b11
>      >      > skipped        | ballot 2 = 0b10
>      >      > ballot 1 = 0b11 | ballot 1 = 0b11
>      >      > skipped        | ballot 2 = 0b10
>      >      >
>      >      > Now, contrast this with the execution trace that programmers
>      >    would expect for the second example:
>      >      >
>      >      > thread 0        | thread 1
>      >      > ballot 1 = 0b11 | ballot 1 = 0b11
>      >      > ballot 1 = 0b01 | skipped
>      >      > skipped        | ballot 2 = 0b10
>      >      > skipped        | ballot 1 = 0b10
>      >      > skipped        | ballot 2 = 0b10
>      >      >
>      >      > Nicolai's proposal solves this by having the frontend emit a
>      >    merge intrinsic before the i++ is emitted. This prevents the jump
>      >    forwarding from occurring.
>      >      >
>      >      >
>      >      > >
>      >      > > About the CSE, when would that be legal? I can imagine
>     with uniform
>      >      > > branches that it could work, but would like to see an
>     example to
>      >      > > fully understand this.
>      >      > >
>      >      > > I agree that it would be more conservative than if we
>     model the
>      >    threading,
>      >      > > but I'm not sure about the cost/benefit. I am mostly
>     curious if
>      >    it is
>      >      > > possible to have a single-thread view or not. Then we
>     would have to
>      >      > > see if it is adequate.
>      >      > >
>      >      > > - Jan
>      >      > >
>      >      > > On Thursday, January 24, 2019, 10:31:47 AM EST, Connor
>     Abbott
>      >    <cwabbott0 at gmail.com <mailto:cwabbott0 at gmail.com>
>     <mailto:cwabbott0 at gmail.com <mailto:cwabbott0 at gmail.com>>> wrote:
>      >      > >
>      >      > >
>      >      > > I don't see how this would fix the continue vs. nested loop
>      >    problem I
>      >      > > explained earlier. That is, how would this prevent turning:
>      >      > >
>      >      > > for (...) {
>      >      > >    ballot();
>      >      > >    if (... /* non-uniform */) continue;
>      >      > > }
>      >      > >
>      >      > > into
>      >      > >
>      >      > > for (...) {
>      >      > >    do {
>      >      > >        ballot();
>      >      > >    } while (... /* non-uniform */);
>      >      > > }
>      >      > >
>      >      > > and vice versa? Note that there's no duplication going
>     on here, and
>      >      > > the single-threaded flow of control is exactly the same.
>      >      > >
>      >      > > Another reason this isn't so great is that it prevents
>     e.g. CSE on
>      >      > > ballots that actually should be combined, since you're
>      >    modelling it as
>      >      > > a write. It seems like the optimizer is going to need
>     some special
>      >      > > knowledge of convergent things that fake memory constraints
>      >    can't give
>      >      > > us.
>      >      > >
>      >      > > On Thu, Jan 24, 2019 at 4:06 PM Jan Sjodin
>      >    <jan_sjodin at yahoo.com <mailto:jan_sjodin at yahoo.com>
>     <mailto:jan_sjodin at yahoo.com <mailto:jan_sjodin at yahoo.com>>> wrote:
>      >      > > >
>      >      > > >
>      >      > > > I was looking into ballot() and how if it is possible
>     to keep
>      >    a single-threaded
>      >      > > > view of the code, but add some extra conditions that must
>      >    hold after the
>      >      > > > transformation. I had the initial idea that each call to
>      >    ballot() in a
>      >      > > > single-threaded program can be seen as a partial write
>     to a
>      >    memory
>      >      > > > location, and each location memory location is unique for
>      >    every call site,
>      >      > > > plus there some externally observable side effect. We can
>      >    abstract this
>      >      > > > away by tagging the calls, e.g. by using aliases.
>      >      > > >
>      >      > > > For example:
>      >      > > >
>      >      > > > if (...) {
>      >      > > >      foo1 = ballot();
>      >      > > > } else {
>      >      > > >      foo2 = ballot();
>      >      > > > }
>      >      > > >
>      >      > > > simply becomes:
>      >      > > >
>      >      > > > if (...) {
>      >      > > >      foo1 = ballot_1();
>      >      > > > } else {
>      >      > > >      foo2 = ballot_2();
>      >      > > > }
>      >      > > >
>      >      > > >
>      >      > > > and
>      >      > > >
>      >      > > > if (...) {
>      >      > > > } else {
>      >      > > > }
>      >      > > > ballot();
>      >      > > >
>      >      > > > becomes
>      >      > > >
>      >      > > > if (...) {
>      >      > > > } else {
>      >      > > > }
>      >      > > > ballot_1();
>      >      > > >
>      >      > > > In the first case it would prevent combining the two calls
>      >    into one
>      >      > > > after the if. In the second example there is generally
>      >    nothing that
>      >      > > > says it could not be transformed into the first
>     example with two
>      >      > > > calls to ballot_1(), which should not be allowed.
>      >      > > >
>      >      > > > Another form of duplication that we must allow are loop
>      >    transforms,
>      >      > > > like unrolling or peeling. These might seem similar to the
>      >    example
>      >      > > > above, since we are cloning code and with conditions
>     etc. But
>      >      > > > they are different since they calls are in different loop
>      >    iterations.
>      >      > > >
>      >      > > > The condition that needs to be met is that:
>      >      > > >
>      >      > > > There must be a single path between all cloned ballot_n()
>      >    functions.
>      >      > > >
>      >      > > > The reason for this condition is that if we clone the same
>      >    call, then
>      >      > > > the copies must be mutually exclusive, but if they are
>     cloned
>      >    from
>      >      > > > a loop, there must be a path, or we would skip iterations.
>      >      > > >
>      >      > > > If we want to be more sophisticated we can add:
>      >      > > >
>      >      > > > If there is no such path, the calls must be separated by
>      >    uniform branches.
>      >      > > >
>      >      > > > After the transform everything should be re-tagged,
>     since we
>      >    already
>      >      > > > checked the calls and we don't want to check them again.
>      >    Also, not all
>      >      > > > transforms need (or should) have the tagging checked. One
>      >    example is
>      >      > > > inlining, where multiple copies are created, but they are
>      >    clearly different
>      >      > > > calls. The tagging can be done temporarily for a
>     single pass,
>      >    and then
>      >      > > > eliminated. This could be a good tool for debugging as
>     well,
>      >    since it can
>      >      > > > detect if a transform is suspect.
>      >      > > >
>      >      > > > The code would of course have to make sense as far as
>     control
>      >    flow. If
>      >      > > > we have:
>      >      > > >
>      >      > > > for(;;) {
>      >      > > >    if(...) {
>      >      > > >      ballot();
>      >      > > >      break;
>      >      > > >    }
>      >      > > > }
>      >      > > >
>      >      > > > This would have to be translated to:
>      >      > > >
>      >      > > > for(;;) {
>      >      > > >    if(...) {
>      >      > > >      ballot();
>      >      > > >      if(UnknownConstant) {  // Not a uniform
>     condition, but
>      >    later on translated to "true"
>      >      > > >        break;
>      >      > > >      }
>      >      > > > }
>      >      > > >
>      >      > > > However, I think that this is the way the code is
>     generated
>      >    today anyway.
>      >      > > > There would have to be some attribute that indicate that
>      >    these calls (or functions)
>      >      > > > contain ballot or other cross-lane operations so they
>     could
>      >    be tagged and
>      >      > > > checked. The attribute would be used by the passes to know
>      >    that the special
>      >      > > > conditions exist for those calls.
>      >      > > >
>      >      > > > As far as what it means to have a path, it could be
>     complicated.
>      >      > > > For example:
>      >      > > >
>      >      > > > x = ...
>      >      > > > ballot_1();
>      >      > > >
>      >      > > > could be transformed to:
>      >      > > >
>      >      > > > if (x < 4711) {
>      >      > > >  ballot_1();
>      >      > > >
>      >      > > > if(x >= 4711) {
>      >      > > >  ballot_1();
>      >      > > > }
>      >      > > >
>      >      > > > So a simple path check would say there is a path, and the
>      >    transform is legal,
>      >      > > > but if we examine the conditions, there is no path,
>     and the
>      >    transform should not be legal.
>      >      > > > It could be made even more obscure of course, but I
>     don't see
>      >    any optimizations really
>      >      > > > doing this kind of thing,
>      >      > > >
>      >      > > > - Jan
>      >      > > >
>      >      > > > On Saturday, December 29, 2018, 11:32:25 AM EST, Nicolai
>      >    Hähnle via llvm-dev <llvm-dev at lists.llvm.org
>     <mailto:llvm-dev at lists.llvm.org>
>      >    <mailto:llvm-dev at lists.llvm.org
>     <mailto:llvm-dev at lists.llvm.org>>> wrote:
>      >      > > >
>      >      > > >
>      >      > > > On 20.12.18 18:03, Connor Abbott wrote:
>      >      > > > >    We already have the notion of "convergent"
>     functions like
>      >      > > > >    syncthreads(), to which we cannot add control-flow
>      >    dependencies.
>      >      > > > >    That is, it's legal to hoist syncthreads out of
>     an "if",
>      >    but it's
>      >      > > > >    not legal to sink it into an "if".  It's not
>     clear to me
>      >    why we
>      >      > > > >    can't have "anticonvergent" (terrible name) functions
>      >    which cannot
>      >      > > > >    have control-flow dependencies removed from them?
>      >    ballot() would be
>      >      > > > >    both convergent and anticonvergent.
>      >      > > > >
>      >      > > > >    Would that solve your problem?
>      >      > > > >
>      >      > > > >
>      >      > > > > I think it's important to note that we already have
>     such an
>      >    attribute,
>      >      > > > > although with the opposite sense - it's impossible to
>      >    remove control
>      >      > > > > flow dependencies from a call unless you mark it as
>      >    "speculatable".
>      >      > > >
>      >      > > > This isn't actually true. If both sides of an if/else have
>      >    the same
>      >      > > > non-speculative function call, it can still be moved
>     out of
>      >    control flow.
>      >      > > >
>      >      > > > That's because doing so doesn't change anything at all
>     from a
>      >      > > > single-threaded perspective. Hence why I think we should
>      >    model the
>      >      > > > communication between threads honestly.
>      >      > > >
>      >      > > >
>      >      > > > > However, this doesn't prevent
>      >      > > > >
>      >      > > > > if (...) {
>      >      > > > > } else {
>      >      > > > > }
>      >      > > > > foo = ballot();
>      >      > > > >
>      >      > > > > from being turned into
>      >      > > > >
>      >      > > > > if (...) {
>      >      > > > >      foo1 = ballot();
>      >      > > > > } else {
>      >      > > > >      foo2 = ballot();
>      >      > > > > }
>      >      > > > > foo = phi(foo1, foo2)
>      >      > > > >
>      >      > > > > and vice versa. We have a "noduplicate" attribute which
>      >    prevents
>      >      > > > > transforming the first into the second, but not the
>     other
>      >    way around. Of
>      >      > > > > course we could keep going this way and add a
>     "nocombine"
>      >    attribute to
>      >      > > > > complement noduplicate. But even then, there are
>     even still
>      >    problematic
>      >      > > > > transforms. For example, take this program, which is
>      >    simplified from a
>      >      > > > > real game that doesn't work with the AMDGPU backend:
>      >      > > > >
>      >      > > > > while (cond1 /* uniform */) {
>      >      > > > >      ballot();
>      >      > > > >      ...
>      >      > > > >      if (cond2 /* non-uniform */) continue;
>      >      > > > >      ...
>      >      > > > > }
>      >      > > > >
>      >      > > > > In SPIR-V, when using structured control flow, the
>      >    semantics of this are
>      >      > > > > pretty clearly defined. In particular, there's a
>     continue
>      >    block after
>      >      > > > > the body of the loop where control flow
>     re-converges, and
>      >    the only back
>      >      > > > > edge is from the continue block, so the ballot is in
>      >    uniform control
>      >      > > > > flow. But LLVM will get rid of the continue block since
>      >    it's empty, and
>      >      > > > > re-analyze the loop as two nested loops, splitting
>     the loop
>      >    header in
>      >      > > > > two, producing a CFG which corresponds to this:
>      >      > > > >
>      >      > > > > while (cond1 /* uniform */) {
>      >      > > > >      do {
>      >      > > > >          ballot();
>      >      > > > >          ...
>      >      > > > >      } while (cond2 /* non-uniform */);
>      >      > > > >      ...
>      >      > > > > }
>      >      > > > >
>      >      > > > > Now, in an implementation where control flow
>     re-converges
>      >    at the
>      >      > > > > immediate post-dominator, this won't do the right thing
>      >    anymore. In
>      >      > > > > order to handle it correctly, you'd effectively need to
>      >    always flatten
>      >      > > > > nested loops, which will probably be really bad for
>      >    performance if the
>      >      > > > > programmer actually wanted the second thing. It also
>     makes
>      >    it impossible
>      >      > > > > when translating a high-level language to LLVM to
>     get the
>      >    "natural"
>      >      > > > > behavior which game developers actually expect. This is
>      >    exactly the sort
>      >      > > > > of "spooky action at a distance" which makes me
>     think that
>      >    everything
>      >      > > > > we've done so far is really insufficient, and we need to
>      >    add an explicit
>      >      > > > > notion of control-flow divergence and reconvergence
>     to the
>      >    IR. We need a
>      >      > > > > way to say that control flow re-converges at the
>     continue
>      >    block, so that
>      >      > > > > LLVM won't eliminate it, and we can vectorize it
>     correctly
>      >    without
>      >      > > > > penalizing cases where it's better for control flow
>     not to
>      >    re-converge.
>      >      > > >
>      >      > > > Well said!
>      >      > > >
>      >      > > > Cheers,
>      >      > > >
>      >      > > > Nicolai
>      >      > > > --
>      >      > > > Lerne, wie die Welt wirklich ist,
>      >      > > > Aber vergiss niemals, wie sie sein sollte.
>      >      > > > _______________________________________________
>      >      > > > LLVM Developers mailing list
>      >      > > > llvm-dev at lists.llvm.org
>     <mailto:llvm-dev at lists.llvm.org> <mailto:llvm-dev at lists.llvm.org
>     <mailto:llvm-dev at lists.llvm.org>>
>      >      > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>      >
> 
> 
>     -- 
>     Lerne, wie die Welt wirklich ist,
>     Aber vergiss niemals, wie sie sein sollte.
> 

-- 
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.