[llvm-dev] [RFC] Adding thread group semantics to LangRef (motivated by GPUs)

Wed Jan 30 23:01:38 PST 2019

On Wed, Jan 30, 2019 at 7:20 AM Jan Sjodin via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

>
> >
> > > > for (int i = 0; i < 2; i++) {
> > > >  foo = ballot(true); // ballot 1
> > > >
> > > >    if (threadID /* ID of the thread within a wavefront/warp */ % 2
> == 0) continue;
> > > >
> > > >    bar = ballot(true); // ballot 2
> > > > }
> > > >
> > > > versus:
> > > >
> > > > int i = 0;
> > > > while (true) {
> > > >    do {
> > > >        if (i == 2) break;
> > > >        foo = ballot(true); // ballot 1
> > > >        i++;
> > > >    } while (threadID % 2 == 0);
> > > >
> > > >    if (i == 2) break;
> > > >    bar = ballot(true); // ballot 2
> > > >    i++;
> > > > }
> > >
> > > I think you can remove the second "i++", otherwise we can increment
> "i" twice
> > > if threadID % 2 != 0, but I see the issue. Using the equivalence
> classes would
> > > prevent this transform, since we have re-arranged the control flow in
> that way,
> >
> > What do you mean? Note that the transforms that lead from the first
> > example to the second are actually desirable if there aren't any
> > ballots or other convergent operations, so banning them entirely is a
> > no-go. Otherwise, note that the ballot here can be nested arbitrarily
> > deep, which means that jump forwarding/tail duplication has to be
> > aware of the entire loop unless we have some kind of merge intrinsic.
>
> Yes, if there weren't any calls to a ballot, then the transform would
> be legal. What I was saying in the beginning was that ballot() would
> have some special rules attached to it. It is of course undesirable to
> have flags to enforce correctness, but that is the point we are at
> right now.
>
> > Also, I should say that while interesting, control equivalence classes
> > can be part of the *implementation*, but not the *specification*. We
> > need
> > to define the semantics of the IR -- that is, how a program
> > compiled from any given IR is allowed to do when executed (and *not*
> > what it looks like when compiled) -- *first*, and then which
> > transforms are allowed/not allowed will fall out of that. We can't
> > start by listing disallowed transforms, because then when someone
> > comes along and writes a new optimization, it might be technically
> > "allowed" even though it breaks something in practice.
>
> I think you misunderstand me if you think I was listing disallowed
> transforms. My question was if it is possible to have multi-threaded
> semantics in the source language, but in the compiler have a
> single-threaded view, where some properties of the CFG would determine
> what is legal and not for some functions with a special flag. I agree
> it is more desirable to have the the semantics specified in the
> IR. However, I am exploring this from a practical point of view, since
> ballot() is very rare compared to all code that is being compiled for
> all targets. These intrinsics would always have to be considered when
> writing a pass. They seem to me harder to think about, and
> test, for someone who is working on a single-threaded target, compared
> to looking at a flag. If we allow ourselves to look at which
> transforms that might violate these properties, we could would free up
> the rest of the compiler (and developers) to not have to worry about
> these things. Intrinsics would have to be maintained throughout the
> entire compilation process in every pass.
>

This is a very important point, and I believe this is why we went with the
convergent attribute a few years ago.
I passively followed this thread from the beginning, but something that I
think would really be key to start considering this "RFC" is an actual
LangRef proposal.
It seems hard to me to get an idea about the proposed trade-off of adding
the burden of this new "thread-aware" semantic on the IR semantic itself
(and on every existing and future transformations) without seeing the
concrete proposal.

Thanks,

-- 
Mehdi

>
> > The only way to
> > conclusively prove that transforms will never break code that's
> > supposed to work (except for bugs in the transforms) is to define the
> > semantics of the IR, and then to make sure that it refines the
> > semantics of the source language. This is why the LangRef almost never
> > talks about allowed/disallowed transforms except as examples to
> > explain some given semantics, and if you don't follow that rule, your
> > patch will probably be rejected.
>
> I'm trying to figure out if we are in the "almost never" territory
> here or not.
>
> > Now, to define a semantics for ballot(), we need to define what it's
> > allowed to return for any given execution of any given program, and to
> > do that, we need to define which threads must be active together when
> > it's reached, which in turns means we need to define when control flow
> > can re-converge somehow. Any proposal for how to handle ballot() must
> > start here.
>
> > > I'm not sure if using these rules will be easier or harder than
> dealing with
> > > intrinsics. One problem is that the rules for single-threaded code
> might seem
> > > arbitrary, and it would be hard to reason about them in a larger
> context.
>
> > > What happens to new control flow created by transforms, and what will
> guide
> > > the insertion of intrinsics in the new code? Code structurization is
> one example
> > > were this could happen.
>
> > Hopefully, it's clear from the above how this all falls out. The
> > frontend for e.g. GLSL or SPIR-V would have to insert the merge
> > intrinsics to preserve the semantics of the source language. Any
> > transforms must refine the semantics of the IR, although I can't think
> > of a scenario where that would involve emitting any new merge
> > intrinsics. Usually, structurized IR's have their own semantics about
> > when control flow converges, so a code structurizer should respect the
> > original semantics. AMDGPU has its own code structurizer that runs
> > very late in the pipeline (although there are plans to remove it), so
> > we might have to change that to make it respect the merge intrinsic
> > intrinsics, and then we'll correctly implement them "for free".
>
> Any transform that re-arranges control flow would potentially have to
> know about the properties of ballot(), and the rules with respect to
> the CFG (and maybe consider the target) to know where to insert the
> intrinsics. I had the impression that the control flow convergence was
> in part specified by what the target architecture can handle. One of
> the more complicated cases would be linearization where the control
> flow is completely rewritten, and is encoded in a variable that says
> which basic block is the next one to execute. Another case is DCE,
> where a ballot() could be eliminated, and it would potentially have to
> remove a number of intrinsics to enable later optimizations (unless it
> would affect performance?), so the intrinsics will require some
> non-local updates.
>
> > > > Would they only be present if ballot and similar functions are used,
> or do they
> > > > have to be present everywhere?
> > >
> > > They'd only have to be present when ballot or other convergent
> > > functions are called, since otherwise it doesn't matter when control
> > flow re-converges. However, we may want to keep them around for
> > performance reasons (usually destroying convergence points is bad for
> > performance).
>
> So we might have them without a ballot(), which would seem to make it
> more difficult for structurizers or other transforms to maintain the
> semantics and insert intrinsics.
>
> > > How are uniform branches handled? Do they affect the convergence model?
> > >
> > We may be able to remove convergence points if branches are uniform.
> > In Nicolai's proposal, I believe we'd want to remove a merge intrinsic
> > when all the branches that it post-dominates that aren't also
> > post-dominated by some other merge intrinsic are uniform.
>
> I couldn't quite understand the last sentence, but I assume the
> conditions would prevent removing convergence points that help
> performance. Post-domination might not be adequate if there are loops
> btw.
>
> - Jan
>
>
>
> On Wednesday, January 30, 2019, 6:29:52 AM EST, Connor Abbott <
> cwabbott0 at gmail.com> wrote:
>
>
> On Mon, Jan 28, 2019 at 9:09 PM Jan Sjodin <jan_sjodin at yahoo.com> wrote:
> >
> > > for (int i = 0; i < 2; i++) {
> > >  foo = ballot(true); // ballot 1
> > >
> > >    if (threadID /* ID of the thread within a wavefront/warp */ % 2 ==
> 0) continue;
> > >
> > >    bar = ballot(true); // ballot 2
> > > }
> > >
> > > versus:
> > >
> > > int i = 0;
> > > while (true) {
> > >    do {
> > >        if (i == 2) break;
> > >        foo = ballot(true); // ballot 1
> > >        i++;
> > >    } while (threadID % 2 == 0);
> > >
> > >    if (i == 2) break;
> > >    bar = ballot(true); // ballot 2
> > >    i++;
> > > }
> >
> > I think you can remove the second "i++", otherwise we can increment "i"
> twice
> > if threadID % 2 != 0, but I see the issue. Using the equivalence classes
> would
> > prevent this transform, since we have re-arranged the control flow in
> that way,
>
> What do you mean? Note that the transforms that lead from the first
> example to the second are actually desirable if there aren't any
> ballots or other convergent operations, so banning them entirely is a
> no-go. Otherwise, note that the ballot here can be nested arbitrarily
> deep, which means that jump forwarding/tail duplication has to be
> aware of the entire loop unless we have some kind of merge intrinsic.
>
> Also, I should say that while interesting, control equivalence classes
> can be part of the *implementation*, but not the *specification*. We
> need to define the semantics of the IR -- that is, how a program
> compiled from any given IR is allowed to do when executed (and *not*
> what it looks like when compiled) -- *first*, and then which
> transforms are allowed/not allowed will fall out of that. We can't
> start by listing disallowed transforms, because then when someone
> comes along and writes a new optimization, it might be technically
> "allowed" even though it breaks something in practice. The only way to
> conclusively prove that transforms will never break code that's
> supposed to work (except for bugs in the transforms) is to define the
> semantics of the IR, and then to make sure that it refines the
> semantics of the source language. This is why the LangRef almost never
> talks about allowed/disallowed transforms except as examples to
> explain some given semantics, and if you don't follow that rule, your
> patch will probably be rejected.
>
> Now, to define a semantics for ballot(), we need to define what it's
> allowed to return for any given execution of any given program, and to
> do that, we need to define which threads must be active together when
> it's reached, which in turns means we need to define when control flow
> can re-converge somehow. Any proposal for how to handle ballot() must
> start here.
>
> > I'm not sure if using these rules will be easier or harder than dealing
> with
> > intrinsics. One problem is that the rules for single-threaded code might
> seem
> > arbitrary, and it would be hard to reason about them in a larger context.
> >
> > > Nicolai's proposal solves this by having the frontend emit a merge
> intrinsic
> > > before the i++ is emitted. This prevents the jump forwarding from
> occurring.
> >
> > I was thinking about getting through the single-thread view and the
> issues with
> > that first, then I will think more about the multi-thread and explicit
> convergence.
> >
> > If we are done with the single-thread stuff for now these are the
> question that I
> > have been thinking about with the multi-threaded view:
> >
> > What happens to new control flow created by transforms, and what will
> guide
> > the insertion of intrinsics in the new code? Code structurization is one
> example
> > were this could happen.
>
> Hopefully, it's clear from the above how this all falls out. The
> frontend for e.g. GLSL or SPIR-V would have to insert the merge
> intrinsics to preserve the semantics of the source language. Any
> transforms must refine the semantics of the IR, although I can't think
> of a scenario where that would involve emitting any new merge
> intrinsics. Usually, structurized IR's have their own semantics about
> when control flow converges, so a code structurizer should respect the
> original semantics. AMDGPU has its own code structurizer that runs
> very late in the pipeline (although there are plans to remove it), so
> we might have to change that to make it respect the merge intrinsic
> intrinsics, and then we'll correctly implement them "for free".
>
> >
> > Would they only be present if ballot and similar functions are used, or
> do they
> > have to be present everywhere?
>
> They'd only have to be present when ballot or other convergent
> functions are called, since otherwise it doesn't matter when control
> flow re-converges. However, we may want to keep them around for
> performance reasons (usually destroying convergence points is bad for
> performance).
>
> >
> > How are uniform branches handled? Do they affect the convergence model?
>
> We may be able to remove convergence points if branches are uniform.
> In Nicolai's proposal, I believe we'd want to remove a merge intrinsic
> when all the branches that it post-dominates that aren't also
> post-dominated by some other merge intrinsic are uniform.
>
>
>
> >
> >
> > - Jan
> >
> >
> > On Monday, January 28, 2019, 11:16:36 AM EST, Connor Abbott <
> cwabbott0 at gmail.com> wrote:
> >
> >
> >
> > On Fri, Jan 25, 2019 at 3:05 AM Jan Sjodin <jan_sjodin at yahoo.com> wrote:
> > >
> > > > for (...) {
> > > >    ballot();
> > > >    if (... /* non-uniform */) continue;
> > > > }
> > > >
> > > > into
> > > >
> > > > for (...) {
> > > >    do {
> > > >        ballot();
> > > >    } while (... /* non-uniform */);
> > > > }
> > >
> > > I'm not sure if I follow this example, could you and explain a bit
> more?
> > > It looks to me that the condition in the "if" must be false (if the
> > > same condition is used in the while), or we would
> > > call ballot the wrong number of times.
> >
> > Yes, the idea is that the same condition is used in the if and the
> do-while. I think I messed up the example a little... in the second
> snippet, we're supposed to break out of the inner loop if the outer loop's
> exit condition is true. Here's a more concrete example:
> >
> > for (int i = 0; i < 2; i++) {
> >    foo = ballot(true); // ballot 1
> >
> >    if (threadID /* ID of the thread within a wavefront/warp */ % 2 == 0)
> continue;
> >
> >    bar = ballot(true); // ballot 2
> > }
> >
> > versus:
> >
> > int i = 0;
> > while (true) {
> >    do {
> >        if (i == 2) break;
> >        foo = ballot(true); // ballot 1
> >        i++;
> >    } while (threadID % 2 == 0);
> >
> >    if (i == 2) break;
> >    bar = ballot(true); // ballot 2
> >    i++;
> > }
> >
> > From a single-threaded perspective, these two snippets are identical,
> even if ballot() writes to arbitrary memory. The first can easily get
> transformed to something like the second when LLVM decides to duplicate the
> final i++ through jump forwarding, and then re-interprets the loop as two
> nested loops and splits the loop header in two. This is what currently
> happens with DOOM when we try to enable subgroup operations with it. Let's
> say there are two threads in a wavefront. Then the execution trace mandated
> by SPIR-V for the first looks like:
> >
> > thread 0        | thread 1
> > ballot 1 = 0b11 | ballot 1 = 0b11
> > skipped        | ballot 2 = 0b10
> > ballot 1 = 0b11 | ballot 1 = 0b11
> > skipped        | ballot 2 = 0b10
> >
> > Now, contrast this with the execution trace that programmers would
> expect for the second example:
> >
> > thread 0        | thread 1
> > ballot 1 = 0b11 | ballot 1 = 0b11
> > ballot 1 = 0b01 | skipped
> > skipped        | ballot 2 = 0b10
> > skipped        | ballot 1 = 0b10
> > skipped        | ballot 2 = 0b10
> >
> > Nicolai's proposal solves this by having the frontend emit a merge
> intrinsic before the i++ is emitted. This prevents the jump forwarding from
> occurring.
> >
> >
> > >
> > > About the CSE, when would that be legal? I can imagine with uniform
> > > branches that it could work, but would like to see an example to
> > > fully understand this.
> > >
> > > I agree that it would be more conservative than if we model the
> threading,
> > > but I'm not sure about the cost/benefit. I am mostly curious if it is
> > > possible to have a single-thread view or not. Then we would have to
> > > see if it is adequate.
> > >
> > > - Jan
> > >
> > > On Thursday, January 24, 2019, 10:31:47 AM EST, Connor Abbott <
> cwabbott0 at gmail.com> wrote:
> > >
> > >
> > > I don't see how this would fix the continue vs. nested loop problem I
> > > explained earlier. That is, how would this prevent turning:
> > >
> > > for (...) {
> > >    ballot();
> > >    if (... /* non-uniform */) continue;
> > > }
> > >
> > > into
> > >
> > > for (...) {
> > >    do {
> > >        ballot();
> > >    } while (... /* non-uniform */);
> > > }
> > >
> > > and vice versa? Note that there's no duplication going on here, and
> > > the single-threaded flow of control is exactly the same.
> > >
> > > Another reason this isn't so great is that it prevents e.g. CSE on
> > > ballots that actually should be combined, since you're modelling it as
> > > a write. It seems like the optimizer is going to need some special
> > > knowledge of convergent things that fake memory constraints can't give
> > > us.
> > >
> > > On Thu, Jan 24, 2019 at 4:06 PM Jan Sjodin <jan_sjodin at yahoo.com>
> wrote:
> > > >
> > > >
> > > > I was looking into ballot() and how if it is possible to keep a
> single-threaded
> > > > view of the code, but add some extra conditions that must hold after
> the
> > > > transformation. I had the initial idea that each call to ballot() in
> a
> > > > single-threaded program can be seen as a partial write to a memory
> > > > location, and each location memory location is unique for every call
> site,
> > > > plus there some externally observable side effect. We can abstract
> this
> > > > away by tagging the calls, e.g. by using aliases.
> > > >
> > > > For example:
> > > >
> > > > if (...) {
> > > >      foo1 = ballot();
> > > > } else {
> > > >      foo2 = ballot();
> > > > }
> > > >
> > > > simply becomes:
> > > >
> > > > if (...) {
> > > >      foo1 = ballot_1();
> > > > } else {
> > > >      foo2 = ballot_2();
> > > > }
> > > >
> > > >
> > > > and
> > > >
> > > > if (...) {
> > > > } else {
> > > > }
> > > > ballot();
> > > >
> > > > becomes
> > > >
> > > > if (...) {
> > > > } else {
> > > > }
> > > > ballot_1();
> > > >
> > > > In the first case it would prevent combining the two calls into one
> > > > after the if. In the second example there is generally nothing that
> > > > says it could not be transformed into the first example with two
> > > > calls to ballot_1(), which should not be allowed.
> > > >
> > > > Another form of duplication that we must allow are loop transforms,
> > > > like unrolling or peeling. These might seem similar to the example
> > > > above, since we are cloning code and with conditions etc. But
> > > > they are different since they calls are in different loop iterations.
> > > >
> > > > The condition that needs to be met is that:
> > > >
> > > > There must be a single path between all cloned ballot_n() functions.
> > > >
> > > > The reason for this condition is that if we clone the same call, then
> > > > the copies must be mutually exclusive, but if they are cloned from
> > > > a loop, there must be a path, or we would skip iterations.
> > > >
> > > > If we want to be more sophisticated we can add:
> > > >
> > > > If there is no such path, the calls must be separated by uniform
> branches.
> > > >
> > > > After the transform everything should be re-tagged, since we already
> > > > checked the calls and we don't want to check them again. Also, not
> all
> > > > transforms need (or should) have the tagging checked. One example is
> > > > inlining, where multiple copies are created, but they are clearly
> different
> > > > calls. The tagging can be done temporarily for a single pass, and
> then
> > > > eliminated. This could be a good tool for debugging as well, since
> it can
> > > > detect if a transform is suspect.
> > > >
> > > > The code would of course have to make sense as far as control flow.
> If
> > > > we have:
> > > >
> > > > for(;;) {
> > > >    if(...) {
> > > >      ballot();
> > > >      break;
> > > >    }
> > > > }
> > > >
> > > > This would have to be translated to:
> > > >
> > > > for(;;) {
> > > >    if(...) {
> > > >      ballot();
> > > >      if(UnknownConstant) {  // Not a uniform condition, but later on
> translated to "true"
> > > >        break;
> > > >      }
> > > > }
> > > >
> > > > However, I think that this is the way the code is generated today
> anyway.
> > > > There would have to be some attribute that indicate that these calls
> (or functions)
> > > > contain ballot or other cross-lane operations so they could be
> tagged and
> > > > checked. The attribute would be used by the passes to know that the
> special
> > > > conditions exist for those calls.
> > > >
> > > > As far as what it means to have a path, it could be complicated.
> > > > For example:
> > > >
> > > > x = ...
> > > > ballot_1();
> > > >
> > > > could be transformed to:
> > > >
> > > > if (x < 4711) {
> > > >  ballot_1();
> > > >
> > > > if(x >= 4711) {
> > > >  ballot_1();
> > > > }
> > > >
> > > > So a simple path check would say there is a path, and the transform
> is legal,
> > > > but if we examine the conditions, there is no path, and the
> transform should not be legal.
> > > > It could be made even more obscure of course, but I don't see any
> optimizations really
> > > > doing this kind of thing,
> > > >
> > > > - Jan
> > > >
> > > > On Saturday, December 29, 2018, 11:32:25 AM EST, Nicolai Hähnle via
> llvm-dev <llvm-dev at lists.llvm.org> wrote:
> > > >
> > > >
> > > > On 20.12.18 18:03, Connor Abbott wrote:
> > > > >    We already have the notion of "convergent" functions like
> > > > >    syncthreads(), to which we cannot add control-flow dependencies.
> > > > >    That is, it's legal to hoist syncthreads out of an "if", but
> it's
> > > > >    not legal to sink it into an "if".  It's not clear to me why we
> > > > >    can't have "anticonvergent" (terrible name) functions which
> cannot
> > > > >    have control-flow dependencies removed from them?  ballot()
> would be
> > > > >    both convergent and anticonvergent.
> > > > >
> > > > >    Would that solve your problem?
> > > > >
> > > > >
> > > > > I think it's important to note that we already have such an
> attribute,
> > > > > although with the opposite sense - it's impossible to remove
> control
> > > > > flow dependencies from a call unless you mark it as "speculatable".
> > > >
> > > > This isn't actually true. If both sides of an if/else have the same
> > > > non-speculative function call, it can still be moved out of control
> flow.
> > > >
> > > > That's because doing so doesn't change anything at all from a
> > > > single-threaded perspective. Hence why I think we should model the
> > > > communication between threads honestly.
> > > >
> > > >
> > > > > However, this doesn't prevent
> > > > >
> > > > > if (...) {
> > > > > } else {
> > > > > }
> > > > > foo = ballot();
> > > > >
> > > > > from being turned into
> > > > >
> > > > > if (...) {
> > > > >      foo1 = ballot();
> > > > > } else {
> > > > >      foo2 = ballot();
> > > > > }
> > > > > foo = phi(foo1, foo2)
> > > > >
> > > > > and vice versa. We have a "noduplicate" attribute which prevents
> > > > > transforming the first into the second, but not the other way
> around. Of
> > > > > course we could keep going this way and add a "nocombine"
> attribute to
> > > > > complement noduplicate. But even then, there are even still
> problematic
> > > > > transforms. For example, take this program, which is simplified
> from a
> > > > > real game that doesn't work with the AMDGPU backend:
> > > > >
> > > > > while (cond1 /* uniform */) {
> > > > >      ballot();
> > > > >      ...
> > > > >      if (cond2 /* non-uniform */) continue;
> > > > >      ...
> > > > > }
> > > > >
> > > > > In SPIR-V, when using structured control flow, the semantics of
> this are
> > > > > pretty clearly defined. In particular, there's a continue block
> after
> > > > > the body of the loop where control flow re-converges, and the only
> back
> > > > > edge is from the continue block, so the ballot is in uniform
> control
> > > > > flow. But LLVM will get rid of the continue block since it's
> empty, and
> > > > > re-analyze the loop as two nested loops, splitting the loop header
> in
> > > > > two, producing a CFG which corresponds to this:
> > > > >
> > > > > while (cond1 /* uniform */) {
> > > > >      do {
> > > > >          ballot();
> > > > >          ...
> > > > >      } while (cond2 /* non-uniform */);
> > > > >      ...
> > > > > }
> > > > >
> > > > > Now, in an implementation where control flow re-converges at the
> > > > > immediate post-dominator, this won't do the right thing anymore. In
> > > > > order to handle it correctly, you'd effectively need to always
> flatten
> > > > > nested loops, which will probably be really bad for performance if
> the
> > > > > programmer actually wanted the second thing. It also makes it
> impossible
> > > > > when translating a high-level language to LLVM to get the "natural"
> > > > > behavior which game developers actually expect. This is exactly
> the sort
> > > > > of "spooky action at a distance" which makes me think that
> everything
> > > > > we've done so far is really insufficient, and we need to add an
> explicit
> > > > > notion of control-flow divergence and reconvergence to the IR. We
> need a
> > > > > way to say that control flow re-converges at the continue block,
> so that
> > > > > LLVM won't eliminate it, and we can vectorize it correctly without
> > > > > penalizing cases where it's better for control flow not to
> re-converge.
> > > >
> > > > Well said!
> > > >
> > > > Cheers,
> > > >
> > > > Nicolai
> > > > --
> > > > Lerne, wie die Welt wirklich ist,
> > > > Aber vergiss niemals, wie sie sein sollte.
> > > > _______________________________________________
> > > > LLVM Developers mailing list
> > > > llvm-dev at lists.llvm.org
> > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190130/fdf707e7/attachment-0001.html>