[llvm-dev] [RFC] Adding thread group semantics to LangRef (motivated by GPUs)
Nicolai Hähnle via llvm-dev
llvm-dev at lists.llvm.org
Sat Feb 9 12:01:50 PST 2019
On 09.02.19 18:06, Connor Abbott wrote:
> On Sat, Feb 9, 2019 at 4:44 PM Jan Sjodin <jan_sjodin at yahoo.com
> <mailto:jan_sjodin at yahoo.com>> wrote:
>
> > The reason I'm looking for solutions that can work without "scanning the
> > code" or "spooky action at a distance" is that we should have a solution
> > that's easily digestible by folks who are not aware of GPU execution models.
> >
> > The fallback model for such folks should be: if your pass doesn't touch
> > specially-marked calls (whether that marker is "convergent" or something
> > else), then you don't have to worry. Which, as a corollary, means that
> > transforms can be conservatively correct by not touching such calls.
> >
> > I don't see how to achieve this goal while forcing passes to do specific
> > code scans.
>
> I'm wondering if using instructions/intrinsics is the right thing, or
> is it a property of a basic block that control flow converges? Can the
> intrinsics exist anywhere in a basic block? Is there a difference
> between instructions in the same BB before and after a convergence
> instrinsic?
>
>
> The way I was thinking, we'd allow the merge instruction anywhere in a
> basic block. Otherwise, we'd have to allow for cases where there's a
> trivial edge from a basic block with one sucessor to a block with one
> predecessor that can't be eliminated. For example, in the single-break
> case mentioned earlier:
>
> while (true) {
> if (...) {
> ballot(); // should be executed before re-converging.
> break;
> }
> }
> ballot(); // should be executed after re-converging
>
> With a merge instruction, the two ballots could be in the same basic
> block, with a merge instruction separating them, versus two basic blocks
> with the second marked as a merge block. Of course, this does have the
> disadvantage that the backend would have to duplicate any convergent
> operations before the merge instruction/intrinsic if the block has
> multiple predecessors. I think someone will have to actually implement
> it and see which is better.
Agreed.
> If the IR will be multi-threaded by default (otherwise a flag for a
> module would be needed), then I'm thinking that we would need a more
> fundamental change to the IR to represent this, especially since
> convergence can be specified without the presence of cross-lane
> functions.
>
>
> I think that whatever we do, we definitely shouldn't allow any
> convergent calls in a function unless the function definition itself is
> marked convergent. That is, no one has to worry about cross-thread
> operations in a function, and passes can even delete merge instructions
> (although not recommended for performance!) unless it's marked
> convergent. I believe this is already the case now. And since
> convergence has to be specified somehow as soon as there are any
> convergent calls, I don't think it's that much more radical to be able
> to specify convergence without them. We need a fundamental change to the
> IR to be able to correctly handle ballot() and friends anyways, and once
> you accept that, then some kind of ability to explicitly specify merge
> points isn't really all that much more of a change, and makes it a lot
> easier to adapt optimization passes.
>
>
> >>> It's certainly interesting to think about how to maintain correctness
> >>> in the face of ballots() with such a pass, but a) it's certainly no
> >>> harder with merge intrinsics than merges being implicit and b) I doubt
> >>> that's useful for anything you'd want to do with a GPU.
> >>
> >> Irreducible control flow has to be handled somehow, and linearization
> >> is the only transform I know of, that will handle everything. I'm not
> >> sure what the execution model says about irreducible control flow.
> >
> > For what it's worth, the reconverging CFG approach can also handle
> > arbitrary irreducible control flow.
> >
> > Whether (or how) it can do this while being compatible with whatever
> > semantics we come up with for cross-lane operations remains to be seen.
>
> I am wondering about execution order and if that is part of the
> specification?
> For example:
>
> if (A) {
> L1:
> ballot();
> if (B)
> goto L2;
> else
> goto END:
> } else {
> L2:
> ballot();
> if (C)
> goto L1;
> else
> goto END;
> }
> END:
>
> Should the L1 ballot be executed first, or the L2, or does it not
> matter? I'm not sure what the ballots should return. Would swapping
> the branches from A (use !A instead of A) be a legal transform?
>
>
> The proposal doesn't, and I don't think we should, specify which order
> the diverging threads should execute in.
Agreed.
Basically if your high-level source code contains goto's, you should
just be on your own - within reason, to ensure composability of code.
For example, a reasonable rule would be that ballot & friends are
undefined after a goto instruction, until execution leaves the innermost
scope which is SESE (cannot be entered or left by goto jumps).
Cheers,
Nicolai
> That's going to be highly
> architecture-specific, and might even be determined at runtime. That is,
> threads that have diverged should be treated just like completely
> separate threads. (There's a subtle difference wrt atomics and forward
> progress guarantees on some architectures, since execution of divergent
> threads is usually serialized, though... but close enough). So yes,
> swapping the branch condition should be allowed.
>
> For your example, I think that under Nicolai's proposal only a merge
> instruction after END would do anything (since none of the other basic
> blocks post-dominate anything). Then control flow would only be
> guaranteed to re-converge at the very end, and each unique path through
> the loop could eventually split into its own thread group. We might want
> to allow control to converge earlier, to make it easier to map this to
> structured control flow. Of course, if you're writing GPU-oriented code
> with divergent *and* irreducible control flow like that, then... well,
> there's not much hope anyways :)
>
>
> - Jan
>
>
>
> On Friday, February 1, 2019, 4:57:54 PM EST, Nicolai Hähnle
> <nhaehnle at gmail.com <mailto:nhaehnle at gmail.com>> wrote:
>
>
> On 31.01.19 15:59, Jan Sjodin wrote:
> >> > Any transform that re-arranges control flow would potentially
> have to
> >> > know about the properties of ballot(), and the rules with
> respect to
> >> > the CFG (and maybe consider the target) to know where to
> insert the
> >> > intrinsics.
> >
> >> But the same is true for basically any approach to handling this. In
> >> fact, adding the merge intrinsics makes this much easier. Beyond the
> >> usual problems with hoisting ballots, which passes currently
> avoid via
> >> the current convergent + speculatable attributes, we'd only have to
> >> add awareness to passes that they can't duplicate/combine merge
> >> intrinsics or move them past convergent intrinsics, which is a local
> >> property and hence easily checkable. In example I explained, without
> >> some kind of merge intrinsic, tail duplication has to look at the
> >> entire loop to know whether the transform is legal. Of course, you
> >> could have some kind of "no convergent calls" flag on a
> function, but
> >> that doesn't eliminate the nastyness when there are convergent
> calls.
> >
> > We will have to determine if the intrinsics are worth more
> compared to
> > scanning the code.
>
> The reason I'm looking for solutions that can work without "scanning
> the
> code" or "spooky action at a distance" is that we should have a
> solution
> that's easily digestible by folks who are not aware of GPU execution
> models.
>
> The fallback model for such folks should be: if your pass doesn't touch
> specially-marked calls (whether that marker is "convergent" or
> something
> else), then you don't have to worry. Which, as a corollary, means that
> transforms can be conservatively correct by not touching such calls.
>
> I don't see how to achieve this goal while forcing passes to do
> specific
> code scans.
>
>
> >> > I had the impression that the control flow convergence was
> >> > in part specified by what the target architecture can handle.
> >
> >> This is true, although hopefully we can agree on something that
> >> everyone can implement.
> >
> > Yes, hopefully there is an execution model that is works for
> everyone.
> >
> >> > One of
> >> > the more complicated cases would be linearization where the
> control
> >> > flow is completely rewritten, and is encoded in a variable
> that says
> >> > which basic block is the next one to execute.
> >
> >> It's certainly interesting to think about how to maintain
> correctness
> >> in the face of ballots() with such a pass, but a) it's certainly no
> >> harder with merge intrinsics than merges being implicit and b) I
> doubt
> >> that's useful for anything you'd want to do with a GPU.
> >
> > Irreducible control flow has to be handled somehow, and linearization
> > is the only transform I know of, that will handle everything. I'm not
> > sure what the execution model says about irreducible control flow.
>
> For what it's worth, the reconverging CFG approach can also handle
> arbitrary irreducible control flow.
>
> Whether (or how) it can do this while being compatible with whatever
> semantics we come up with for cross-lane operations remains to be seen.
>
>
> >> > Another case is DCE,
> >> > where a ballot() could be eliminated, and it would
> potentially have to
> >> > remove a number of intrinsics to enable later optimizations
> (unless it
> >> > would affect performance?), so the intrinsics will require some
> >> > non-local updates.
> >
> >> Removing merge intrinsics if it's profitable and allowed is a
> >> separate optimization, that can be done relatively quickly in a
> single
> >> pass without impacting the performance of other optimizations. It's
> >> requiring expensive non-local checks in optimizations which modify
> >> control flow that's a problem.
> >
> > There are certainly a lot of tradeoffs. My point was simply that the
> > intrinsics are not strictly local.
>
> Right. I have some ideas for how to fix this, but wasn't able to
> work on
> them for the last weeks.
>
> Cheers,
> Nicolai
>
>
> >> > So we might have them without a ballot(), which would seem to
> make it
> >> > more difficult for structurizers or other transforms to
> maintain the
> >> > semantics and insert intrinsics.
> >
> >> It's not any more difficult code-wise, since the case where there is
> >> a ballot needs to be handled anyways. And while it might take longer
> >> to process stuff when this hypothetical pass encounters a merge
> >> intrinsic (I can't think of a real-world case where it would
> matter),
> >> it should result in better code generated.
> >
> > I was thinking of linearization or something similar, where an
> > instrinsic by itself might be hard to preserve, compared to
> looking at
> > a ballot() and derive where intrinsics should be inserted.
> >
> >> > > > How are uniform branches handled? Do they affect the
> convergence model?
> >> > > >
> >> > > We may be able to remove convergence points if branches are
> uniform.
> >> > > In Nicolai's proposal, I believe we'd want to remove a merge
> intrinsic
> >> > > when all the branches that it post-dominates that aren't also
> >> > > post-dominated by some other merge intrinsic are uniform.
> >> >
> >> > I couldn't quite understand the last sentence, but I assume the
> >> > conditions would prevent removing convergence points that help
> >> > performance. Post-domination might not be adequate if there
> are loops
> >> > btw.
> >>
> >> It should be -- Nicolai's proposal is that a merge intrinsic merges
> >> all the divergences caused by branches post dominated by the
> >> intrinsic, so if all the divergent branches are merged by some other
> >> intrinsic earlier, then there's no divergence and the merge
> intrinsic
> >> is a no-op.
> >
> > Makes sense.
> >
> > - Jan
> > On Wednesday, January 30, 2019, 11:41:29 AM EST, Connor Abbott
> > <cwabbott0 at gmail.com <mailto:cwabbott0 at gmail.com>> wrote:
> >
> >
> > On Wed, Jan 30, 2019 at 4:20 PM Jan Sjodin <jan_sjodin at yahoo.com
> <mailto:jan_sjodin at yahoo.com>
> > <mailto:jan_sjodin at yahoo.com <mailto:jan_sjodin at yahoo.com>>> wrote:
> >
> >
> > >
> > > > > for (int i = 0; i < 2; i++) {
> > > > > foo = ballot(true); // ballot 1
> > > > >
> > > > > if (threadID /* ID of the thread within a
> wavefront/warp */ % 2 == 0) continue;
> > > > >
> > > > > bar = ballot(true); // ballot 2
> > > > > }
> > > > >
> > > > > versus:
> > > > >
> > > > > int i = 0;
> > > > > while (true) {
> > > > > do {
> > > > > if (i == 2) break;
> > > > > foo = ballot(true); // ballot 1
> > > > > i++;
> > > > > } while (threadID % 2 == 0);
> > > > >
> > > > > if (i == 2) break;
> > > > > bar = ballot(true); // ballot 2
> > > > > i++;
> > > > > }
> > > >
> > > > I think you can remove the second "i++", otherwise we can
> increment "i" twice
> > > > if threadID % 2 != 0, but I see the issue. Using the
> equivalence classes would
> > > > prevent this transform, since we have re-arranged the
> control flow in that way,
> > >
> > > What do you mean? Note that the transforms that lead from
> the first
> > > example to the second are actually desirable if there aren't any
> > > ballots or other convergent operations, so banning them
> entirely is a
> > > no-go. Otherwise, note that the ballot here can be nested
> arbitrarily
> > > deep, which means that jump forwarding/tail duplication has
> to be
> > > aware of the entire loop unless we have some kind of merge
> intrinsic.
> >
> > Yes, if there weren't any calls to a ballot, then the
> transform would
> > be legal. What I was saying in the beginning was that ballot()
> would
> > have some special rules attached to it. It is of course
> undesirable to
> > have flags to enforce correctness, but that is the point we are at
> > right now.
> >
> > > Also, I should say that while interesting, control
> equivalence classes
> > > can be part of the *implementation*, but not the
> *specification*. We
> > > need
> > > to define the semantics of the IR -- that is, how a program
> > > compiled from any given IR is allowed to do when executed
> (and *not*
> > > what it looks like when compiled) -- *first*, and then which
> > > transforms are allowed/not allowed will fall out of that. We
> can't
> > > start by listing disallowed transforms, because then when
> someone
> > > comes along and writes a new optimization, it might be
> technically
> > > "allowed" even though it breaks something in practice.
> >
> > I think you misunderstand me if you think I was listing disallowed
> > transforms. My question was if it is possible to have
> multi-threaded
> > semantics in the source language, but in the compiler have a
> > single-threaded view, where some properties of the CFG would
> determine
> > what is legal and not for some functions with a special flag.
> >
> >
> > But that's practically the same as listing allowed and disallowed
> > transforms -- you're defining what the final IR is allowed to
> look like,
> > not how it's allowed to execute. If at all possible, the former
> should
> > be derived from the latter.
> >
> > I agree
> > it is more desirable to have the the semantics specified in the
> > IR. However, I am exploring this from a practical point of
> view, since
> > ballot() is very rare compared to all code that is being
> compiled for
> > all targets. These intrinsics would always have to be
> considered when
> > writing a pass. They seem to me harder to think about, and
> > test, for someone who is working on a single-threaded target,
> compared
> > to looking at a flag. If we allow ourselves to look at which
> > transforms that might violate these properties, we could would
> free up
> > the rest of the compiler (and developers) to not have to worry
> about
> > these things. Intrinsics would have to be maintained
> throughout the
> > entire compilation process in every pass.
> >
> > > The only way to
> > > conclusively prove that transforms will never break code that's
> > > supposed to work (except for bugs in the transforms) is to
> define the
> > > semantics of the IR, and then to make sure that it refines the
> > > semantics of the source language. This is why the LangRef
> almost never
> > > talks about allowed/disallowed transforms except as examples to
> > > explain some given semantics, and if you don't follow that
> rule, your
> > > patch will probably be rejected.
> >
> > I'm trying to figure out if we are in the "almost never" territory
> > here or not.
> >
> >
> > I don't think so at all. In addition to being much, much harder to
> > reason about and prove that the approach is sound, I think it'll
> be more
> > intrusive.
> >
> >
> > > Now, to define a semantics for ballot(), we need to define
> what it's
> > > allowed to return for any given execution of any given
> program, and to
> > > do that, we need to define which threads must be active
> together when
> > > it's reached, which in turns means we need to define when
> control flow
> > > can re-converge somehow. Any proposal for how to handle
> ballot() must
> > > start here.
> >
> > > > I'm not sure if using these rules will be easier or harder
> than dealing with
> > > > intrinsics. One problem is that the rules for
> single-threaded code might seem
> > > > arbitrary, and it would be hard to reason about them in a
> larger context.
> >
> > > > What happens to new control flow created by transforms,
> and what will guide
> > > > the insertion of intrinsics in the new code? Code
> structurization is one example
> > > > were this could happen.
> >
> > > Hopefully, it's clear from the above how this all falls out. The
> > > frontend for e.g. GLSL or SPIR-V would have to insert the merge
> > > intrinsics to preserve the semantics of the source language. Any
> > > transforms must refine the semantics of the IR, although I
> can't think
> > > of a scenario where that would involve emitting any new merge
> > > intrinsics. Usually, structurized IR's have their own
> semantics about
> > > when control flow converges, so a code structurizer should
> respect the
> > > original semantics. AMDGPU has its own code structurizer
> that runs
> > > very late in the pipeline (although there are plans to
> remove it), so
> > > we might have to change that to make it respect the merge
> intrinsic
> > > intrinsics, and then we'll correctly implement them "for free".
> >
> > Any transform that re-arranges control flow would potentially
> have to
> > know about the properties of ballot(), and the rules with
> respect to
> > the CFG (and maybe consider the target) to know where to
> insert the
> > intrinsics.
> >
> >
> > But the same is true for basically any approach to handling this. In
> > fact, adding the merge intrinsics makes this much easier. Beyond the
> > usual problems with hoisting ballots, which passes currently
> avoid via
> > the current convergent + speculatable attributes, we'd only have
> to add
> > awareness to passes that they can't duplicate/combine merge
> intrinsics
> > or move them past convergent intrinsics, which is a local
> property and
> > hence easily checkable. In example I explained, without some kind of
> > merge intrinsic, tail duplication has to look at the entire loop
> to know
> > whether the transform is legal. Of course, you could have some
> kind of
> > "no convergent calls" flag on a function, but that doesn't
> eliminate the
> > nastyness when there are convergent calls.
> >
> > I had the impression that the control flow convergence was
> > in part specified by what the target architecture can handle.
> >
> >
> > This is true, although hopefully we can agree on something that
> everyone
> > can implement.
> >
> > One of
> > the more complicated cases would be linearization where the
> control
> > flow is completely rewritten, and is encoded in a variable
> that says
> > which basic block is the next one to execute.
> >
> >
> > It's certainly interesting to think about how to maintain
> correctness in
> > the face of ballots() with such a pass, but a) it's certainly no
> harder
> > with merge intrinsics than merges being implicit and b) I doubt
> that's
> > useful for anything you'd want to do with a GPU.
> >
> > Another case is DCE,
> > where a ballot() could be eliminated, and it would potentially
> have to
> > remove a number of intrinsics to enable later optimizations
> (unless it
> > would affect performance?), so the intrinsics will require some
> > non-local updates.
> >
> >
> > Removing merge intrinsics if it's profitable and allowed is a
> separate
> > optimization, that can be done relatively quickly in a single pass
> > without impacting the performance of other optimizations. It's
> requiring
> > expensive non-local checks in optimizations which modify control
> flow
> > that's a problem.
> >
> >
> > > > > Would they only be present if ballot and similar
> functions are used, or do they
> > > > > have to be present everywhere?
> > > >
> > > > They'd only have to be present when ballot or other convergent
> > > > functions are called, since otherwise it doesn't matter
> when control
> > > flow re-converges. However, we may want to keep them around for
> > > performance reasons (usually destroying convergence points
> is bad for
> > > performance).
> >
> > So we might have them without a ballot(), which would seem to
> make it
> > more difficult for structurizers or other transforms to
> maintain the
> > semantics and insert intrinsics.
> >
> >
> > It's not any more difficult code-wise, since the case where there
> is a
> > ballot needs to be handled anyways. And while it might take
> longer to
> > process stuff when this hypothetical pass encounters a merge
> intrinsic
> > (I can't think of a real-world case where it would matter), it
> should
> > result in better code generated.
> >
> >
> > > > How are uniform branches handled? Do they affect the
> convergence model?
> > > >
> > > We may be able to remove convergence points if branches are
> uniform.
> > > In Nicolai's proposal, I believe we'd want to remove a merge
> intrinsic
> > > when all the branches that it post-dominates that aren't also
> > > post-dominated by some other merge intrinsic are uniform.
> >
> > I couldn't quite understand the last sentence, but I assume the
> > conditions would prevent removing convergence points that help
> > performance. Post-domination might not be adequate if there
> are loops
> > btw.
> >
> >
> > It should be -- Nicolai's proposal is that a merge intrinsic
> merges all
> > the divergences caused by branches post dominated by the
> intrinsic, so
> > if all the divergent branches are merged by some other intrinsic
> > earlier, then there's no divergence and the merge intrinsic is a
> no-op.
> >
> >
> > - Jan
> >
> >
> >
> > On Wednesday, January 30, 2019, 6:29:52 AM EST, Connor Abbott
> > <cwabbott0 at gmail.com <mailto:cwabbott0 at gmail.com>
> <mailto:cwabbott0 at gmail.com <mailto:cwabbott0 at gmail.com>>> wrote:
> >
> >
> > On Mon, Jan 28, 2019 at 9:09 PM Jan Sjodin
> <jan_sjodin at yahoo.com <mailto:jan_sjodin at yahoo.com>
> > <mailto:jan_sjodin at yahoo.com <mailto:jan_sjodin at yahoo.com>>>
> wrote:
> > >
> > > > for (int i = 0; i < 2; i++) {
> > > > foo = ballot(true); // ballot 1
> > > >
> > > > if (threadID /* ID of the thread within a
> wavefront/warp */
> > % 2 == 0) continue;
> > > >
> > > > bar = ballot(true); // ballot 2
> > > > }
> > > >
> > > > versus:
> > > >
> > > > int i = 0;
> > > > while (true) {
> > > > do {
> > > > if (i == 2) break;
> > > > foo = ballot(true); // ballot 1
> > > > i++;
> > > > } while (threadID % 2 == 0);
> > > >
> > > > if (i == 2) break;
> > > > bar = ballot(true); // ballot 2
> > > > i++;
> > > > }
> > >
> > > I think you can remove the second "i++", otherwise we can
> > increment "i" twice
> > > if threadID % 2 != 0, but I see the issue. Using the
> equivalence
> > classes would
> > > prevent this transform, since we have re-arranged the control
> > flow in that way,
> >
> > What do you mean? Note that the transforms that lead from the
> first
> > example to the second are actually desirable if there aren't any
> > ballots or other convergent operations, so banning them
> entirely is a
> > no-go. Otherwise, note that the ballot here can be nested
> arbitrarily
> > deep, which means that jump forwarding/tail duplication has to be
> > aware of the entire loop unless we have some kind of merge
> intrinsic.
> >
> > Also, I should say that while interesting, control equivalence
> classes
> > can be part of the *implementation*, but not the
> *specification*. We
> > need to define the semantics of the IR -- that is, how a program
> > compiled from any given IR is allowed to do when executed (and
> *not*
> > what it looks like when compiled) -- *first*, and then which
> > transforms are allowed/not allowed will fall out of that. We can't
> > start by listing disallowed transforms, because then when someone
> > comes along and writes a new optimization, it might be technically
> > "allowed" even though it breaks something in practice. The
> only way to
> > conclusively prove that transforms will never break code that's
> > supposed to work (except for bugs in the transforms) is to
> define the
> > semantics of the IR, and then to make sure that it refines the
> > semantics of the source language. This is why the LangRef
> almost never
> > talks about allowed/disallowed transforms except as examples to
> > explain some given semantics, and if you don't follow that
> rule, your
> > patch will probably be rejected.
> >
> > Now, to define a semantics for ballot(), we need to define
> what it's
> > allowed to return for any given execution of any given
> program, and to
> > do that, we need to define which threads must be active
> together when
> > it's reached, which in turns means we need to define when
> control flow
> > can re-converge somehow. Any proposal for how to handle
> ballot() must
> > start here.
> >
> > > I'm not sure if using these rules will be easier or harder
> than
> > dealing with
> > > intrinsics. One problem is that the rules for single-threaded
> > code might seem
> > > arbitrary, and it would be hard to reason about them in a
> larger
> > context.
> > >
> > > > Nicolai's proposal solves this by having the frontend emit a
> > merge intrinsic
> > > > before the i++ is emitted. This prevents the jump forwarding
> > from occurring.
> > >
> > > I was thinking about getting through the single-thread
> view and
> > the issues with
> > > that first, then I will think more about the multi-thread and
> > explicit convergence.
> > >
> > > If we are done with the single-thread stuff for now these
> are the
> > question that I
> > > have been thinking about with the multi-threaded view:
> > >
> > > What happens to new control flow created by transforms,
> and what
> > will guide
> > > the insertion of intrinsics in the new code? Code
> structurization
> > is one example
> > > were this could happen.
> >
> > Hopefully, it's clear from the above how this all falls out. The
> > frontend for e.g. GLSL or SPIR-V would have to insert the merge
> > intrinsics to preserve the semantics of the source language. Any
> > transforms must refine the semantics of the IR, although I
> can't think
> > of a scenario where that would involve emitting any new merge
> > intrinsics. Usually, structurized IR's have their own
> semantics about
> > when control flow converges, so a code structurizer should
> respect the
> > original semantics. AMDGPU has its own code structurizer that runs
> > very late in the pipeline (although there are plans to remove
> it), so
> > we might have to change that to make it respect the merge
> intrinsic
> > intrinsics, and then we'll correctly implement them "for free".
> >
> > >
> > > Would they only be present if ballot and similar functions are
> > used, or do they
> > > have to be present everywhere?
> >
> > They'd only have to be present when ballot or other convergent
> > functions are called, since otherwise it doesn't matter when
> control
> > flow re-converges. However, we may want to keep them around for
> > performance reasons (usually destroying convergence points is
> bad for
> > performance).
> >
> > >
> > > How are uniform branches handled? Do they affect the
> convergence
> > model?
> >
> > We may be able to remove convergence points if branches are
> uniform.
> > In Nicolai's proposal, I believe we'd want to remove a merge
> intrinsic
> > when all the branches that it post-dominates that aren't also
> > post-dominated by some other merge intrinsic are uniform.
> >
> >
> >
> > >
> > >
> > > - Jan
> > >
> > >
> > > On Monday, January 28, 2019, 11:16:36 AM EST, Connor Abbott
> > <cwabbott0 at gmail.com <mailto:cwabbott0 at gmail.com>
> <mailto:cwabbott0 at gmail.com <mailto:cwabbott0 at gmail.com>>> wrote:
> > >
> > >
> > >
> > > On Fri, Jan 25, 2019 at 3:05 AM Jan Sjodin
> <jan_sjodin at yahoo.com <mailto:jan_sjodin at yahoo.com>
> > <mailto:jan_sjodin at yahoo.com <mailto:jan_sjodin at yahoo.com>>>
> wrote:
> > > >
> > > > > for (...) {
> > > > > ballot();
> > > > > if (... /* non-uniform */) continue;
> > > > > }
> > > > >
> > > > > into
> > > > >
> > > > > for (...) {
> > > > > do {
> > > > > ballot();
> > > > > } while (... /* non-uniform */);
> > > > > }
> > > >
> > > > I'm not sure if I follow this example, could you and
> explain a
> > bit more?
> > > > It looks to me that the condition in the "if" must be
> false (if the
> > > > same condition is used in the while), or we would
> > > > call ballot the wrong number of times.
> > >
> > > Yes, the idea is that the same condition is used in the if and
> > the do-while. I think I messed up the example a little... in the
> > second snippet, we're supposed to break out of the inner loop
> if the
> > outer loop's exit condition is true. Here's a more concrete
> example:
> > >
> > > for (int i = 0; i < 2; i++) {
> > > foo = ballot(true); // ballot 1
> > >
> > > if (threadID /* ID of the thread within a
> wavefront/warp */ %
> > 2 == 0) continue;
> > >
> > > bar = ballot(true); // ballot 2
> > > }
> > >
> > > versus:
> > >
> > > int i = 0;
> > > while (true) {
> > > do {
> > > if (i == 2) break;
> > > foo = ballot(true); // ballot 1
> > > i++;
> > > } while (threadID % 2 == 0);
> > >
> > > if (i == 2) break;
> > > bar = ballot(true); // ballot 2
> > > i++;
> > > }
> > >
> > > From a single-threaded perspective, these two snippets are
> > identical, even if ballot() writes to arbitrary memory. The first
> > can easily get transformed to something like the second when LLVM
> > decides to duplicate the final i++ through jump forwarding,
> and then
> > re-interprets the loop as two nested loops and splits the loop
> > header in two. This is what currently happens with DOOM when
> we try
> > to enable subgroup operations with it. Let's say there are two
> > threads in a wavefront. Then the execution trace mandated by
> SPIR-V
> > for the first looks like:
> > >
> > > thread 0 | thread 1
> > > ballot 1 = 0b11 | ballot 1 = 0b11
> > > skipped | ballot 2 = 0b10
> > > ballot 1 = 0b11 | ballot 1 = 0b11
> > > skipped | ballot 2 = 0b10
> > >
> > > Now, contrast this with the execution trace that programmers
> > would expect for the second example:
> > >
> > > thread 0 | thread 1
> > > ballot 1 = 0b11 | ballot 1 = 0b11
> > > ballot 1 = 0b01 | skipped
> > > skipped | ballot 2 = 0b10
> > > skipped | ballot 1 = 0b10
> > > skipped | ballot 2 = 0b10
> > >
> > > Nicolai's proposal solves this by having the frontend emit a
> > merge intrinsic before the i++ is emitted. This prevents the jump
> > forwarding from occurring.
> > >
> > >
> > > >
> > > > About the CSE, when would that be legal? I can imagine
> with uniform
> > > > branches that it could work, but would like to see an
> example to
> > > > fully understand this.
> > > >
> > > > I agree that it would be more conservative than if we
> model the
> > threading,
> > > > but I'm not sure about the cost/benefit. I am mostly
> curious if
> > it is
> > > > possible to have a single-thread view or not. Then we
> would have to
> > > > see if it is adequate.
> > > >
> > > > - Jan
> > > >
> > > > On Thursday, January 24, 2019, 10:31:47 AM EST, Connor
> Abbott
> > <cwabbott0 at gmail.com <mailto:cwabbott0 at gmail.com>
> <mailto:cwabbott0 at gmail.com <mailto:cwabbott0 at gmail.com>>> wrote:
> > > >
> > > >
> > > > I don't see how this would fix the continue vs. nested loop
> > problem I
> > > > explained earlier. That is, how would this prevent turning:
> > > >
> > > > for (...) {
> > > > ballot();
> > > > if (... /* non-uniform */) continue;
> > > > }
> > > >
> > > > into
> > > >
> > > > for (...) {
> > > > do {
> > > > ballot();
> > > > } while (... /* non-uniform */);
> > > > }
> > > >
> > > > and vice versa? Note that there's no duplication going
> on here, and
> > > > the single-threaded flow of control is exactly the same.
> > > >
> > > > Another reason this isn't so great is that it prevents
> e.g. CSE on
> > > > ballots that actually should be combined, since you're
> > modelling it as
> > > > a write. It seems like the optimizer is going to need
> some special
> > > > knowledge of convergent things that fake memory constraints
> > can't give
> > > > us.
> > > >
> > > > On Thu, Jan 24, 2019 at 4:06 PM Jan Sjodin
> > <jan_sjodin at yahoo.com <mailto:jan_sjodin at yahoo.com>
> <mailto:jan_sjodin at yahoo.com <mailto:jan_sjodin at yahoo.com>>> wrote:
> > > > >
> > > > >
> > > > > I was looking into ballot() and how if it is possible
> to keep
> > a single-threaded
> > > > > view of the code, but add some extra conditions that must
> > hold after the
> > > > > transformation. I had the initial idea that each call to
> > ballot() in a
> > > > > single-threaded program can be seen as a partial write
> to a
> > memory
> > > > > location, and each location memory location is unique for
> > every call site,
> > > > > plus there some externally observable side effect. We can
> > abstract this
> > > > > away by tagging the calls, e.g. by using aliases.
> > > > >
> > > > > For example:
> > > > >
> > > > > if (...) {
> > > > > foo1 = ballot();
> > > > > } else {
> > > > > foo2 = ballot();
> > > > > }
> > > > >
> > > > > simply becomes:
> > > > >
> > > > > if (...) {
> > > > > foo1 = ballot_1();
> > > > > } else {
> > > > > foo2 = ballot_2();
> > > > > }
> > > > >
> > > > >
> > > > > and
> > > > >
> > > > > if (...) {
> > > > > } else {
> > > > > }
> > > > > ballot();
> > > > >
> > > > > becomes
> > > > >
> > > > > if (...) {
> > > > > } else {
> > > > > }
> > > > > ballot_1();
> > > > >
> > > > > In the first case it would prevent combining the two calls
> > into one
> > > > > after the if. In the second example there is generally
> > nothing that
> > > > > says it could not be transformed into the first
> example with two
> > > > > calls to ballot_1(), which should not be allowed.
> > > > >
> > > > > Another form of duplication that we must allow are loop
> > transforms,
> > > > > like unrolling or peeling. These might seem similar to the
> > example
> > > > > above, since we are cloning code and with conditions
> etc. But
> > > > > they are different since they calls are in different loop
> > iterations.
> > > > >
> > > > > The condition that needs to be met is that:
> > > > >
> > > > > There must be a single path between all cloned ballot_n()
> > functions.
> > > > >
> > > > > The reason for this condition is that if we clone the same
> > call, then
> > > > > the copies must be mutually exclusive, but if they are
> cloned
> > from
> > > > > a loop, there must be a path, or we would skip iterations.
> > > > >
> > > > > If we want to be more sophisticated we can add:
> > > > >
> > > > > If there is no such path, the calls must be separated by
> > uniform branches.
> > > > >
> > > > > After the transform everything should be re-tagged,
> since we
> > already
> > > > > checked the calls and we don't want to check them again.
> > Also, not all
> > > > > transforms need (or should) have the tagging checked. One
> > example is
> > > > > inlining, where multiple copies are created, but they are
> > clearly different
> > > > > calls. The tagging can be done temporarily for a
> single pass,
> > and then
> > > > > eliminated. This could be a good tool for debugging as
> well,
> > since it can
> > > > > detect if a transform is suspect.
> > > > >
> > > > > The code would of course have to make sense as far as
> control
> > flow. If
> > > > > we have:
> > > > >
> > > > > for(;;) {
> > > > > if(...) {
> > > > > ballot();
> > > > > break;
> > > > > }
> > > > > }
> > > > >
> > > > > This would have to be translated to:
> > > > >
> > > > > for(;;) {
> > > > > if(...) {
> > > > > ballot();
> > > > > if(UnknownConstant) { // Not a uniform
> condition, but
> > later on translated to "true"
> > > > > break;
> > > > > }
> > > > > }
> > > > >
> > > > > However, I think that this is the way the code is
> generated
> > today anyway.
> > > > > There would have to be some attribute that indicate that
> > these calls (or functions)
> > > > > contain ballot or other cross-lane operations so they
> could
> > be tagged and
> > > > > checked. The attribute would be used by the passes to know
> > that the special
> > > > > conditions exist for those calls.
> > > > >
> > > > > As far as what it means to have a path, it could be
> complicated.
> > > > > For example:
> > > > >
> > > > > x = ...
> > > > > ballot_1();
> > > > >
> > > > > could be transformed to:
> > > > >
> > > > > if (x < 4711) {
> > > > > ballot_1();
> > > > >
> > > > > if(x >= 4711) {
> > > > > ballot_1();
> > > > > }
> > > > >
> > > > > So a simple path check would say there is a path, and the
> > transform is legal,
> > > > > but if we examine the conditions, there is no path,
> and the
> > transform should not be legal.
> > > > > It could be made even more obscure of course, but I
> don't see
> > any optimizations really
> > > > > doing this kind of thing,
> > > > >
> > > > > - Jan
> > > > >
> > > > > On Saturday, December 29, 2018, 11:32:25 AM EST, Nicolai
> > Hähnle via llvm-dev <llvm-dev at lists.llvm.org
> <mailto:llvm-dev at lists.llvm.org>
> > <mailto:llvm-dev at lists.llvm.org
> <mailto:llvm-dev at lists.llvm.org>>> wrote:
> > > > >
> > > > >
> > > > > On 20.12.18 18:03, Connor Abbott wrote:
> > > > > > We already have the notion of "convergent"
> functions like
> > > > > > syncthreads(), to which we cannot add control-flow
> > dependencies.
> > > > > > That is, it's legal to hoist syncthreads out of
> an "if",
> > but it's
> > > > > > not legal to sink it into an "if". It's not
> clear to me
> > why we
> > > > > > can't have "anticonvergent" (terrible name) functions
> > which cannot
> > > > > > have control-flow dependencies removed from them?
> > ballot() would be
> > > > > > both convergent and anticonvergent.
> > > > > >
> > > > > > Would that solve your problem?
> > > > > >
> > > > > >
> > > > > > I think it's important to note that we already have
> such an
> > attribute,
> > > > > > although with the opposite sense - it's impossible to
> > remove control
> > > > > > flow dependencies from a call unless you mark it as
> > "speculatable".
> > > > >
> > > > > This isn't actually true. If both sides of an if/else have
> > the same
> > > > > non-speculative function call, it can still be moved
> out of
> > control flow.
> > > > >
> > > > > That's because doing so doesn't change anything at all
> from a
> > > > > single-threaded perspective. Hence why I think we should
> > model the
> > > > > communication between threads honestly.
> > > > >
> > > > >
> > > > > > However, this doesn't prevent
> > > > > >
> > > > > > if (...) {
> > > > > > } else {
> > > > > > }
> > > > > > foo = ballot();
> > > > > >
> > > > > > from being turned into
> > > > > >
> > > > > > if (...) {
> > > > > > foo1 = ballot();
> > > > > > } else {
> > > > > > foo2 = ballot();
> > > > > > }
> > > > > > foo = phi(foo1, foo2)
> > > > > >
> > > > > > and vice versa. We have a "noduplicate" attribute which
> > prevents
> > > > > > transforming the first into the second, but not the
> other
> > way around. Of
> > > > > > course we could keep going this way and add a
> "nocombine"
> > attribute to
> > > > > > complement noduplicate. But even then, there are
> even still
> > problematic
> > > > > > transforms. For example, take this program, which is
> > simplified from a
> > > > > > real game that doesn't work with the AMDGPU backend:
> > > > > >
> > > > > > while (cond1 /* uniform */) {
> > > > > > ballot();
> > > > > > ...
> > > > > > if (cond2 /* non-uniform */) continue;
> > > > > > ...
> > > > > > }
> > > > > >
> > > > > > In SPIR-V, when using structured control flow, the
> > semantics of this are
> > > > > > pretty clearly defined. In particular, there's a
> continue
> > block after
> > > > > > the body of the loop where control flow
> re-converges, and
> > the only back
> > > > > > edge is from the continue block, so the ballot is in
> > uniform control
> > > > > > flow. But LLVM will get rid of the continue block since
> > it's empty, and
> > > > > > re-analyze the loop as two nested loops, splitting
> the loop
> > header in
> > > > > > two, producing a CFG which corresponds to this:
> > > > > >
> > > > > > while (cond1 /* uniform */) {
> > > > > > do {
> > > > > > ballot();
> > > > > > ...
> > > > > > } while (cond2 /* non-uniform */);
> > > > > > ...
> > > > > > }
> > > > > >
> > > > > > Now, in an implementation where control flow
> re-converges
> > at the
> > > > > > immediate post-dominator, this won't do the right thing
> > anymore. In
> > > > > > order to handle it correctly, you'd effectively need to
> > always flatten
> > > > > > nested loops, which will probably be really bad for
> > performance if the
> > > > > > programmer actually wanted the second thing. It also
> makes
> > it impossible
> > > > > > when translating a high-level language to LLVM to
> get the
> > "natural"
> > > > > > behavior which game developers actually expect. This is
> > exactly the sort
> > > > > > of "spooky action at a distance" which makes me
> think that
> > everything
> > > > > > we've done so far is really insufficient, and we need to
> > add an explicit
> > > > > > notion of control-flow divergence and reconvergence
> to the
> > IR. We need a
> > > > > > way to say that control flow re-converges at the
> continue
> > block, so that
> > > > > > LLVM won't eliminate it, and we can vectorize it
> correctly
> > without
> > > > > > penalizing cases where it's better for control flow
> not to
> > re-converge.
> > > > >
> > > > > Well said!
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Nicolai
> > > > > --
> > > > > Lerne, wie die Welt wirklich ist,
> > > > > Aber vergiss niemals, wie sie sein sollte.
> > > > > _______________________________________________
> > > > > LLVM Developers mailing list
> > > > > llvm-dev at lists.llvm.org
> <mailto:llvm-dev at lists.llvm.org> <mailto:llvm-dev at lists.llvm.org
> <mailto:llvm-dev at lists.llvm.org>>
> > > > > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >
>
>
> --
> Lerne, wie die Welt wirklich ist,
> Aber vergiss niemals, wie sie sein sollte.
>
--
Lerne, wie die Welt wirklich ist,
Aber vergiss niemals, wie sie sein sollte.
More information about the llvm-dev
mailing list