<html><head></head><body><div class="ydp3c5fcfc6yahoo-style-wrap" style="font-family:Helvetica Neue, Helvetica, Arial, sans-serif;font-size:13px;"><div></div>
<div><span><div>> The reason I'm looking for solutions that can work without "scanning the </div><div>> code" or "spooky action at a distance" is that we should have a solution </div><div>> that's easily digestible by folks who are not aware of GPU execution models.</div><div>></div><div>> The fallback model for such folks should be: if your pass doesn't touch </div><div>> specially-marked calls (whether that marker is "convergent" or something </div><div>> else), then you don't have to worry. Which, as a corollary, means that </div><div>> transforms can be conservatively correct by not touching such calls.</div><div>></div><div>> I don't see how to achieve this goal while forcing passes to do specific </div><div>> code scans.</div><div><br></div><div>I'm wondering if using instructions/intrinsics is the right thing, or</div><div>is it a property of a basic block that control flow converges? Can the</div><div>intrinsics exist anywhere in a basic block? Is there a difference</div><div>between instructions in the same BB before and after a convergence</div><div>instrinsic?</div><div><br></div><div>If the IR will be multi-threaded by default (otherwise a flag for a</div><div>module would be needed), then I'm thinking that we would need a more</div><div>fundamental change to the IR to represent this, especially since</div><div>convergence can be specified without the presence of cross-lane</div><div>functions.</div><div><br></div><div>>>> It's certainly interesting to think about how to maintain correctness</div><div>>>> in the face of ballots() with such a pass, but a) it's certainly no</div><div>>>> harder with merge intrinsics than merges being implicit and b) I doubt</div><div>>>> that's useful for anything you'd want to do with a GPU.</div><div>>> </div><div>>> Irreducible control flow has to be handled somehow, and linearization</div><div>>> is the only transform I know of, that will handle everything. I'm not</div><div>>> sure what the execution model says about irreducible control flow.</div><div>></div><div>> For what it's worth, the reconverging CFG approach can also handle </div><div>> arbitrary irreducible control flow.</div><div>></div><div>> Whether (or how) it can do this while being compatible with whatever </div><div>> semantics we come up with for cross-lane operations remains to be seen.</div><div><br></div><div>I am wondering about execution order and if that is part of the specification?</div><div>For example:</div><div><br></div><div>if (A) {</div><div>L1:</div><div> ballot();</div><div> if (B)</div><div> goto L2;</div><div> else </div><div> goto END:</div><div>} else {</div><div>L2:</div><div> ballot();</div><div> if (C)</div><div> goto L1;</div><div> else</div><div> goto END;</div><div>}</div><div>END:</div><div><br></div><div>Should the L1 ballot be executed first, or the L2, or does it not</div><div>matter? I'm not sure what the ballots should return. Would swapping</div><div>the branches from A (use !A instead of A) be a legal transform?</div><div><br></div><div>- Jan</div><div><br></div></span><br></div><div><br></div>
</div><div id="yahoo_quoted_9925012874" class="yahoo_quoted">
<div style="font-family:'Helvetica Neue', Helvetica, Arial, sans-serif;font-size:13px;color:#26282a;">
<div>
On Friday, February 1, 2019, 4:57:54 PM EST, Nicolai Hähnle <nhaehnle@gmail.com> wrote:
</div>
<div><br></div>
<div><br></div>
<div><div dir="ltr">On 31.01.19 15:59, Jan Sjodin wrote:<br></div><div dir="ltr">>> > Any transform that re-arranges control flow would potentially have to<br></div><div dir="ltr">>> > know about the properties of ballot(), and the rules with respect to<br></div><div dir="ltr">>> > the CFG (and maybe consider the target) to know where to insert the<br></div><div dir="ltr">>> > intrinsics.<br></div><div dir="ltr">> <br></div><div dir="ltr">>> But the same is true for basically any approach to handling this. In<br></div><div dir="ltr">>> fact, adding the merge intrinsics makes this much easier. Beyond the<br></div><div dir="ltr">>> usual problems with hoisting ballots, which passes currently avoid via<br></div><div dir="ltr">>> the current convergent + speculatable attributes, we'd only have to<br></div><div dir="ltr">>> add awareness to passes that they can't duplicate/combine merge<br></div><div dir="ltr">>> intrinsics or move them past convergent intrinsics, which is a local<br></div><div dir="ltr">>> property and hence easily checkable. In example I explained, without<br></div><div dir="ltr">>> some kind of merge intrinsic, tail duplication has to look at the<br></div><div dir="ltr">>> entire loop to know whether the transform is legal. Of course, you<br></div><div dir="ltr">>> could have some kind of "no convergent calls" flag on a function, but<br></div><div dir="ltr">>> that doesn't eliminate the nastyness when there are convergent calls.<br></div><div dir="ltr">> <br></div><div dir="ltr">> We will have to determine if the intrinsics are worth more compared to<br></div><div dir="ltr">> scanning the code.<br></div><div dir="ltr"><br></div><div dir="ltr">The reason I'm looking for solutions that can work without "scanning the <br></div><div dir="ltr">code" or "spooky action at a distance" is that we should have a solution <br></div><div dir="ltr">that's easily digestible by folks who are not aware of GPU execution models.<br></div><div dir="ltr"><br></div><div dir="ltr">The fallback model for such folks should be: if your pass doesn't touch <br></div><div dir="ltr">specially-marked calls (whether that marker is "convergent" or something <br></div><div dir="ltr">else), then you don't have to worry. Which, as a corollary, means that <br></div><div dir="ltr">transforms can be conservatively correct by not touching such calls.<br></div><div dir="ltr"><br></div><div dir="ltr">I don't see how to achieve this goal while forcing passes to do specific <br></div><div dir="ltr">code scans.<br></div><div dir="ltr"><br></div><div dir="ltr"><br></div><div dir="ltr">>> > I had the impression that the control flow convergence was<br></div><div dir="ltr">>> > in part specified by what the target architecture can handle.<br></div><div dir="ltr">> <br></div><div dir="ltr">>> This is true, although hopefully we can agree on something that<br></div><div dir="ltr">>> everyone can implement.<br></div><div dir="ltr">> <br></div><div dir="ltr">> Yes, hopefully there is an execution model that is works for everyone.<br></div><div dir="ltr">> <br></div><div dir="ltr">>> > One of<br></div><div dir="ltr">>> > the more complicated cases would be linearization where the control<br></div><div dir="ltr">>> > flow is completely rewritten, and is encoded in a variable that says<br></div><div dir="ltr">>> > which basic block is the next one to execute.<br></div><div dir="ltr">> <br></div><div dir="ltr">>> It's certainly interesting to think about how to maintain correctness<br></div><div dir="ltr">>> in the face of ballots() with such a pass, but a) it's certainly no<br></div><div dir="ltr">>> harder with merge intrinsics than merges being implicit and b) I doubt<br></div><div dir="ltr">>> that's useful for anything you'd want to do with a GPU.<br></div><div dir="ltr">> <br></div><div dir="ltr">> Irreducible control flow has to be handled somehow, and linearization<br></div><div dir="ltr">> is the only transform I know of, that will handle everything. I'm not<br></div><div dir="ltr">> sure what the execution model says about irreducible control flow.<br></div><div dir="ltr"><br></div><div dir="ltr">For what it's worth, the reconverging CFG approach can also handle <br></div><div dir="ltr">arbitrary irreducible control flow.<br></div><div dir="ltr"><br></div><div dir="ltr">Whether (or how) it can do this while being compatible with whatever <br></div><div dir="ltr">semantics we come up with for cross-lane operations remains to be seen.<br></div><div dir="ltr"><br></div><div dir="ltr"><br></div><div dir="ltr">>> > Another case is DCE,<br></div><div dir="ltr">>> > where a ballot() could be eliminated, and it would potentially have to<br></div><div dir="ltr">>> > remove a number of intrinsics to enable later optimizations (unless it<br></div><div dir="ltr">>> > would affect performance?), so the intrinsics will require some<br></div><div dir="ltr">>> > non-local updates.<br></div><div dir="ltr">> <br></div><div dir="ltr">>> Removing merge intrinsics if it's profitable and allowed is a<br></div><div dir="ltr">>> separate optimization, that can be done relatively quickly in a single<br></div><div dir="ltr">>> pass without impacting the performance of other optimizations. It's<br></div><div dir="ltr">>> requiring expensive non-local checks in optimizations which modify<br></div><div dir="ltr">>> control flow that's a problem.<br></div><div dir="ltr">> <br></div><div dir="ltr">> There are certainly a lot of tradeoffs. My point was simply that the<br></div><div dir="ltr">> intrinsics are not strictly local.<br></div><div dir="ltr"><br></div><div dir="ltr">Right. I have some ideas for how to fix this, but wasn't able to work on <br></div><div dir="ltr">them for the last weeks.<br></div><div dir="ltr"><br></div><div dir="ltr">Cheers,<br></div><div dir="ltr">Nicolai<br></div><div dir="ltr"><br></div><div dir="ltr"><br></div><div dir="ltr">>> > So we might have them without a ballot(), which would seem to make it<br></div><div dir="ltr">>> > more difficult for structurizers or other transforms to maintain the<br></div><div dir="ltr">>> > semantics and insert intrinsics.<br></div><div dir="ltr">> <br></div><div dir="ltr">>> It's not any more difficult code-wise, since the case where there is<br></div><div dir="ltr">>> a ballot needs to be handled anyways. And while it might take longer<br></div><div dir="ltr">>> to process stuff when this hypothetical pass encounters a merge<br></div><div dir="ltr">>> intrinsic (I can't think of a real-world case where it would matter),<br></div><div dir="ltr">>> it should result in better code generated.<br></div><div dir="ltr">> <br></div><div dir="ltr">> I was thinking of linearization or something similar, where an<br></div><div dir="ltr">> instrinsic by itself might be hard to preserve, compared to looking at<br></div><div dir="ltr">> a ballot() and derive where intrinsics should be inserted.<br></div><div dir="ltr">> <br></div><div dir="ltr">>> > > > How are uniform branches handled? Do they affect the convergence model?<br></div><div dir="ltr">>> > > ><br></div><div dir="ltr">>> > > We may be able to remove convergence points if branches are uniform.<br></div><div dir="ltr">>> > > In Nicolai's proposal, I believe we'd want to remove a merge intrinsic<br></div><div dir="ltr">>> > > when all the branches that it post-dominates that aren't also<br></div><div dir="ltr">>> > > post-dominated by some other merge intrinsic are uniform.<br></div><div dir="ltr">>> ><br></div><div dir="ltr">>> > I couldn't quite understand the last sentence, but I assume the<br></div><div dir="ltr">>> > conditions would prevent removing convergence points that help<br></div><div dir="ltr">>> > performance. Post-domination might not be adequate if there are loops<br></div><div dir="ltr">>> > btw.<br></div><div dir="ltr">>> <br></div><div dir="ltr">>> It should be -- Nicolai's proposal is that a merge intrinsic merges<br></div><div dir="ltr">>> all the divergences caused by branches post dominated by the<br></div><div dir="ltr">>> intrinsic, so if all the divergent branches are merged by some other<br></div><div dir="ltr">>> intrinsic earlier, then there's no divergence and the merge intrinsic<br></div><div dir="ltr">>> is a no-op.<br></div><div dir="ltr">> <br></div><div dir="ltr">> Makes sense.<br></div><div dir="ltr">> <br></div><div dir="ltr">> - Jan<br></div><div dir="ltr">> On Wednesday, January 30, 2019, 11:41:29 AM EST, Connor Abbott <br></div><div dir="ltr">> <<a ymailto="mailto:cwabbott0@gmail.com" href="mailto:cwabbott0@gmail.com">cwabbott0@gmail.com</a>> wrote:<br></div><div dir="ltr">> <br></div><div dir="ltr">> <br></div><div dir="ltr">> On Wed, Jan 30, 2019 at 4:20 PM Jan Sjodin <<a ymailto="mailto:jan_sjodin@yahoo.com" href="mailto:jan_sjodin@yahoo.com">jan_sjodin@yahoo.com</a> <br></div><div dir="ltr">> <mailto:<a ymailto="mailto:jan_sjodin@yahoo.com" href="mailto:jan_sjodin@yahoo.com">jan_sjodin@yahoo.com</a>>> wrote:<br></div><div dir="ltr">> <br></div><div dir="ltr">> <br></div><div dir="ltr">> ><br></div><div dir="ltr">> > > > for (int i = 0; i < 2; i++) {<br></div><div dir="ltr">> > > > foo = ballot(true); // ballot 1<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > if (threadID /* ID of the thread within a wavefront/warp */ % 2 == 0) continue;<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > bar = ballot(true); // ballot 2<br></div><div dir="ltr">> > > > }<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > versus:<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > int i = 0;<br></div><div dir="ltr">> > > > while (true) {<br></div><div dir="ltr">> > > > do {<br></div><div dir="ltr">> > > > if (i == 2) break;<br></div><div dir="ltr">> > > > foo = ballot(true); // ballot 1<br></div><div dir="ltr">> > > > i++;<br></div><div dir="ltr">> > > > } while (threadID % 2 == 0);<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > if (i == 2) break;<br></div><div dir="ltr">> > > > bar = ballot(true); // ballot 2<br></div><div dir="ltr">> > > > i++;<br></div><div dir="ltr">> > > > }<br></div><div dir="ltr">> > ><br></div><div dir="ltr">> > > I think you can remove the second "i++", otherwise we can increment "i" twice<br></div><div dir="ltr">> > > if threadID % 2 != 0, but I see the issue. Using the equivalence classes would<br></div><div dir="ltr">> > > prevent this transform, since we have re-arranged the control flow in that way,<br></div><div dir="ltr">> ><br></div><div dir="ltr">> > What do you mean? Note that the transforms that lead from the first<br></div><div dir="ltr">> > example to the second are actually desirable if there aren't any<br></div><div dir="ltr">> > ballots or other convergent operations, so banning them entirely is a<br></div><div dir="ltr">> > no-go. Otherwise, note that the ballot here can be nested arbitrarily<br></div><div dir="ltr">> > deep, which means that jump forwarding/tail duplication has to be<br></div><div dir="ltr">> > aware of the entire loop unless we have some kind of merge intrinsic.<br></div><div dir="ltr">> <br></div><div dir="ltr">> Yes, if there weren't any calls to a ballot, then the transform would<br></div><div dir="ltr">> be legal. What I was saying in the beginning was that ballot() would<br></div><div dir="ltr">> have some special rules attached to it. It is of course undesirable to<br></div><div dir="ltr">> have flags to enforce correctness, but that is the point we are at<br></div><div dir="ltr">> right now.<br></div><div dir="ltr">> <br></div><div dir="ltr">> > Also, I should say that while interesting, control equivalence classes<br></div><div dir="ltr">> > can be part of the *implementation*, but not the *specification*. We<br></div><div dir="ltr">> > need<br></div><div dir="ltr">> > to define the semantics of the IR -- that is, how a program<br></div><div dir="ltr">> > compiled from any given IR is allowed to do when executed (and *not*<br></div><div dir="ltr">> > what it looks like when compiled) -- *first*, and then which<br></div><div dir="ltr">> > transforms are allowed/not allowed will fall out of that. We can't<br></div><div dir="ltr">> > start by listing disallowed transforms, because then when someone<br></div><div dir="ltr">> > comes along and writes a new optimization, it might be technically<br></div><div dir="ltr">> > "allowed" even though it breaks something in practice.<br></div><div dir="ltr">> <br></div><div dir="ltr">> I think you misunderstand me if you think I was listing disallowed<br></div><div dir="ltr">> transforms. My question was if it is possible to have multi-threaded<br></div><div dir="ltr">> semantics in the source language, but in the compiler have a<br></div><div dir="ltr">> single-threaded view, where some properties of the CFG would determine<br></div><div dir="ltr">> what is legal and not for some functions with a special flag.<br></div><div dir="ltr">> <br></div><div dir="ltr">> <br></div><div dir="ltr">> But that's practically the same as listing allowed and disallowed <br></div><div dir="ltr">> transforms -- you're defining what the final IR is allowed to look like, <br></div><div dir="ltr">> not how it's allowed to execute. If at all possible, the former should <br></div><div dir="ltr">> be derived from the latter.<br></div><div dir="ltr">> <br></div><div dir="ltr">> I agree<br></div><div dir="ltr">> it is more desirable to have the the semantics specified in the<br></div><div dir="ltr">> IR. However, I am exploring this from a practical point of view, since<br></div><div dir="ltr">> ballot() is very rare compared to all code that is being compiled for<br></div><div dir="ltr">> all targets. These intrinsics would always have to be considered when<br></div><div dir="ltr">> writing a pass. They seem to me harder to think about, and<br></div><div dir="ltr">> test, for someone who is working on a single-threaded target, compared<br></div><div dir="ltr">> to looking at a flag. If we allow ourselves to look at which<br></div><div dir="ltr">> transforms that might violate these properties, we could would free up<br></div><div dir="ltr">> the rest of the compiler (and developers) to not have to worry about<br></div><div dir="ltr">> these things. Intrinsics would have to be maintained throughout the<br></div><div dir="ltr">> entire compilation process in every pass.<br></div><div dir="ltr">> <br></div><div dir="ltr">> > The only way to<br></div><div dir="ltr">> > conclusively prove that transforms will never break code that's<br></div><div dir="ltr">> > supposed to work (except for bugs in the transforms) is to define the<br></div><div dir="ltr">> > semantics of the IR, and then to make sure that it refines the<br></div><div dir="ltr">> > semantics of the source language. This is why the LangRef almost never<br></div><div dir="ltr">> > talks about allowed/disallowed transforms except as examples to<br></div><div dir="ltr">> > explain some given semantics, and if you don't follow that rule, your<br></div><div dir="ltr">> > patch will probably be rejected.<br></div><div dir="ltr">> <br></div><div dir="ltr">> I'm trying to figure out if we are in the "almost never" territory<br></div><div dir="ltr">> here or not.<br></div><div dir="ltr">> <br></div><div dir="ltr">> <br></div><div dir="ltr">> I don't think so at all. In addition to being much, much harder to <br></div><div dir="ltr">> reason about and prove that the approach is sound, I think it'll be more <br></div><div dir="ltr">> intrusive.<br></div><div dir="ltr">> <br></div><div dir="ltr">> <br></div><div dir="ltr">> > Now, to define a semantics for ballot(), we need to define what it's<br></div><div dir="ltr">> > allowed to return for any given execution of any given program, and to<br></div><div dir="ltr">> > do that, we need to define which threads must be active together when<br></div><div dir="ltr">> > it's reached, which in turns means we need to define when control flow<br></div><div dir="ltr">> > can re-converge somehow. Any proposal for how to handle ballot() must<br></div><div dir="ltr">> > start here.<br></div><div dir="ltr">> <br></div><div dir="ltr">> > > I'm not sure if using these rules will be easier or harder than dealing with<br></div><div dir="ltr">> > > intrinsics. One problem is that the rules for single-threaded code might seem<br></div><div dir="ltr">> > > arbitrary, and it would be hard to reason about them in a larger context.<br></div><div dir="ltr">> <br></div><div dir="ltr">> > > What happens to new control flow created by transforms, and what will guide<br></div><div dir="ltr">> > > the insertion of intrinsics in the new code? Code structurization is one example<br></div><div dir="ltr">> > > were this could happen.<br></div><div dir="ltr">> <br></div><div dir="ltr">> > Hopefully, it's clear from the above how this all falls out. The<br></div><div dir="ltr">> > frontend for e.g. GLSL or SPIR-V would have to insert the merge<br></div><div dir="ltr">> > intrinsics to preserve the semantics of the source language. Any<br></div><div dir="ltr">> > transforms must refine the semantics of the IR, although I can't think<br></div><div dir="ltr">> > of a scenario where that would involve emitting any new merge<br></div><div dir="ltr">> > intrinsics. Usually, structurized IR's have their own semantics about<br></div><div dir="ltr">> > when control flow converges, so a code structurizer should respect the<br></div><div dir="ltr">> > original semantics. AMDGPU has its own code structurizer that runs<br></div><div dir="ltr">> > very late in the pipeline (although there are plans to remove it), so<br></div><div dir="ltr">> > we might have to change that to make it respect the merge intrinsic<br></div><div dir="ltr">> > intrinsics, and then we'll correctly implement them "for free".<br></div><div dir="ltr">> <br></div><div dir="ltr">> Any transform that re-arranges control flow would potentially have to<br></div><div dir="ltr">> know about the properties of ballot(), and the rules with respect to<br></div><div dir="ltr">> the CFG (and maybe consider the target) to know where to insert the<br></div><div dir="ltr">> intrinsics.<br></div><div dir="ltr">> <br></div><div dir="ltr">> <br></div><div dir="ltr">> But the same is true for basically any approach to handling this. In <br></div><div dir="ltr">> fact, adding the merge intrinsics makes this much easier. Beyond the <br></div><div dir="ltr">> usual problems with hoisting ballots, which passes currently avoid via <br></div><div dir="ltr">> the current convergent + speculatable attributes, we'd only have to add <br></div><div dir="ltr">> awareness to passes that they can't duplicate/combine merge intrinsics <br></div><div dir="ltr">> or move them past convergent intrinsics, which is a local property and <br></div><div dir="ltr">> hence easily checkable. In example I explained, without some kind of <br></div><div dir="ltr">> merge intrinsic, tail duplication has to look at the entire loop to know <br></div><div dir="ltr">> whether the transform is legal. Of course, you could have some kind of <br></div><div dir="ltr">> "no convergent calls" flag on a function, but that doesn't eliminate the <br></div><div dir="ltr">> nastyness when there are convergent calls.<br></div><div dir="ltr">> <br></div><div dir="ltr">> I had the impression that the control flow convergence was<br></div><div dir="ltr">> in part specified by what the target architecture can handle.<br></div><div dir="ltr">> <br></div><div dir="ltr">> <br></div><div dir="ltr">> This is true, although hopefully we can agree on something that everyone <br></div><div dir="ltr">> can implement.<br></div><div dir="ltr">> <br></div><div dir="ltr">> One of<br></div><div dir="ltr">> the more complicated cases would be linearization where the control<br></div><div dir="ltr">> flow is completely rewritten, and is encoded in a variable that says<br></div><div dir="ltr">> which basic block is the next one to execute.<br></div><div dir="ltr">> <br></div><div dir="ltr">> <br></div><div dir="ltr">> It's certainly interesting to think about how to maintain correctness in <br></div><div dir="ltr">> the face of ballots() with such a pass, but a) it's certainly no harder <br></div><div dir="ltr">> with merge intrinsics than merges being implicit and b) I doubt that's <br></div><div dir="ltr">> useful for anything you'd want to do with a GPU.<br></div><div dir="ltr">> <br></div><div dir="ltr">> Another case is DCE,<br></div><div dir="ltr">> where a ballot() could be eliminated, and it would potentially have to<br></div><div dir="ltr">> remove a number of intrinsics to enable later optimizations (unless it<br></div><div dir="ltr">> would affect performance?), so the intrinsics will require some<br></div><div dir="ltr">> non-local updates.<br></div><div dir="ltr">> <br></div><div dir="ltr">> <br></div><div dir="ltr">> Removing merge intrinsics if it's profitable and allowed is a separate <br></div><div dir="ltr">> optimization, that can be done relatively quickly in a single pass <br></div><div dir="ltr">> without impacting the performance of other optimizations. It's requiring <br></div><div dir="ltr">> expensive non-local checks in optimizations which modify control flow <br></div><div dir="ltr">> that's a problem.<br></div><div dir="ltr">> <br></div><div dir="ltr">> <br></div><div dir="ltr">> > > > Would they only be present if ballot and similar functions are used, or do they<br></div><div dir="ltr">> > > > have to be present everywhere?<br></div><div dir="ltr">> > ><br></div><div dir="ltr">> > > They'd only have to be present when ballot or other convergent<br></div><div dir="ltr">> > > functions are called, since otherwise it doesn't matter when control<br></div><div dir="ltr">> > flow re-converges. However, we may want to keep them around for<br></div><div dir="ltr">> > performance reasons (usually destroying convergence points is bad for<br></div><div dir="ltr">> > performance).<br></div><div dir="ltr">> <br></div><div dir="ltr">> So we might have them without a ballot(), which would seem to make it<br></div><div dir="ltr">> more difficult for structurizers or other transforms to maintain the<br></div><div dir="ltr">> semantics and insert intrinsics.<br></div><div dir="ltr">> <br></div><div dir="ltr">> <br></div><div dir="ltr">> It's not any more difficult code-wise, since the case where there is a <br></div><div dir="ltr">> ballot needs to be handled anyways. And while it might take longer to <br></div><div dir="ltr">> process stuff when this hypothetical pass encounters a merge intrinsic <br></div><div dir="ltr">> (I can't think of a real-world case where it would matter), it should <br></div><div dir="ltr">> result in better code generated.<br></div><div dir="ltr">> <br></div><div dir="ltr">> <br></div><div dir="ltr">> > > How are uniform branches handled? Do they affect the convergence model?<br></div><div dir="ltr">> > ><br></div><div dir="ltr">> > We may be able to remove convergence points if branches are uniform.<br></div><div dir="ltr">> > In Nicolai's proposal, I believe we'd want to remove a merge intrinsic<br></div><div dir="ltr">> > when all the branches that it post-dominates that aren't also<br></div><div dir="ltr">> > post-dominated by some other merge intrinsic are uniform.<br></div><div dir="ltr">> <br></div><div dir="ltr">> I couldn't quite understand the last sentence, but I assume the<br></div><div dir="ltr">> conditions would prevent removing convergence points that help<br></div><div dir="ltr">> performance. Post-domination might not be adequate if there are loops<br></div><div dir="ltr">> btw.<br></div><div dir="ltr">> <br></div><div dir="ltr">> <br></div><div dir="ltr">> It should be -- Nicolai's proposal is that a merge intrinsic merges all <br></div><div dir="ltr">> the divergences caused by branches post dominated by the intrinsic, so <br></div><div dir="ltr">> if all the divergent branches are merged by some other intrinsic <br></div><div dir="ltr">> earlier, then there's no divergence and the merge intrinsic is a no-op.<br></div><div dir="ltr">> <br></div><div dir="ltr">> <br></div><div dir="ltr">> - Jan<br></div><div dir="ltr">> <br></div><div dir="ltr">> <br></div><div dir="ltr">> <br></div><div dir="ltr">> On Wednesday, January 30, 2019, 6:29:52 AM EST, Connor Abbott<br></div><div dir="ltr">> <<a ymailto="mailto:cwabbott0@gmail.com" href="mailto:cwabbott0@gmail.com">cwabbott0@gmail.com</a> <mailto:<a ymailto="mailto:cwabbott0@gmail.com" href="mailto:cwabbott0@gmail.com">cwabbott0@gmail.com</a>>> wrote:<br></div><div dir="ltr">> <br></div><div dir="ltr">> <br></div><div dir="ltr">> On Mon, Jan 28, 2019 at 9:09 PM Jan Sjodin <<a ymailto="mailto:jan_sjodin@yahoo.com" href="mailto:jan_sjodin@yahoo.com">jan_sjodin@yahoo.com</a><br></div><div dir="ltr">> <mailto:<a ymailto="mailto:jan_sjodin@yahoo.com" href="mailto:jan_sjodin@yahoo.com">jan_sjodin@yahoo.com</a>>> wrote:<br></div><div dir="ltr">> ><br></div><div dir="ltr">> > > for (int i = 0; i < 2; i++) {<br></div><div dir="ltr">> > > foo = ballot(true); // ballot 1<br></div><div dir="ltr">> > ><br></div><div dir="ltr">> > > if (threadID /* ID of the thread within a wavefront/warp */<br></div><div dir="ltr">> % 2 == 0) continue;<br></div><div dir="ltr">> > ><br></div><div dir="ltr">> > > bar = ballot(true); // ballot 2<br></div><div dir="ltr">> > > }<br></div><div dir="ltr">> > ><br></div><div dir="ltr">> > > versus:<br></div><div dir="ltr">> > ><br></div><div dir="ltr">> > > int i = 0;<br></div><div dir="ltr">> > > while (true) {<br></div><div dir="ltr">> > > do {<br></div><div dir="ltr">> > > if (i == 2) break;<br></div><div dir="ltr">> > > foo = ballot(true); // ballot 1<br></div><div dir="ltr">> > > i++;<br></div><div dir="ltr">> > > } while (threadID % 2 == 0);<br></div><div dir="ltr">> > ><br></div><div dir="ltr">> > > if (i == 2) break;<br></div><div dir="ltr">> > > bar = ballot(true); // ballot 2<br></div><div dir="ltr">> > > i++;<br></div><div dir="ltr">> > > }<br></div><div dir="ltr">> ><br></div><div dir="ltr">> > I think you can remove the second "i++", otherwise we can<br></div><div dir="ltr">> increment "i" twice<br></div><div dir="ltr">> > if threadID % 2 != 0, but I see the issue. Using the equivalence<br></div><div dir="ltr">> classes would<br></div><div dir="ltr">> > prevent this transform, since we have re-arranged the control<br></div><div dir="ltr">> flow in that way,<br></div><div dir="ltr">> <br></div><div dir="ltr">> What do you mean? Note that the transforms that lead from the first<br></div><div dir="ltr">> example to the second are actually desirable if there aren't any<br></div><div dir="ltr">> ballots or other convergent operations, so banning them entirely is a<br></div><div dir="ltr">> no-go. Otherwise, note that the ballot here can be nested arbitrarily<br></div><div dir="ltr">> deep, which means that jump forwarding/tail duplication has to be<br></div><div dir="ltr">> aware of the entire loop unless we have some kind of merge intrinsic.<br></div><div dir="ltr">> <br></div><div dir="ltr">> Also, I should say that while interesting, control equivalence classes<br></div><div dir="ltr">> can be part of the *implementation*, but not the *specification*. We<br></div><div dir="ltr">> need to define the semantics of the IR -- that is, how a program<br></div><div dir="ltr">> compiled from any given IR is allowed to do when executed (and *not*<br></div><div dir="ltr">> what it looks like when compiled) -- *first*, and then which<br></div><div dir="ltr">> transforms are allowed/not allowed will fall out of that. We can't<br></div><div dir="ltr">> start by listing disallowed transforms, because then when someone<br></div><div dir="ltr">> comes along and writes a new optimization, it might be technically<br></div><div dir="ltr">> "allowed" even though it breaks something in practice. The only way to<br></div><div dir="ltr">> conclusively prove that transforms will never break code that's<br></div><div dir="ltr">> supposed to work (except for bugs in the transforms) is to define the<br></div><div dir="ltr">> semantics of the IR, and then to make sure that it refines the<br></div><div dir="ltr">> semantics of the source language. This is why the LangRef almost never<br></div><div dir="ltr">> talks about allowed/disallowed transforms except as examples to<br></div><div dir="ltr">> explain some given semantics, and if you don't follow that rule, your<br></div><div dir="ltr">> patch will probably be rejected.<br></div><div dir="ltr">> <br></div><div dir="ltr">> Now, to define a semantics for ballot(), we need to define what it's<br></div><div dir="ltr">> allowed to return for any given execution of any given program, and to<br></div><div dir="ltr">> do that, we need to define which threads must be active together when<br></div><div dir="ltr">> it's reached, which in turns means we need to define when control flow<br></div><div dir="ltr">> can re-converge somehow. Any proposal for how to handle ballot() must<br></div><div dir="ltr">> start here.<br></div><div dir="ltr">> <br></div><div dir="ltr">> > I'm not sure if using these rules will be easier or harder than<br></div><div dir="ltr">> dealing with<br></div><div dir="ltr">> > intrinsics. One problem is that the rules for single-threaded<br></div><div dir="ltr">> code might seem<br></div><div dir="ltr">> > arbitrary, and it would be hard to reason about them in a larger<br></div><div dir="ltr">> context.<br></div><div dir="ltr">> ><br></div><div dir="ltr">> > > Nicolai's proposal solves this by having the frontend emit a<br></div><div dir="ltr">> merge intrinsic<br></div><div dir="ltr">> > > before the i++ is emitted. This prevents the jump forwarding<br></div><div dir="ltr">> from occurring.<br></div><div dir="ltr">> ><br></div><div dir="ltr">> > I was thinking about getting through the single-thread view and<br></div><div dir="ltr">> the issues with<br></div><div dir="ltr">> > that first, then I will think more about the multi-thread and<br></div><div dir="ltr">> explicit convergence.<br></div><div dir="ltr">> ><br></div><div dir="ltr">> > If we are done with the single-thread stuff for now these are the<br></div><div dir="ltr">> question that I<br></div><div dir="ltr">> > have been thinking about with the multi-threaded view:<br></div><div dir="ltr">> ><br></div><div dir="ltr">> > What happens to new control flow created by transforms, and what<br></div><div dir="ltr">> will guide<br></div><div dir="ltr">> > the insertion of intrinsics in the new code? Code structurization<br></div><div dir="ltr">> is one example<br></div><div dir="ltr">> > were this could happen.<br></div><div dir="ltr">> <br></div><div dir="ltr">> Hopefully, it's clear from the above how this all falls out. The<br></div><div dir="ltr">> frontend for e.g. GLSL or SPIR-V would have to insert the merge<br></div><div dir="ltr">> intrinsics to preserve the semantics of the source language. Any<br></div><div dir="ltr">> transforms must refine the semantics of the IR, although I can't think<br></div><div dir="ltr">> of a scenario where that would involve emitting any new merge<br></div><div dir="ltr">> intrinsics. Usually, structurized IR's have their own semantics about<br></div><div dir="ltr">> when control flow converges, so a code structurizer should respect the<br></div><div dir="ltr">> original semantics. AMDGPU has its own code structurizer that runs<br></div><div dir="ltr">> very late in the pipeline (although there are plans to remove it), so<br></div><div dir="ltr">> we might have to change that to make it respect the merge intrinsic<br></div><div dir="ltr">> intrinsics, and then we'll correctly implement them "for free".<br></div><div dir="ltr">> <br></div><div dir="ltr">> ><br></div><div dir="ltr">> > Would they only be present if ballot and similar functions are<br></div><div dir="ltr">> used, or do they<br></div><div dir="ltr">> > have to be present everywhere?<br></div><div dir="ltr">> <br></div><div dir="ltr">> They'd only have to be present when ballot or other convergent<br></div><div dir="ltr">> functions are called, since otherwise it doesn't matter when control<br></div><div dir="ltr">> flow re-converges. However, we may want to keep them around for<br></div><div dir="ltr">> performance reasons (usually destroying convergence points is bad for<br></div><div dir="ltr">> performance).<br></div><div dir="ltr">> <br></div><div dir="ltr">> ><br></div><div dir="ltr">> > How are uniform branches handled? Do they affect the convergence<br></div><div dir="ltr">> model?<br></div><div dir="ltr">> <br></div><div dir="ltr">> We may be able to remove convergence points if branches are uniform.<br></div><div dir="ltr">> In Nicolai's proposal, I believe we'd want to remove a merge intrinsic<br></div><div dir="ltr">> when all the branches that it post-dominates that aren't also<br></div><div dir="ltr">> post-dominated by some other merge intrinsic are uniform.<br></div><div dir="ltr">> <br></div><div dir="ltr">> <br></div><div dir="ltr">> <br></div><div dir="ltr">> ><br></div><div dir="ltr">> ><br></div><div dir="ltr">> > - Jan<br></div><div dir="ltr">> ><br></div><div dir="ltr">> ><br></div><div dir="ltr">> > On Monday, January 28, 2019, 11:16:36 AM EST, Connor Abbott<br></div><div dir="ltr">> <<a ymailto="mailto:cwabbott0@gmail.com" href="mailto:cwabbott0@gmail.com">cwabbott0@gmail.com</a> <mailto:<a ymailto="mailto:cwabbott0@gmail.com" href="mailto:cwabbott0@gmail.com">cwabbott0@gmail.com</a>>> wrote:<br></div><div dir="ltr">> ><br></div><div dir="ltr">> ><br></div><div dir="ltr">> ><br></div><div dir="ltr">> > On Fri, Jan 25, 2019 at 3:05 AM Jan Sjodin <<a ymailto="mailto:jan_sjodin@yahoo.com" href="mailto:jan_sjodin@yahoo.com">jan_sjodin@yahoo.com</a><br></div><div dir="ltr">> <mailto:<a ymailto="mailto:jan_sjodin@yahoo.com" href="mailto:jan_sjodin@yahoo.com">jan_sjodin@yahoo.com</a>>> wrote:<br></div><div dir="ltr">> > ><br></div><div dir="ltr">> > > > for (...) {<br></div><div dir="ltr">> > > > ballot();<br></div><div dir="ltr">> > > > if (... /* non-uniform */) continue;<br></div><div dir="ltr">> > > > }<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > into<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > for (...) {<br></div><div dir="ltr">> > > > do {<br></div><div dir="ltr">> > > > ballot();<br></div><div dir="ltr">> > > > } while (... /* non-uniform */);<br></div><div dir="ltr">> > > > }<br></div><div dir="ltr">> > ><br></div><div dir="ltr">> > > I'm not sure if I follow this example, could you and explain a<br></div><div dir="ltr">> bit more?<br></div><div dir="ltr">> > > It looks to me that the condition in the "if" must be false (if the<br></div><div dir="ltr">> > > same condition is used in the while), or we would<br></div><div dir="ltr">> > > call ballot the wrong number of times.<br></div><div dir="ltr">> ><br></div><div dir="ltr">> > Yes, the idea is that the same condition is used in the if and<br></div><div dir="ltr">> the do-while. I think I messed up the example a little... in the<br></div><div dir="ltr">> second snippet, we're supposed to break out of the inner loop if the<br></div><div dir="ltr">> outer loop's exit condition is true. Here's a more concrete example:<br></div><div dir="ltr">> ><br></div><div dir="ltr">> > for (int i = 0; i < 2; i++) {<br></div><div dir="ltr">> > foo = ballot(true); // ballot 1<br></div><div dir="ltr">> ><br></div><div dir="ltr">> > if (threadID /* ID of the thread within a wavefront/warp */ %<br></div><div dir="ltr">> 2 == 0) continue;<br></div><div dir="ltr">> ><br></div><div dir="ltr">> > bar = ballot(true); // ballot 2<br></div><div dir="ltr">> > }<br></div><div dir="ltr">> ><br></div><div dir="ltr">> > versus:<br></div><div dir="ltr">> ><br></div><div dir="ltr">> > int i = 0;<br></div><div dir="ltr">> > while (true) {<br></div><div dir="ltr">> > do {<br></div><div dir="ltr">> > if (i == 2) break;<br></div><div dir="ltr">> > foo = ballot(true); // ballot 1<br></div><div dir="ltr">> > i++;<br></div><div dir="ltr">> > } while (threadID % 2 == 0);<br></div><div dir="ltr">> ><br></div><div dir="ltr">> > if (i == 2) break;<br></div><div dir="ltr">> > bar = ballot(true); // ballot 2<br></div><div dir="ltr">> > i++;<br></div><div dir="ltr">> > }<br></div><div dir="ltr">> ><br></div><div dir="ltr">> > From a single-threaded perspective, these two snippets are<br></div><div dir="ltr">> identical, even if ballot() writes to arbitrary memory. The first<br></div><div dir="ltr">> can easily get transformed to something like the second when LLVM<br></div><div dir="ltr">> decides to duplicate the final i++ through jump forwarding, and then<br></div><div dir="ltr">> re-interprets the loop as two nested loops and splits the loop<br></div><div dir="ltr">> header in two. This is what currently happens with DOOM when we try<br></div><div dir="ltr">> to enable subgroup operations with it. Let's say there are two<br></div><div dir="ltr">> threads in a wavefront. Then the execution trace mandated by SPIR-V<br></div><div dir="ltr">> for the first looks like:<br></div><div dir="ltr">> ><br></div><div dir="ltr">> > thread 0 | thread 1<br></div><div dir="ltr">> > ballot 1 = 0b11 | ballot 1 = 0b11<br></div><div dir="ltr">> > skipped | ballot 2 = 0b10<br></div><div dir="ltr">> > ballot 1 = 0b11 | ballot 1 = 0b11<br></div><div dir="ltr">> > skipped | ballot 2 = 0b10<br></div><div dir="ltr">> ><br></div><div dir="ltr">> > Now, contrast this with the execution trace that programmers<br></div><div dir="ltr">> would expect for the second example:<br></div><div dir="ltr">> ><br></div><div dir="ltr">> > thread 0 | thread 1<br></div><div dir="ltr">> > ballot 1 = 0b11 | ballot 1 = 0b11<br></div><div dir="ltr">> > ballot 1 = 0b01 | skipped<br></div><div dir="ltr">> > skipped | ballot 2 = 0b10<br></div><div dir="ltr">> > skipped | ballot 1 = 0b10<br></div><div dir="ltr">> > skipped | ballot 2 = 0b10<br></div><div dir="ltr">> ><br></div><div dir="ltr">> > Nicolai's proposal solves this by having the frontend emit a<br></div><div dir="ltr">> merge intrinsic before the i++ is emitted. This prevents the jump<br></div><div dir="ltr">> forwarding from occurring.<br></div><div dir="ltr">> ><br></div><div dir="ltr">> ><br></div><div dir="ltr">> > ><br></div><div dir="ltr">> > > About the CSE, when would that be legal? I can imagine with uniform<br></div><div dir="ltr">> > > branches that it could work, but would like to see an example to<br></div><div dir="ltr">> > > fully understand this.<br></div><div dir="ltr">> > ><br></div><div dir="ltr">> > > I agree that it would be more conservative than if we model the<br></div><div dir="ltr">> threading,<br></div><div dir="ltr">> > > but I'm not sure about the cost/benefit. I am mostly curious if<br></div><div dir="ltr">> it is<br></div><div dir="ltr">> > > possible to have a single-thread view or not. Then we would have to<br></div><div dir="ltr">> > > see if it is adequate.<br></div><div dir="ltr">> > ><br></div><div dir="ltr">> > > - Jan<br></div><div dir="ltr">> > ><br></div><div dir="ltr">> > > On Thursday, January 24, 2019, 10:31:47 AM EST, Connor Abbott<br></div><div dir="ltr">> <<a ymailto="mailto:cwabbott0@gmail.com" href="mailto:cwabbott0@gmail.com">cwabbott0@gmail.com</a> <mailto:<a ymailto="mailto:cwabbott0@gmail.com" href="mailto:cwabbott0@gmail.com">cwabbott0@gmail.com</a>>> wrote:<br></div><div dir="ltr">> > ><br></div><div dir="ltr">> > ><br></div><div dir="ltr">> > > I don't see how this would fix the continue vs. nested loop<br></div><div dir="ltr">> problem I<br></div><div dir="ltr">> > > explained earlier. That is, how would this prevent turning:<br></div><div dir="ltr">> > ><br></div><div dir="ltr">> > > for (...) {<br></div><div dir="ltr">> > > ballot();<br></div><div dir="ltr">> > > if (... /* non-uniform */) continue;<br></div><div dir="ltr">> > > }<br></div><div dir="ltr">> > ><br></div><div dir="ltr">> > > into<br></div><div dir="ltr">> > ><br></div><div dir="ltr">> > > for (...) {<br></div><div dir="ltr">> > > do {<br></div><div dir="ltr">> > > ballot();<br></div><div dir="ltr">> > > } while (... /* non-uniform */);<br></div><div dir="ltr">> > > }<br></div><div dir="ltr">> > ><br></div><div dir="ltr">> > > and vice versa? Note that there's no duplication going on here, and<br></div><div dir="ltr">> > > the single-threaded flow of control is exactly the same.<br></div><div dir="ltr">> > ><br></div><div dir="ltr">> > > Another reason this isn't so great is that it prevents e.g. CSE on<br></div><div dir="ltr">> > > ballots that actually should be combined, since you're<br></div><div dir="ltr">> modelling it as<br></div><div dir="ltr">> > > a write. It seems like the optimizer is going to need some special<br></div><div dir="ltr">> > > knowledge of convergent things that fake memory constraints<br></div><div dir="ltr">> can't give<br></div><div dir="ltr">> > > us.<br></div><div dir="ltr">> > ><br></div><div dir="ltr">> > > On Thu, Jan 24, 2019 at 4:06 PM Jan Sjodin<br></div><div dir="ltr">> <<a ymailto="mailto:jan_sjodin@yahoo.com" href="mailto:jan_sjodin@yahoo.com">jan_sjodin@yahoo.com</a> <mailto:<a ymailto="mailto:jan_sjodin@yahoo.com" href="mailto:jan_sjodin@yahoo.com">jan_sjodin@yahoo.com</a>>> wrote:<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > I was looking into ballot() and how if it is possible to keep<br></div><div dir="ltr">> a single-threaded<br></div><div dir="ltr">> > > > view of the code, but add some extra conditions that must<br></div><div dir="ltr">> hold after the<br></div><div dir="ltr">> > > > transformation. I had the initial idea that each call to<br></div><div dir="ltr">> ballot() in a<br></div><div dir="ltr">> > > > single-threaded program can be seen as a partial write to a<br></div><div dir="ltr">> memory<br></div><div dir="ltr">> > > > location, and each location memory location is unique for<br></div><div dir="ltr">> every call site,<br></div><div dir="ltr">> > > > plus there some externally observable side effect. We can<br></div><div dir="ltr">> abstract this<br></div><div dir="ltr">> > > > away by tagging the calls, e.g. by using aliases.<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > For example:<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > if (...) {<br></div><div dir="ltr">> > > > foo1 = ballot();<br></div><div dir="ltr">> > > > } else {<br></div><div dir="ltr">> > > > foo2 = ballot();<br></div><div dir="ltr">> > > > }<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > simply becomes:<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > if (...) {<br></div><div dir="ltr">> > > > foo1 = ballot_1();<br></div><div dir="ltr">> > > > } else {<br></div><div dir="ltr">> > > > foo2 = ballot_2();<br></div><div dir="ltr">> > > > }<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > and<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > if (...) {<br></div><div dir="ltr">> > > > } else {<br></div><div dir="ltr">> > > > }<br></div><div dir="ltr">> > > > ballot();<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > becomes<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > if (...) {<br></div><div dir="ltr">> > > > } else {<br></div><div dir="ltr">> > > > }<br></div><div dir="ltr">> > > > ballot_1();<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > In the first case it would prevent combining the two calls<br></div><div dir="ltr">> into one<br></div><div dir="ltr">> > > > after the if. In the second example there is generally<br></div><div dir="ltr">> nothing that<br></div><div dir="ltr">> > > > says it could not be transformed into the first example with two<br></div><div dir="ltr">> > > > calls to ballot_1(), which should not be allowed.<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > Another form of duplication that we must allow are loop<br></div><div dir="ltr">> transforms,<br></div><div dir="ltr">> > > > like unrolling or peeling. These might seem similar to the<br></div><div dir="ltr">> example<br></div><div dir="ltr">> > > > above, since we are cloning code and with conditions etc. But<br></div><div dir="ltr">> > > > they are different since they calls are in different loop<br></div><div dir="ltr">> iterations.<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > The condition that needs to be met is that:<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > There must be a single path between all cloned ballot_n()<br></div><div dir="ltr">> functions.<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > The reason for this condition is that if we clone the same<br></div><div dir="ltr">> call, then<br></div><div dir="ltr">> > > > the copies must be mutually exclusive, but if they are cloned<br></div><div dir="ltr">> from<br></div><div dir="ltr">> > > > a loop, there must be a path, or we would skip iterations.<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > If we want to be more sophisticated we can add:<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > If there is no such path, the calls must be separated by<br></div><div dir="ltr">> uniform branches.<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > After the transform everything should be re-tagged, since we<br></div><div dir="ltr">> already<br></div><div dir="ltr">> > > > checked the calls and we don't want to check them again.<br></div><div dir="ltr">> Also, not all<br></div><div dir="ltr">> > > > transforms need (or should) have the tagging checked. One<br></div><div dir="ltr">> example is<br></div><div dir="ltr">> > > > inlining, where multiple copies are created, but they are<br></div><div dir="ltr">> clearly different<br></div><div dir="ltr">> > > > calls. The tagging can be done temporarily for a single pass,<br></div><div dir="ltr">> and then<br></div><div dir="ltr">> > > > eliminated. This could be a good tool for debugging as well,<br></div><div dir="ltr">> since it can<br></div><div dir="ltr">> > > > detect if a transform is suspect.<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > The code would of course have to make sense as far as control<br></div><div dir="ltr">> flow. If<br></div><div dir="ltr">> > > > we have:<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > for(;;) {<br></div><div dir="ltr">> > > > if(...) {<br></div><div dir="ltr">> > > > ballot();<br></div><div dir="ltr">> > > > break;<br></div><div dir="ltr">> > > > }<br></div><div dir="ltr">> > > > }<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > This would have to be translated to:<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > for(;;) {<br></div><div dir="ltr">> > > > if(...) {<br></div><div dir="ltr">> > > > ballot();<br></div><div dir="ltr">> > > > if(UnknownConstant) { // Not a uniform condition, but<br></div><div dir="ltr">> later on translated to "true"<br></div><div dir="ltr">> > > > break;<br></div><div dir="ltr">> > > > }<br></div><div dir="ltr">> > > > }<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > However, I think that this is the way the code is generated<br></div><div dir="ltr">> today anyway.<br></div><div dir="ltr">> > > > There would have to be some attribute that indicate that<br></div><div dir="ltr">> these calls (or functions)<br></div><div dir="ltr">> > > > contain ballot or other cross-lane operations so they could<br></div><div dir="ltr">> be tagged and<br></div><div dir="ltr">> > > > checked. The attribute would be used by the passes to know<br></div><div dir="ltr">> that the special<br></div><div dir="ltr">> > > > conditions exist for those calls.<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > As far as what it means to have a path, it could be complicated.<br></div><div dir="ltr">> > > > For example:<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > x = ...<br></div><div dir="ltr">> > > > ballot_1();<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > could be transformed to:<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > if (x < 4711) {<br></div><div dir="ltr">> > > > ballot_1();<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > if(x >= 4711) {<br></div><div dir="ltr">> > > > ballot_1();<br></div><div dir="ltr">> > > > }<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > So a simple path check would say there is a path, and the<br></div><div dir="ltr">> transform is legal,<br></div><div dir="ltr">> > > > but if we examine the conditions, there is no path, and the<br></div><div dir="ltr">> transform should not be legal.<br></div><div dir="ltr">> > > > It could be made even more obscure of course, but I don't see<br></div><div dir="ltr">> any optimizations really<br></div><div dir="ltr">> > > > doing this kind of thing,<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > - Jan<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > On Saturday, December 29, 2018, 11:32:25 AM EST, Nicolai<br></div><div dir="ltr">> Hähnle via llvm-dev <<a ymailto="mailto:llvm-dev@lists.llvm.org" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br></div><div dir="ltr">> <mailto:<a ymailto="mailto:llvm-dev@lists.llvm.org" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>>> wrote:<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > On 20.12.18 18:03, Connor Abbott wrote:<br></div><div dir="ltr">> > > > > We already have the notion of "convergent" functions like<br></div><div dir="ltr">> > > > > syncthreads(), to which we cannot add control-flow<br></div><div dir="ltr">> dependencies.<br></div><div dir="ltr">> > > > > That is, it's legal to hoist syncthreads out of an "if",<br></div><div dir="ltr">> but it's<br></div><div dir="ltr">> > > > > not legal to sink it into an "if". It's not clear to me<br></div><div dir="ltr">> why we<br></div><div dir="ltr">> > > > > can't have "anticonvergent" (terrible name) functions<br></div><div dir="ltr">> which cannot<br></div><div dir="ltr">> > > > > have control-flow dependencies removed from them? <br></div><div dir="ltr">> ballot() would be<br></div><div dir="ltr">> > > > > both convergent and anticonvergent.<br></div><div dir="ltr">> > > > ><br></div><div dir="ltr">> > > > > Would that solve your problem?<br></div><div dir="ltr">> > > > ><br></div><div dir="ltr">> > > > ><br></div><div dir="ltr">> > > > > I think it's important to note that we already have such an<br></div><div dir="ltr">> attribute,<br></div><div dir="ltr">> > > > > although with the opposite sense - it's impossible to<br></div><div dir="ltr">> remove control<br></div><div dir="ltr">> > > > > flow dependencies from a call unless you mark it as<br></div><div dir="ltr">> "speculatable".<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > This isn't actually true. If both sides of an if/else have<br></div><div dir="ltr">> the same<br></div><div dir="ltr">> > > > non-speculative function call, it can still be moved out of<br></div><div dir="ltr">> control flow.<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > That's because doing so doesn't change anything at all from a<br></div><div dir="ltr">> > > > single-threaded perspective. Hence why I think we should<br></div><div dir="ltr">> model the<br></div><div dir="ltr">> > > > communication between threads honestly.<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > > However, this doesn't prevent<br></div><div dir="ltr">> > > > ><br></div><div dir="ltr">> > > > > if (...) {<br></div><div dir="ltr">> > > > > } else {<br></div><div dir="ltr">> > > > > }<br></div><div dir="ltr">> > > > > foo = ballot();<br></div><div dir="ltr">> > > > ><br></div><div dir="ltr">> > > > > from being turned into<br></div><div dir="ltr">> > > > ><br></div><div dir="ltr">> > > > > if (...) {<br></div><div dir="ltr">> > > > > foo1 = ballot();<br></div><div dir="ltr">> > > > > } else {<br></div><div dir="ltr">> > > > > foo2 = ballot();<br></div><div dir="ltr">> > > > > }<br></div><div dir="ltr">> > > > > foo = phi(foo1, foo2)<br></div><div dir="ltr">> > > > ><br></div><div dir="ltr">> > > > > and vice versa. We have a "noduplicate" attribute which<br></div><div dir="ltr">> prevents<br></div><div dir="ltr">> > > > > transforming the first into the second, but not the other<br></div><div dir="ltr">> way around. Of<br></div><div dir="ltr">> > > > > course we could keep going this way and add a "nocombine"<br></div><div dir="ltr">> attribute to<br></div><div dir="ltr">> > > > > complement noduplicate. But even then, there are even still<br></div><div dir="ltr">> problematic<br></div><div dir="ltr">> > > > > transforms. For example, take this program, which is<br></div><div dir="ltr">> simplified from a<br></div><div dir="ltr">> > > > > real game that doesn't work with the AMDGPU backend:<br></div><div dir="ltr">> > > > ><br></div><div dir="ltr">> > > > > while (cond1 /* uniform */) {<br></div><div dir="ltr">> > > > > ballot();<br></div><div dir="ltr">> > > > > ...<br></div><div dir="ltr">> > > > > if (cond2 /* non-uniform */) continue;<br></div><div dir="ltr">> > > > > ...<br></div><div dir="ltr">> > > > > }<br></div><div dir="ltr">> > > > ><br></div><div dir="ltr">> > > > > In SPIR-V, when using structured control flow, the<br></div><div dir="ltr">> semantics of this are<br></div><div dir="ltr">> > > > > pretty clearly defined. In particular, there's a continue<br></div><div dir="ltr">> block after<br></div><div dir="ltr">> > > > > the body of the loop where control flow re-converges, and<br></div><div dir="ltr">> the only back<br></div><div dir="ltr">> > > > > edge is from the continue block, so the ballot is in<br></div><div dir="ltr">> uniform control<br></div><div dir="ltr">> > > > > flow. But LLVM will get rid of the continue block since<br></div><div dir="ltr">> it's empty, and<br></div><div dir="ltr">> > > > > re-analyze the loop as two nested loops, splitting the loop<br></div><div dir="ltr">> header in<br></div><div dir="ltr">> > > > > two, producing a CFG which corresponds to this:<br></div><div dir="ltr">> > > > ><br></div><div dir="ltr">> > > > > while (cond1 /* uniform */) {<br></div><div dir="ltr">> > > > > do {<br></div><div dir="ltr">> > > > > ballot();<br></div><div dir="ltr">> > > > > ...<br></div><div dir="ltr">> > > > > } while (cond2 /* non-uniform */);<br></div><div dir="ltr">> > > > > ...<br></div><div dir="ltr">> > > > > }<br></div><div dir="ltr">> > > > ><br></div><div dir="ltr">> > > > > Now, in an implementation where control flow re-converges<br></div><div dir="ltr">> at the<br></div><div dir="ltr">> > > > > immediate post-dominator, this won't do the right thing<br></div><div dir="ltr">> anymore. In<br></div><div dir="ltr">> > > > > order to handle it correctly, you'd effectively need to<br></div><div dir="ltr">> always flatten<br></div><div dir="ltr">> > > > > nested loops, which will probably be really bad for<br></div><div dir="ltr">> performance if the<br></div><div dir="ltr">> > > > > programmer actually wanted the second thing. It also makes<br></div><div dir="ltr">> it impossible<br></div><div dir="ltr">> > > > > when translating a high-level language to LLVM to get the<br></div><div dir="ltr">> "natural"<br></div><div dir="ltr">> > > > > behavior which game developers actually expect. This is<br></div><div dir="ltr">> exactly the sort<br></div><div dir="ltr">> > > > > of "spooky action at a distance" which makes me think that<br></div><div dir="ltr">> everything<br></div><div dir="ltr">> > > > > we've done so far is really insufficient, and we need to<br></div><div dir="ltr">> add an explicit<br></div><div dir="ltr">> > > > > notion of control-flow divergence and reconvergence to the<br></div><div dir="ltr">> IR. We need a<br></div><div dir="ltr">> > > > > way to say that control flow re-converges at the continue<br></div><div dir="ltr">> block, so that<br></div><div dir="ltr">> > > > > LLVM won't eliminate it, and we can vectorize it correctly<br></div><div dir="ltr">> without<br></div><div dir="ltr">> > > > > penalizing cases where it's better for control flow not to<br></div><div dir="ltr">> re-converge.<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > Well said!<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > Cheers,<br></div><div dir="ltr">> > > ><br></div><div dir="ltr">> > > > Nicolai<br></div><div dir="ltr">> > > > --<br></div><div dir="ltr">> > > > Lerne, wie die Welt wirklich ist,<br></div><div dir="ltr">> > > > Aber vergiss niemals, wie sie sein sollte.<br></div><div dir="ltr">> > > > _______________________________________________<br></div><div dir="ltr">> > > > LLVM Developers mailing list<br></div><div dir="ltr">> > > > <a ymailto="mailto:llvm-dev@lists.llvm.org" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a> <mailto:<a ymailto="mailto:llvm-dev@lists.llvm.org" href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>><br></div><div dir="ltr">> > > > <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br></div><div dir="ltr">> <br></div><div dir="ltr"><br></div><div dir="ltr"><br></div><div dir="ltr">-- <br></div><div dir="ltr">Lerne, wie die Welt wirklich ist,<br></div><div dir="ltr">Aber vergiss niemals, wie sie sein sollte.<br></div></div>
</div>
</div></body></html>