<div dir="ltr">Hi,<br><div><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Nov 21, 2013 at 8:54 AM, Pekka Jääskeläinen <span dir="ltr"><<a href="mailto:pekka.jaaskelainen@tut.fi" target="_blank">pekka.jaaskelainen@tut.fi</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi all,<div class="im"><br>

<br>

On 11/20/2013 03:54 PM, David Tweed wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Note that in the OpenCL use case, the name precisely describes the intent:<br>

you can move barrier calls around, inline them, etc, if you can show that<br>

the semantics of OpenCL code is the same but (on some particular<br>

architectures) you aren't allowed to duplicate a given barrier call (due to<br>

implementation restrictions) even if otherwise the semantics was ok. This<br>

relates to the LLVM function attribute named noduplicate (as visible in the<br>

patch, obviously)<br>

</blockquote>

<br></div>

While I think 'noduplicate' is a fine workaround for the problem at hand,<br>

I get a feeling it also "throws the baby out with the bath water" a bit,<br>

disallowing some legal optimizations on SPMD programs in the process.<br>

<br></blockquote><div><br>My understanding of what was happening was that a _particular implementation_ of<br></div><div>OpenCL would have the ability to add the noduplicate attribute on its declaration of<br></div><div>

barriers if that's necessary given their implementation of the barrier; if an implementation<br>doesn't need it, it can declare barrier without the qualifier.<br><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


AFAIU, in the specific OpenCL case one is safe if one can prove the location<br>

you copy the barrier (or even inject a completely new one) to is non-diverging.<br>

<br></blockquote><div>Just to be clear: this stems from a difference between OpenCL abstract semantics and how<br></div><div>these things might be implemented on some particular compute platforms. OpenCL<br></div><div>just requires that all work-items wait for every work-item to complete at the _same_<br>

barrier. If an implementation can determine the "OpenCL-level" identity of a barrier<br></div><div>even after duplicating the call, it is free to do so and not annotate the barrier prototype.<br></div><div>Some implementations determine barrier identity based upon "the program counter" (in some sense) at the<br>

</div><div>point of the barrier call: on these implementations, duplicating the call (IF control flow diverges, as<br>you point out) breaks the implementation.<br></div><div><br></div><div>In terms of non-diverging flow, isn't that the case where either it's statically ascertainable<br>

what the control flow is so you aren't duplicating the<br>call "in the final code" since the not taken branch is removed, or there's a <br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">


Can you give an example of a case where one cannot duplicate<br>

a barrier call if the control dependencies at the duplicated<br>

barrier call site do not change per work-item?<br></blockquote><div> <br></div><div>No.<br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Why and how could some architecture restrict that? AFAIU, it should not be<br>

able to differentiate the copy from any other (user written) barrier,<br>

as "additional synchronization" should be safe in that case.<br>

<br>

E.g.:<br>

<br>

for (uniform_loop) {<br>

  if (uniform_cond) {<br>

     do_something;<br>

  } else {<br>

     do_something_else;<br>

  }<br>

  barrier();<br>

}<br>

<br>

Loop unswitching here might produce:<br>

<br>

if (uniform_variable) {<br>

  for (uniform_loop) {<br>

     do_something;<br>

     barrier();<br>

  }<br>

} else {<br>

  for (uniform_loop) {<br>

     do_something_else;<br>

     barrier();<br>

  }<br>

}<br>

<br>

These two new loops might be more easily horizontally or<br>

vertically parallelized (e.g. vectorized) and the kernel<br>

semantics is still correct, right?<span class="HOEnZb"><font color="#888888"><br></font></span></blockquote><div><br></div><div>If the first example was written such that the inner condition was actually based upon<br>an unknowable-but-uniform variable, then I can't see an issue with that. The question is how often<br>

</div><div>one gets a condition which is a uniform variable but which isn't so trivially determined<br></div><div>that dead code elimination leaves the final code with only one barrier. (The current implementation<br>

of things, AIUI, is purely local, so one can't have a chain of transformations which temporarily duplicates<br></div><div>such a noduplicate call before deleting one copy later, but that's more of an issue with the chain-of-transformations-each-valid<br>

approach than the noduplicate attribute. But that's a much bigger problem)<br></div><br></div>-- <br><div>cheers, dave tweed__________________________</div><div>high-performance computing and machine vision expert: <a href="mailto:david.tweed@gmail.com" target="_blank">david.tweed@gmail.com</a></div>

<div>"while having code so boring anyone can maintain it, use Python." -- attempted insult seen on slashdot</div><div> </div>

</div></div></div>
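<div>To make the point above about barrier identity concrete, here is a minimal sketch in Python. It is a toy model, not how any real OpenCL implementation or LLVM works. The names run_work_group, original, duplicated_divergent and duplicated_uniform are all hypothetical, and the "program counter" is modelled as a plain call-site label. Each work-item is a generator that yields the label of the barrier call site it has reached; the group may proceed only if every work-item is waiting at the same label.</div>

```python
def run_work_group(kernel, num_items):
    """Toy scheduler: advance every work-item to its next barrier.

    Returns True if, in every round, all work-items wait at the same
    call site (or have all finished), and False on the first mismatch.
    """
    items = [kernel(wi) for wi in range(num_items)]
    while True:
        # None stands for "this work-item has run to completion".
        sites = [next(item, None) for item in items]
        if all(site is None for site in sites):
            return True
        if len(set(sites)) != 1:
            # Work-items are waiting at different "program counters".
            return False


def original(wi):
    # Divergent branch, but a single barrier call site after it.
    if wi % 2 == 0:
        pass  # do_something
    else:
        pass  # do_something_else
    yield "site_A"


def duplicated_divergent(wi):
    # The barrier has been duplicated into both arms of a divergent branch.
    if wi % 2 == 0:
        yield "site_A1"
    else:
        yield "site_A2"


def duplicated_uniform(wi, cond=True):
    # Same duplication, but the condition is uniform across work-items.
    if cond:
        yield "site_A1"
    else:
        yield "site_A2"
```

<div>Under these assumptions, run_work_group(original, 4) and run_work_group(duplicated_uniform, 4) both succeed, while run_work_group(duplicated_divergent, 4) reports a mismatch. That matches the discussion: duplication only breaks a call-site-based implementation when control flow diverges between work-items.</div>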