<div dir="ltr">Hi Justin, Yuan and Vinod,<div><br></div><div>It seems that what __syncthreads() requires in CUDA C++ (as opposed to PTX) is to be executed uniformly across all threads in the block and not just the warp. If so, it would be helpful if there were a precise statement about when a statement is considered to be executed uniformly in CUDA C++. Is there a precise statement somewhere from NVIDIA about this? I haven't found one so far.</div><div><br></div><div>In particular, it's not clear to me at what point diverging threads are considered to have joined up again in CUDA C++. My best guess is that this is at the immediate post-dominator of the statement that starts the divergence, with the caveat that there is an implicit shared CFG node following each return statement in a function.</div><div><br></div><div>Bjarke</div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Aug 21, 2015 at 4:51 PM, Jingyue Wu <span dir="ltr"><<a href="mailto:jingyue@google.com" target="_blank">jingyue@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Looking at <a href="http://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-bar" target="_blank">this section</a> in the PTX ISA, there's a sentence saying: <div><br></div>> In conditionally executed code, a bar instruction should only be used if it is known that all threads evaluate the<div>> condition identically (the warp does not diverge). <div><br></div><div>Does that mean __syncthreads should only be called uniformly when no threads diverge? If so, my <a href="http://sync.cu" target="_blank">sync.cu</a> example is undefined. The reason is that, although every threads reach __syncthreads, they are reaching them divergently:</div><div>1. threads diverge at the "if" statement</div><div>2. the warp runs __syncthreads() with half of the threads enabled</div><div>3. the warp jumps back to the "else" branch</div><div>4. the warp runs __syncthreads() with the other half of the threads enabled</div><div><br></div><div>If my understanding is correct (__syncthreads() can only be called when the warp doesn't diverge), unrolling a loop that contains a __syncthreads() and inlining a function that may call __syncthreads() are fine. Am I right? </div><span><font color="#888888"><div><br></div><div>Jingyue</div><div><br><br></div></font></span></div></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Aug 21, 2015 at 3:11 PM, Jingyue Wu <span dir="ltr"><<a href="mailto:jingyue@google.com" target="_blank">jingyue@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">I'm using 7.0. I am attaching the reduced example. <div><br></div><div>nvcc <a href="http://sync.cu" target="_blank">sync.cu</a> -arch=sm_35 -ptx</div><div><br></div><div>gives </div><div><br></div><div><div>        // .globl       _Z3foov</div><div>.visible .entry _Z3foov(</div><div><br></div><div>)</div><div>{</div><div>        .reg .pred      %p<2>;</div><div>        .reg .s32       %r<3>;</div><div><br></div><div><br></div><div>        mov.u32         %r1, %tid.x;</div><div>        and.b32         %r2, %r1, 1;</div><div>        setp.eq.b32     %p1, %r2, 1;</div><div>        @!%p1 bra       BB7_2;</div><div>        bra.uni         BB7_1;</div><div><br></div><div>BB7_1:</div><div>        bar.sync        0;</div><div>        bra.uni         BB7_3;</div><div><br></div><div>BB7_2:</div><div>        bar.sync        0;</div><div><br></div><div>BB7_3:</div><div>        ret;</div><div>}</div></div><div><br></div><div>As you see, bar.sync is duplicated. </div></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Aug 21, 2015 at 1:56 PM, Justin Holewinski <span dir="ltr"><<a href="mailto:jholewinski@nvidia.com" target="_blank">jholewinski@nvidia.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">That’s an interesting case.  AFAIK, inlining should be restricted here for the reason you mention.  Inlining should only be valid if it doesn’t duplicate the barrier.  Which nvcc shows this behavior?<br>

<br>

Adding Vinod and Yuan for comment.<br>

<div><div><br>

> On Aug 21, 2015, at 4:24 PM, Jingyue Wu <<a href="mailto:jingyue@google.com" target="_blank">jingyue@google.com</a>> wrote:<br>

><br>

> Hi Justin,<br>

><br>

> Is a compiler allowed to inline a function that calls __syncthreads? I saw nvcc does that, but not sure it's valid though. For example,<br>

><br>

> void foo() {<br>

>   __syncthreads();<br>

> }<br>

><br>

> if (threadIdx.x % 2 == 0) {<br>

>   ...<br>

>   foo();<br>

> } else {<br>

>   ...<br>

>   foo();<br>

> }<br>

><br>

> Before inlining, all threads meet at one __syncthreads(). After inlining<br>

><br>

> if (threadIdx.x % 2 == 0) {<br>

>   ...<br>

>   __syncthreads();<br>

> } else {<br>

>   ...<br>

>   __syncthreads();<br>

> }<br>

><br>

> The __syncthreads call is duplicated, and it's no longer guaranteed that all threads can meet one __syncthreads().<br>

><br>

> Any thoughts?<br>

><br>

> Jingyue<br>

<br>

<br>

</div></div>-----------------------------------------------------------------------------------<br>

This email message is for the sole use of the intended recipient(s) and may contain<br>

confidential information.  Any unauthorized review, use, disclosure or distribution<br>

is prohibited.  If you are not the intended recipient, please contact the sender by<br>

reply email and destroy all copies of the original message.<br>

-----------------------------------------------------------------------------------<br>

</blockquote></div><br></div>

</div></div></blockquote></div><br></div>

</div></div></blockquote></div><br></div></div>