<div dir="ltr">Hi Justin, Yuan and Vinod,<div><br></div><div>It seems that what __syncthreads() requires in CUDA C++ (as opposed to PTX) is to be executed uniformly across all threads in the block and not just the warp. If so, it would be helpful if there were a precise statement about when a statement is considered to be executed uniformly in CUDA C++. Is there a precise statement somewhere from NVIDIA about this? I haven't found one so far.</div><div><br></div><div>In particular, it's not clear to me at what point diverging threads are considered to have joined up again in CUDA C++. My best guess is that this is at the immediate post-dominator of the statement that starts the divergence, with the caveat that there is an implicit shared CFG node following each return statement in a function.</div><div><br></div><div>Bjarke</div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Aug 21, 2015 at 4:51 PM, Jingyue Wu <span dir="ltr"><<a href="mailto:jingyue@google.com" target="_blank">jingyue@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Looking at <a href="http://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-bar" target="_blank">this section</a> in the PTX ISA, there's a sentence saying: <div><br></div>> In conditionally executed code, a bar instruction should only be used if it is known that all threads evaluate the<div>> condition identically (the warp does not diverge). <div><br></div><div>Does that mean __syncthreads should only be called uniformly when no threads diverge? If so, my <a href="http://sync.cu" target="_blank">sync.cu</a> example is undefined. The reason is that, although every threads reach __syncthreads, they are reaching them divergently:</div><div>1. threads diverge at the "if" statement</div><div>2. the warp runs __syncthreads() with half of the threads enabled</div><div>3. the warp jumps back to the "else" branch</div><div>4. the warp runs __syncthreads() with the other half of the threads enabled</div><div><br></div><div>If my understanding is correct (__syncthreads() can only be called when the warp doesn't diverge), unrolling a loop that contains a __syncthreads() and inlining a function that may call __syncthreads() are fine. Am I right? </div><span><font color="#888888"><div><br></div><div>Jingyue</div><div><br><br></div></font></span></div></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Aug 21, 2015 at 3:11 PM, Jingyue Wu <span dir="ltr"><<a href="mailto:jingyue@google.com" target="_blank">jingyue@google.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">I'm using 7.0. I am attaching the reduced example. <div><br></div><div>nvcc <a href="http://sync.cu" target="_blank">sync.cu</a> -arch=sm_35 -ptx</div><div><br></div><div>gives </div><div><br></div><div><div> // .globl _Z3foov</div><div>.visible .entry _Z3foov(</div><div><br></div><div>)</div><div>{</div><div> .reg .pred %p<2>;</div><div> .reg .s32 %r<3>;</div><div><br></div><div><br></div><div> mov.u32 %r1, %tid.x;</div><div> and.b32 %r2, %r1, 1;</div><div> setp.eq.b32 %p1, %r2, 1;</div><div> @!%p1 bra BB7_2;</div><div> bra.uni BB7_1;</div><div><br></div><div>BB7_1:</div><div> bar.sync 0;</div><div> bra.uni BB7_3;</div><div><br></div><div>BB7_2:</div><div> bar.sync 0;</div><div><br></div><div>BB7_3:</div><div> ret;</div><div>}</div></div><div><br></div><div>As you see, bar.sync is duplicated. </div></div><div><div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Aug 21, 2015 at 1:56 PM, Justin Holewinski <span dir="ltr"><<a href="mailto:jholewinski@nvidia.com" target="_blank">jholewinski@nvidia.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">That’s an interesting case. AFAIK, inlining should be restricted here for the reason you mention. Inlining should only be valid if it doesn’t duplicate the barrier. Which nvcc shows this behavior?<br>
<br>
Adding Vinod and Yuan for comment.<br>
<div><div><br>
> On Aug 21, 2015, at 4:24 PM, Jingyue Wu <<a href="mailto:jingyue@google.com" target="_blank">jingyue@google.com</a>> wrote:<br>
><br>
> Hi Justin,<br>
><br>
> Is a compiler allowed to inline a function that calls __syncthreads? I saw nvcc does that, but not sure it's valid though. For example,<br>
><br>
> void foo() {<br>
> __syncthreads();<br>
> }<br>
><br>
> if (threadIdx.x % 2 == 0) {<br>
> ...<br>
> foo();<br>
> } else {<br>
> ...<br>
> foo();<br>
> }<br>
><br>
> Before inlining, all threads meet at one __syncthreads(). After inlining<br>
><br>
> if (threadIdx.x % 2 == 0) {<br>
> ...<br>
> __syncthreads();<br>
> } else {<br>
> ...<br>
> __syncthreads();<br>
> }<br>
><br>
> The __syncthreads call is duplicated, and it's no longer guaranteed that all threads can meet one __syncthreads().<br>
><br>
> Any thoughts?<br>
><br>
> Jingyue<br>
<br>
<br>
</div></div>-----------------------------------------------------------------------------------<br>
This email message is for the sole use of the intended recipient(s) and may contain<br>
confidential information. Any unauthorized review, use, disclosure or distribution<br>
is prohibited. If you are not the intended recipient, please contact the sender by<br>
reply email and destroy all copies of the original message.<br>
-----------------------------------------------------------------------------------<br>
</blockquote></div><br></div>
</div></div></blockquote></div><br></div>
</div></div></blockquote></div><br></div></div>