[llvm-dev] [CUDA/NVPTX] is inlining __syncthreads allowed?

Fri Aug 21 16:51:27 PDT 2015

Looking at this section
<http://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-bar>
in
the PTX ISA, there's a sentence saying:

> In conditionally executed code, a bar instruction should only be used if
it is known that all threads evaluate the
> condition identically (the warp does not diverge).

Does that mean __syncthreads should only be called uniformly when no
threads diverge? If so, my sync.cu example is undefined. The reason is
that, although every threads reach __syncthreads, they are reaching them
divergently:
1. threads diverge at the "if" statement
2. the warp runs __syncthreads() with half of the threads enabled
3. the warp jumps back to the "else" branch
4. the warp runs __syncthreads() with the other half of the threads enabled

If my understanding is correct (__syncthreads() can only be called when the
warp doesn't diverge), unrolling a loop that contains a __syncthreads() and
inlining a function that may call __syncthreads() are fine. Am I right?

Jingyue

On Fri, Aug 21, 2015 at 3:11 PM, Jingyue Wu <jingyue at google.com> wrote:

> I'm using 7.0. I am attaching the reduced example.
>
> nvcc sync.cu -arch=sm_35 -ptx
>
> gives
>
>         // .globl       _Z3foov
> .visible .entry _Z3foov(
>
> )
> {
>         .reg .pred      %p<2>;
>         .reg .s32       %r<3>;
>
>
>         mov.u32         %r1, %tid.x;
>         and.b32         %r2, %r1, 1;
>         setp.eq.b32     %p1, %r2, 1;
>         @!%p1 bra       BB7_2;
>         bra.uni         BB7_1;
>
> BB7_1:
>         bar.sync        0;
>         bra.uni         BB7_3;
>
> BB7_2:
>         bar.sync        0;
>
> BB7_3:
>         ret;
> }
>
> As you see, bar.sync is duplicated.
>
> On Fri, Aug 21, 2015 at 1:56 PM, Justin Holewinski <jholewinski at nvidia.com
> > wrote:
>
>> That’s an interesting case.  AFAIK, inlining should be restricted here
>> for the reason you mention.  Inlining should only be valid if it doesn’t
>> duplicate the barrier.  Which nvcc shows this behavior?
>>
>> Adding Vinod and Yuan for comment.
>>
>> > On Aug 21, 2015, at 4:24 PM, Jingyue Wu <jingyue at google.com> wrote:
>> >
>> > Hi Justin,
>> >
>> > Is a compiler allowed to inline a function that calls __syncthreads? I
>> saw nvcc does that, but not sure it's valid though. For example,
>> >
>> > void foo() {
>> >   __syncthreads();
>> > }
>> >
>> > if (threadIdx.x % 2 == 0) {
>> >   ...
>> >   foo();
>> > } else {
>> >   ...
>> >   foo();
>> > }
>> >
>> > Before inlining, all threads meet at one __syncthreads(). After inlining
>> >
>> > if (threadIdx.x % 2 == 0) {
>> >   ...
>> >   __syncthreads();
>> > } else {
>> >   ...
>> >   __syncthreads();
>> > }
>> >
>> > The __syncthreads call is duplicated, and it's no longer guaranteed
>> that all threads can meet one __syncthreads().
>> >
>> > Any thoughts?
>> >
>> > Jingyue
>>
>>
>>
>> -----------------------------------------------------------------------------------
>> This email message is for the sole use of the intended recipient(s) and
>> may contain
>> confidential information.  Any unauthorized review, use, disclosure or
>> distribution
>> is prohibited.  If you are not the intended recipient, please contact the
>> sender by
>> reply email and destroy all copies of the original message.
>>
>> -----------------------------------------------------------------------------------
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150821/52827105/attachment.html>