[llvm-dev] [CUDA/NVPTX] is inlining __syncthreads allowed?
Jingyue Wu via llvm-dev
llvm-dev at lists.llvm.org
Fri Aug 21 16:51:27 PDT 2015
Looking at this section
<http://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-bar>
in
the PTX ISA, there's a sentence saying:
> In conditionally executed code, a bar instruction should only be used if
it is known that all threads evaluate the
> condition identically (the warp does not diverge).
Does that mean __syncthreads should only be called uniformly when no
threads diverge? If so, my sync.cu example is undefined. The reason is
that, although every threads reach __syncthreads, they are reaching them
divergently:
1. threads diverge at the "if" statement
2. the warp runs __syncthreads() with half of the threads enabled
3. the warp jumps back to the "else" branch
4. the warp runs __syncthreads() with the other half of the threads enabled
If my understanding is correct (__syncthreads() can only be called when the
warp doesn't diverge), unrolling a loop that contains a __syncthreads() and
inlining a function that may call __syncthreads() are fine. Am I right?
Jingyue
On Fri, Aug 21, 2015 at 3:11 PM, Jingyue Wu <jingyue at google.com> wrote:
> I'm using 7.0. I am attaching the reduced example.
>
> nvcc sync.cu -arch=sm_35 -ptx
>
> gives
>
> // .globl _Z3foov
> .visible .entry _Z3foov(
>
> )
> {
> .reg .pred %p<2>;
> .reg .s32 %r<3>;
>
>
> mov.u32 %r1, %tid.x;
> and.b32 %r2, %r1, 1;
> setp.eq.b32 %p1, %r2, 1;
> @!%p1 bra BB7_2;
> bra.uni BB7_1;
>
> BB7_1:
> bar.sync 0;
> bra.uni BB7_3;
>
> BB7_2:
> bar.sync 0;
>
> BB7_3:
> ret;
> }
>
> As you see, bar.sync is duplicated.
>
> On Fri, Aug 21, 2015 at 1:56 PM, Justin Holewinski <jholewinski at nvidia.com
> > wrote:
>
>> That’s an interesting case. AFAIK, inlining should be restricted here
>> for the reason you mention. Inlining should only be valid if it doesn’t
>> duplicate the barrier. Which nvcc shows this behavior?
>>
>> Adding Vinod and Yuan for comment.
>>
>> > On Aug 21, 2015, at 4:24 PM, Jingyue Wu <jingyue at google.com> wrote:
>> >
>> > Hi Justin,
>> >
>> > Is a compiler allowed to inline a function that calls __syncthreads? I
>> saw nvcc does that, but not sure it's valid though. For example,
>> >
>> > void foo() {
>> > __syncthreads();
>> > }
>> >
>> > if (threadIdx.x % 2 == 0) {
>> > ...
>> > foo();
>> > } else {
>> > ...
>> > foo();
>> > }
>> >
>> > Before inlining, all threads meet at one __syncthreads(). After inlining
>> >
>> > if (threadIdx.x % 2 == 0) {
>> > ...
>> > __syncthreads();
>> > } else {
>> > ...
>> > __syncthreads();
>> > }
>> >
>> > The __syncthreads call is duplicated, and it's no longer guaranteed
>> that all threads can meet one __syncthreads().
>> >
>> > Any thoughts?
>> >
>> > Jingyue
>>
>>
>>
>> -----------------------------------------------------------------------------------
>> This email message is for the sole use of the intended recipient(s) and
>> may contain
>> confidential information. Any unauthorized review, use, disclosure or
>> distribution
>> is prohibited. If you are not the intended recipient, please contact the
>> sender by
>> reply email and destroy all copies of the original message.
>>
>> -----------------------------------------------------------------------------------
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150821/52827105/attachment.html>
More information about the llvm-dev
mailing list