[llvm-bugs] [Bug 27738] New: ptxas miscompile with divergent branches and convergent operations

via llvm-bugs llvm-bugs at lists.llvm.org
Fri May 13 13:07:27 PDT 2016


https://llvm.org/bugs/show_bug.cgi?id=27738

            Bug ID: 27738
           Summary: ptxas miscompile with divergent branches and
                    convergent operations
           Product: libraries
           Version: trunk
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: P
         Component: Backend: PTX
          Assignee: unassignedbugs at nondot.org
          Reporter: justin.lebar at gmail.com
                CC: llvm-bugs at lists.llvm.org
    Classification: Unclassified

We've discovered an apparent miscompile in ptxas with divergent branches and
convergent operations.  At the moment we do not have a workaround in LLVM, and
I'm not sure one is possible in general.  I'm filing this bug so that we have
something to point people to if and when they hit this problem.

As far as I can tell, this issue is not specific to LLVM; I think it should
affect any compiler which emits ptx, including nvcc.

== Background ==

When targeting an Nvidia GPU, LLVM compiles IR to ptx, a high-level,
device-independent assembly language.  clang then passes the generated ptx to
ptxas, a closed-source optimizing assembler that transforms the ptx into SASS,
the device-specific machine code.  The SASS ISA is not public, but a
disassembler ships with the CUDA toolkit, so we can (more or less) understand
the machine code.

Nvidia GPUs execute threads in groups of 32, called "warps".  All threads in a
warp execute in lockstep (so, if you will, the hardware is a 32-wide SIMD unit,
and a warp is basically a hardware thread).

When a warp encounters a conditional branch, some threads in a warp may take
one path, while others may take the other.  When this occurs, the warp is said
to "diverge".  When the hardware executes a divergent branch, it chooses a
direction to take, pushes the current IP onto a hardware stack, and continues
executing, with the SIMD lanes (i.e., "threads") which didn't take this
direction disabled.

At some point, the hardware encounters a special instruction (the ".S"-flagged
instructions discussed below) which tells it to reconverge.  At that point it
pops the hardware stack and executes the other side of the branch.  When the
second side reaches the special instruction, the warp reconverges: the SIMD
mask is reset, and execution continues.
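
As a concrete illustration, here's a trivial kernel of my own (it is not part
of the original report) that diverges and then reconverges:

  __global__ void diverge(int *out) {
    // Threads 0-15 take one side of the branch and threads 16-31 take the
    // other, so the warp diverges: the hardware runs one side with the other
    // side's SIMD lanes disabled, then runs the other side.
    if (threadIdx.x < 16)
      out[threadIdx.x] = 1;  // executed with lanes 16-31 masked off
    else
      out[threadIdx.x] = 2;  // executed with lanes 0-15 masked off
    // After the if/else, the warp reconverges and all 32 lanes run together.
  }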

(If you're looking for a detailed explanation of this mechanism, the best
reference I've found is this paper, by Bialas and Strzelecki:
http://arxiv.org/abs/1504.01650.)

Warp divergence can have large performance implications, but for the most part
programmers (and compiler developers) can ignore divergence for the purposes of
correctness.  However, some GPU instructions behave observably differently
when executed in a converged vs. a diverged state.  It's therefore important
that, if the programmer expects a particular convergent instruction to execute
in a converged state, the compiler maintain that invariant and not transform
the code so that the instruction ends up executing in a diverged state.
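
To make "observable differences" concrete, here's a small example of my own
(not from the report) using __ballot, a warp-vote instruction whose result
depends on which SIMD lanes are active when it executes:

  __global__ void ballot_demo(unsigned *out) {
    // Launched with a full warp of 32 threads, all lanes are active here,
    // so __ballot(1) returns 0xffffffff in every thread.
    unsigned converged = __ballot(1);
    unsigned diverged = 0;
    if (threadIdx.x % 2 == 0) {
      // Inside the divergent branch only the even lanes are active, so
      // __ballot(1) returns 0x55555555 instead.
      diverged = __ballot(1);
    }
    if (threadIdx.x == 0) {
      out[0] = converged;
      out[1] = diverged;
    }
  }

A compiler that moved the second __ballot out of the branch would change the
program's observable behavior; that's what makes such instructions
"convergent".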

It's this invariant that ptxas appears to be breaking.

== Details ==

Consider the following CUDA code.

  int *p = ...
  if (tid == 0) *p = 42;
  __syncthreads();
  if (tid == 32 && *p != 42) asm("trap;");

The intent here is that thread 0 in warp 0 sets *p to 42, then all warps wait
at the barrier.  Then thread 0 in warp 1 checks that *p == 42.

I'll use this example below, but I observe the same behavior if we do an
intra-, rather than inter-, warp check: i.e., s/tid == 32/tid == 1/.
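
For reference, the snippet embeds into a self-contained kernel roughly like
this (the report elides the definitions of p and tid; I'm assuming tid is
threadIdx.x and p points to global memory):

  __global__ void repro(int *p) {
    int tid = threadIdx.x;
    if (tid == 0) *p = 42;      // only thread 0 of warp 0 writes
    __syncthreads();            // all warps wait at the barrier
    if (tid == 32 && *p != 42)  // thread 0 of warp 1 checks the write
      asm("trap;");
  }

  // Launched with at least two warps in one block, e.g. repro<<<1, 64>>>(p);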

Clang generates the following ptx for this snippet:

    ld.u32 %r9, [%rd2];      // Load tid
    setp.eq.s32 %p5, %r9, 0; // if tid == 0 goto LBB12_4, else goto LBB12_5.
    @%p5 bra LBB12_4;
    bra.uni LBB12_5;
  LBB12_4:
    mov.u32 %r6, 42;
    st.u32 [%rd3], %r6;
  LBB12_5:
    bar.sync 0;               // __syncthreads()
    setp.ne.s32 %p6, %r9, 32; // if tid != 32 goto LBB12_11, else proceed.
    @%p6 bra LBB12_11;
    ld.u32 %r7, [%rd3];
    setp.eq.s32 %p7, %r7, 42; // if p[0] == 42 goto LBB12_11, else proceed.
    @%p7 bra LBB12_11;
    trap;
  LBB12_11:
  // proceed normally

As far as I can tell, there's nothing wrong with this. It's pretty similar to
what nvcc outputs.

When compiled within the appropriate context, this particular code traps with
ptxas -O1, but does not trap at -O2.

Here's the SASS at ptxas -O1. It's pretty much a literal translation of the
ptx.

           ISETP.NE.AND P0, PT, R0, RZ, PT;
  /*1f28*/ @P0 BRA 0x1f50;                    // branch if tid != 0
           MOV32I R0, 0x2a;
           ST.E [R4], R0;                     // p[0] = 42
  /*1f50*/ BAR.SYNC 0x0;                      // __syncthreads()
           ISETP.NE.AND P0, PT, R0, 0x20, PT;
           @P0 BRA 0x1fd8;                    // branch if tid != 32
           LD.E R0, [R4];
           ISETP.EQ.AND P0, PT, R0, 0x2a, PT; // branch if p[0] == 42
           @P0 BRA 0x1fd8;
           BPT.TRAP 0x1;
  /*1fd8*/ [proceed normally]

At ptxas -O2, the only interesting difference I see is that the two
instructions in the tid == 0 block are predicated, instead of being behind a
branch.

           ISETP.NE.AND P0, PT, R4, RZ, PT;
           @!P0 MOV32I R5, 0x2a;    // if tid == 0, R5 = 42
           @!P0 ST.E [R10], R5;     // if tid == 0, *p = R5.
           BAR.SYNC 0x0;            // __syncthreads()
           [same as above]

Again, the O1 code traps, and the O2 code does not.

My theory, based on this behavior and on what I can gather about how the branch
synchronization stack works, is that when we hit the conditional branch in the
O1 code at 0x1f28, the warp diverges, we take the branch, and we keep running
until we hit a .S instruction much later in our execution flow, which
re-synchronizes the warp. In particular, we execute our bar.sync **in a
divergent state**, which allows other warps to continue and read *p, which
hasn't yet been set to 42.  (Recall that bar.sync operates on warps, not
threads.)

The behavior is similar when we change the code to do an intra-warp check --
s/tid == 32/tid == 1/. Without a .S instruction before the bar.sync, the warp
just keeps running in a divergent fashion.

I have tested with ptxas 7.0, 7.5, and 8.0; I get the same behavior (trap vs.
no trap) with all of them, although I've only checked the SASS from ptxas 7.0.
I've only tested on my Tesla K40c.

== Discussion ==

Although the example above shows a difference in behavior with ptxas -O1 vs
-O2, it is possible to concoct examples that exhibit the bad behavior at ptxas
-O2.  In fact, one of the thrust [0] testcases fails at O2 with clang,
apparently due to this bug (thrust uses the "if thread 0, write some state,
then syncthreads" idiom in many places; a minimal sketch follows).

I unfortunately haven't been able to come up with a good minimized testcase for
this issue.  With a simple testcase based only on the code above, ptxas outputs
the appropriate .S instructions to resynchronize the warp.  You need something
more, but without the source code to ptxas, it's very hard to say what.  It's
possible ptxas is trying to insert the .S instruction in the right place, but
we're able to confuse it *just enough* that it does the wrong thing.

But the net result is that a convergent instruction placed after a divergent
branch does not necessarily do the right thing.

As far as I can tell, nvcc doesn't do anything special to avoid this issue.

I've spoken with Justin Holewinski from nvidia, and he's said that we've
rediscovered a longstanding, known issue in ptxas, and that he's not aware of a
workaround.

[0] https://github.com/thrust/thrust
