[llvm-bugs] [Bug 48771] New: [NVPTX] Miscompilation in trivial fixed-stride loop

Sat Jan 16 00:49:17 PST 2021

https://bugs.llvm.org/show_bug.cgi?id=48771

            Bug ID: 48771
           Summary: [NVPTX] Miscompilation in trivial fixed-stride loop
           Product: libraries
           Version: 11.0
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: release blocker
          Priority: P
         Component: Backend: PTX
          Assignee: unassignedbugs at nondot.org
          Reporter: bugs-llvm at fabian-knorr.info
                CC: llvm-bugs at lists.llvm.org

Created attachment 24387
  --> https://bugs.llvm.org/attachment.cgi?id=24387&action=edit
CUDA source, correct (96 elements) and incorrect (64 element) PTX / SASS

The following CUDA kernel, which should perform two loop iterations per thread
to copy 64 elements, stops short after one iteration (32 elements) when built
with -O3.

    __global__ void kernel(const uint32_t *in, uint32_t *out) {
        for (int i = threadIdx.x; i < 64; i += 32) {
            out[i] = in[i];
        }
    }

    kernel<<<1, 32>>>(in_buf, out_buf);

The full example code is attached. Both clang -O0 and nvcc produce the expected
behavior.

The issue disappears when the constant stride 32 is replaced with blockDim.x
(which also evaluates to 32) or when the number of elements is increased to 96,
for which three iterations are performed as expected.

The PTX (attached) produced by the 64- and 96-element variants is very similar.
Here's a diff of the loop body:

    --- bug-64.ptx
    +++ bug-96.ptx
    @@ -131,11 +131,10 @@
     LBB6_2:
     ld.global.u32 %r5, [%rd13];
     st.global.u32 [%rd12], %r5;    // previously initialized: %r6 := i-32
    -add.s32 %r4, %r6, 32;          // [64] %r4 := i
    -setp.lt.u32 %p2, %r4, %r6;     // [64] %p2 := i < (i-32) <=> i < 32
    +add.s32 %r6, %r6, 32;          // [96] i += 32
     add.s64 %rd13, %rd13, 128;
     add.s64 %rd12, %rd12, 128;
    -mov.u32 %r6, %r4;              // [64] i += 32
    +setp.lt.u32 %p2, %r6, 64;      // [96] %p2 := i < 64
     @%p2 bra LBB6_2;               // next iteration if %p2 
     LBB6_3:

So for the 64-iteration variant, there is no explicit comparison of i against
the loop count, but a rather convoluted i < (i-32) comparison. While the math
looks correct, I wonder if LLVM is making wrong assumptions about the
definedness of overflow in add.s32 (signed integer addition).

SASS assembly is also attached, in which I'm not fluent unfortunately.

* Offending command line: clang -x cuda bug.cu -obug --cuda-gpu-arch=sm_61 -O3
* Clang version tested: 11.0.1
* CUDA versions tested: 10.1 (latest officially supported by Clang) and 11.2
* Systems tested: x86_64 host and sm_61 / sm_75 GPUs

-- 
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-bugs/attachments/20210116/92a10872/attachment.html>