[llvm-bugs] [Bug 48771] New: [NVPTX] Miscompilation in trivial fixed-stride loop
via llvm-bugs
llvm-bugs at lists.llvm.org
Sat Jan 16 00:49:17 PST 2021
https://bugs.llvm.org/show_bug.cgi?id=48771
Bug ID: 48771
Summary: [NVPTX] Miscompilation in trivial fixed-stride loop
Product: libraries
Version: 11.0
Hardware: PC
OS: Linux
Status: NEW
Severity: release blocker
Priority: P
Component: Backend: PTX
Assignee: unassignedbugs at nondot.org
Reporter: bugs-llvm at fabian-knorr.info
CC: llvm-bugs at lists.llvm.org
Created attachment 24387
--> https://bugs.llvm.org/attachment.cgi?id=24387&action=edit
CUDA source, correct (96 elements) and incorrect (64 element) PTX / SASS
The following CUDA kernel, which should perform two loop iterations per thread
to copy 64 elements, stops short after one iteration (32 elements) when built
with -O3.
__global__ void kernel(const uint32_t *in, uint32_t *out) {
for (int i = threadIdx.x; i < 64; i += 32) {
out[i] = in[i];
}
}
kernel<<<1, 32>>>(in_buf, out_buf);
The full example code is attached. Both clang -O0 and nvcc produce the expected
behavior.
The issue disappears when the constant stride 32 is replaced with blockDim.x
(which also evaluates to 32) or when the number of elements is increased to 96,
for which three iterations are performed as expected.
The PTX (attached) produced by the 64- and 96-element variants is very similar.
Here's a diff of the loop body:
--- bug-64.ptx
+++ bug-96.ptx
@@ -131,11 +131,10 @@
LBB6_2:
ld.global.u32 %r5, [%rd13];
st.global.u32 [%rd12], %r5; // previously initialized: %r6 := i-32
-add.s32 %r4, %r6, 32; // [64] %r4 := i
-setp.lt.u32 %p2, %r4, %r6; // [64] %p2 := i < (i-32) <=> i < 32
+add.s32 %r6, %r6, 32; // [96] i += 32
add.s64 %rd13, %rd13, 128;
add.s64 %rd12, %rd12, 128;
-mov.u32 %r6, %r4; // [64] i += 32
+setp.lt.u32 %p2, %r6, 64; // [96] %p2 := i < 64
@%p2 bra LBB6_2; // next iteration if %p2
LBB6_3:
So for the 64-iteration variant, there is no explicit comparison of i against
the loop count, but a rather convoluted i < (i-32) comparison. While the math
looks correct, I wonder if LLVM is making wrong assumptions about the
definedness of overflow in add.s32 (signed integer addition).
SASS assembly is also attached, in which I'm not fluent unfortunately.
* Offending command line: clang -x cuda bug.cu -obug --cuda-gpu-arch=sm_61 -O3
* Clang version tested: 11.0.1
* CUDA versions tested: 10.1 (latest officially supported by Clang) and 11.2
* Systems tested: x86_64 host and sm_61 / sm_75 GPUs
--
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-bugs/attachments/20210116/92a10872/attachment.html>
More information about the llvm-bugs
mailing list