<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - [NVPTX] Miscompilation in trivial fixed-stride loop"
href="https://bugs.llvm.org/show_bug.cgi?id=48771">48771</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>[NVPTX] Miscompilation in trivial fixed-stride loop
</td>
</tr>
<tr>
<th>Product</th>
<td>libraries
</td>
</tr>
<tr>
<th>Version</th>
<td>11.0
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Linux
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>release blocker
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Backend: PTX
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>bugs-llvm@fabian-knorr.info
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org
</td>
</tr></table>
<p>
<div>
<pre>Created <span class=""><a href="attachment.cgi?id=24387" name="attach_24387" title="CUDA source, correct (96 elements) and incorrect (64 element) PTX / SASS">attachment 24387</a> <a href="attachment.cgi?id=24387&action=edit" title="CUDA source, correct (96 elements) and incorrect (64 element) PTX / SASS">[details]</a></span>
CUDA source, correct (96 elements) and incorrect (64 element) PTX / SASS
The following CUDA kernel, which should perform two loop iterations per thread
to copy 64 elements, stops short after one iteration (32 elements) when built
with -O3.
__global__ void kernel(const uint32_t *in, uint32_t *out) {
for (int i = threadIdx.x; i < 64; i += 32) {
out[i] = in[i];
}
}
kernel<<<1, 32>>>(in_buf, out_buf);
The full example code is attached. Both clang -O0 and nvcc produce the expected
behavior.
The issue disappears when the constant stride 32 is replaced with blockDim.x
(which also evaluates to 32) or when the number of elements is increased to 96,
for which three iterations are performed as expected.
The PTX (attached) produced by the 64- and 96-element variants is very similar.
Here's a diff of the loop body:
--- bug-64.ptx
+++ bug-96.ptx
@@ -131,11 +131,10 @@
LBB6_2:
ld.global.u32 %r5, [%rd13];
st.global.u32 [%rd12], %r5; // previously initialized: %r6 := i-32
-add.s32 %r4, %r6, 32; // [64] %r4 := i
-setp.lt.u32 %p2, %r4, %r6; // [64] %p2 := i < (i-32) <=> i < 32
+add.s32 %r6, %r6, 32; // [96] i += 32
add.s64 %rd13, %rd13, 128;
add.s64 %rd12, %rd12, 128;
-mov.u32 %r6, %r4; // [64] i += 32
+setp.lt.u32 %p2, %r6, 64; // [96] %p2 := i < 64
@%p2 bra LBB6_2; // next iteration if %p2
LBB6_3:
So for the 64-iteration variant, there is no explicit comparison of i against
the loop count, but a rather convoluted i < (i-32) comparison. While the math
looks correct, I wonder if LLVM is making wrong assumptions about the
definedness of overflow in add.s32 (signed integer addition).
SASS assembly is also attached, in which I'm not fluent unfortunately.
* Offending command line: clang -x cuda bug.cu -obug --cuda-gpu-arch=sm_61 -O3
* Clang version tested: 11.0.1
* CUDA versions tested: 10.1 (latest officially supported by Clang) and 11.2
* Systems tested: x86_64 host and sm_61 / sm_75 GPUs</pre>
</div>
</p>
<hr>
<span>You are receiving this mail because:</span>
<ul>
<li>You are on the CC list for the bug.</li>
</ul>
</body>
</html>