<html>
<head>
<base href="https://bugs.llvm.org/">
</head>
<body><table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Bug ID</th>
<td><a class="bz_bug_link
bz_status_NEW "
title="NEW - Loop vectorization generates poor code for simple integer loop"
href="https://bugs.llvm.org/show_bug.cgi?id=37426">37426</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>Loop vectorization generates poor code for simple integer loop
</td>
</tr>
<tr>
<th>Product</th>
<td>libraries
</td>
</tr>
<tr>
<th>Version</th>
<td>6.0
</td>
</tr>
<tr>
<th>Hardware</th>
<td>PC
</td>
</tr>
<tr>
<th>OS</th>
<td>Windows NT
</td>
</tr>
<tr>
<th>Status</th>
<td>NEW
</td>
</tr>
<tr>
<th>Severity</th>
<td>enhancement
</td>
</tr>
<tr>
<th>Priority</th>
<td>P
</td>
</tr>
<tr>
<th>Component</th>
<td>Loop Optimizer
</td>
</tr>
<tr>
<th>Assignee</th>
<td>unassignedbugs@nondot.org
</td>
</tr>
<tr>
<th>Reporter</th>
<td>fabiang@radgametools.com
</td>
</tr>
<tr>
<th>CC</th>
<td>llvm-bugs@lists.llvm.org
</td>
</tr></table>
<p>
<div>
<pre>clang 6.0 with "-O2" on x86-64 produces an enormous amount of code for the
simple C function below, much of it dubious (encountered while investigating a
different bug):
void fancierRotate2(unsigned int *arr, const bool *control, int count,
                    int rot0, int rot1)
{
    for (int i = 0; i < count; ++i)
    {
        int rot = control[i] ? rot1 : rot0;
        arr[i] = (arr[i] << (rot & 31)) | (arr[i] >> (-rot & 31));
    }
}
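For reference, the loop body is the usual rotate-left idiom. A minimal
standalone sketch of just that expression follows; rotl32 is a hypothetical
helper name (it is not part of the function above), and the sketch assumes a
32-bit unsigned int:

```c
/* Rotate x left by (rot mod 32) bits.
 * Hypothetical helper for illustration; assumes unsigned int is 32 bits wide.
 * Masking both shift counts with 31 keeps them in the range 0..31. */
unsigned int rotl32(unsigned int x, int rot)
{
    return (x << (rot & 31)) | (x >> (-rot & 31));
}
```

With rot = 0 both shift counts are 0, so the expression returns x unchanged
and never shifts by 32.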
I won't post the (long) disassembly here, but here's a Compiler Explorer link:
<a href="https://godbolt.org/g/ss4PXM">https://godbolt.org/g/ss4PXM</a>
By contrast, with "-fno-vectorize", the inner loop gets unrolled 2x but is
still short enough to paste (nitpick: why no CMOVs? Otherwise OK):
.LBB0_9:                                # =>This Inner Loop Header: Depth=1
        cmpb    $0, (%rsi,%rdx)
        movl    %eax, %ecx
        je      .LBB0_11
        movl    %r8d, %ecx
.LBB0_11:                               # in Loop: Header=BB0_9 Depth=1
        roll    %cl, (%rdi,%rdx,4)
        cmpb    $0, 1(%rsi,%rdx)
        movl    %eax, %ecx
        je      .LBB0_13
        movl    %r8d, %ecx
.LBB0_13:                               # in Loop: Header=BB0_9 Depth=1
        roll    %cl, 4(%rdi,%rdx,4)
        addq    $2, %rdx
        cmpq    %rdx, %r10
        jne     .LBB0_9
There are several issues at play in that snippet (which I'll try to file as
separate bugs), but first and foremost, the profitability heuristic seems way
off here. Purely going by dynamic instruction count (and ignoring uop counts
and macro-fusion), the non-vectorized version spends around 12 instructions to
process every 2 items, whereas the vectorized version spends 90 for 8. The
scalar version as-is (without CMOVs) might run into frequent mispredicted
branches depending on "control", but purely going by the amount of code-size
blow-up, this seems questionable. (If nothing else, *both* vectorizing 4-wide
and unrolling the result 2x seems a tad much.)
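To put the two figures on the same footing, they can be normalized per array
element. A small sketch, where insns_per_item is a hypothetical helper and the
inputs are the approximate counts quoted above, not measured values:

```c
/* Average dynamic instructions spent per processed array element.
 * Hypothetical helper; the counts passed in are rough estimates. */
double insns_per_item(int insns_per_iteration, int items_per_iteration)
{
    return (double)insns_per_iteration / items_per_iteration;
}
```

insns_per_item(12, 2) gives 6 for the scalar loop and insns_per_item(90, 8)
gives 11.25 for the vectorized one, so by this crude metric the vectorized
loop executes nearly twice as many instructions per element.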
It feels like the vectorizer isn't accounting for the fact that vectorizing
per-lane variable shifts turns into quite a production on pre-AVX2 x86.</pre>
</div>
</p>
</body>
</html>