[llvm-bugs] [Bug 37426] New: Loop vectorization generates poor code for simple integer loop
via llvm-bugs
llvm-bugs at lists.llvm.org
Fri May 11 14:11:44 PDT 2018
https://bugs.llvm.org/show_bug.cgi?id=37426
Bug ID: 37426
Summary: Loop vectorization generates poor code for simple
integer loop
Product: libraries
Version: 6.0
Hardware: PC
OS: Windows NT
Status: NEW
Severity: enhancement
Priority: P
Component: Loop Optimizer
Assignee: unassignedbugs at nondot.org
Reporter: fabiang at radgametools.com
CC: llvm-bugs at lists.llvm.org
clang 6.0 with "-O2" on x86-64 produces an enormous amount of code for the
simple C function below, much of it dubious (encountered while investigating a
different bug):
void fancierRotate2(unsigned int *arr, const bool *control, int count,
                    int rot0, int rot1)
{
    for (int i = 0; i < count; ++i)
    {
        int rot = control[i] ? rot1 : rot0;
        arr[i] = (arr[i] << (rot & 31)) | (arr[i] >> (-rot & 31));
    }
}
I won't post the (long) disassembly here, but here's a Compiler Explorer link:
https://godbolt.org/g/ss4PXM
By contrast, with "-fno-vectorize", the inner loop gets unrolled 2x but is
still short enough to paste (nitpick: why no CMOVs?, but otherwise OK):
.LBB0_9:                                # =>This Inner Loop Header: Depth=1
        cmpb    $0, (%rsi,%rdx)
        movl    %eax, %ecx
        je      .LBB0_11
        movl    %r8d, %ecx
.LBB0_11:                               # in Loop: Header=BB0_9 Depth=1
        roll    %cl, (%rdi,%rdx,4)
        cmpb    $0, 1(%rsi,%rdx)
        movl    %eax, %ecx
        je      .LBB0_13
        movl    %r8d, %ecx
.LBB0_13:                               # in Loop: Header=BB0_9 Depth=1
        roll    %cl, 4(%rdi,%rdx,4)
        addq    $2, %rdx
        cmpq    %rdx, %r10
        jne     .LBB0_9
There are several issues at play in that snippet (which I'll try to file as
separate bugs), but first and foremost, the profitability heuristic seems way
off here. Purely going by dynamic instruction count (and ignoring uop counts
and macro-fusion), the non-vectorized version spends around 12 instructions to
process every 2 items, whereas the vectorized version spends 90 for 8. The
scalar version as-is (without CMOVs) might run into frequent mispredicted
branches depending on "control", but purely going by the amount of code-size
blow-up, this seems questionable. (If nothing else, *both* vectorizing 4-wide
and unrolling the result 2x seems a tad much.)
It feels like the vectorizer isn't accounting for the fact that vectorizing
per-lane variable shifts turns into quite a production on pre-AVX2 x86.