[llvm-bugs] [Bug 37426] New: Loop vectorization generates poor code for simple integer loop
via llvm-bugs
llvm-bugs at lists.llvm.org
Fri May 11 14:11:44 PDT 2018
https://bugs.llvm.org/show_bug.cgi?id=37426
Bug ID: 37426
Summary: Loop vectorization generates poor code for simple
integer loop
Product: libraries
Version: 6.0
Hardware: PC
OS: Windows NT
Status: NEW
Severity: enhancement
Priority: P
Component: Loop Optimizer
Assignee: unassignedbugs at nondot.org
Reporter: fabiang at radgametools.com
CC: llvm-bugs at lists.llvm.org
clang 6.0 with "-O2" on x86-64 produces an enormous amount of code for the
simple C function below, much of it dubious (encountered while investigating a
different bug):
void fancierRotate2(unsigned int *arr, const bool *control, int count,
                    int rot0, int rot1)
{
    for (int i = 0; i < count; ++i)
    {
        int rot = control[i] ? rot1 : rot0;
        arr[i] = (arr[i] << (rot & 31)) | (arr[i] >> (-rot & 31));
    }
}
I won't post the (long) disassembly here, but here's a Compiler Explorer link:
https://godbolt.org/g/ss4PXM
By contrast, with "-fno-vectorize", the inner loop gets unrolled 2x but is
still short enough to paste (nitpick: why no CMOVs?, but otherwise OK):
.LBB0_9:                                # =>This Inner Loop Header: Depth=1
        cmpb    $0, (%rsi,%rdx)
        movl    %eax, %ecx
        je      .LBB0_11
        movl    %r8d, %ecx
.LBB0_11:                               # in Loop: Header=BB0_9 Depth=1
        roll    %cl, (%rdi,%rdx,4)
        cmpb    $0, 1(%rsi,%rdx)
        movl    %eax, %ecx
        je      .LBB0_13
        movl    %r8d, %ecx
.LBB0_13:                               # in Loop: Header=BB0_9 Depth=1
        roll    %cl, 4(%rdi,%rdx,4)
        addq    $2, %rdx
        cmpq    %rdx, %r10
        jne     .LBB0_9
There are several issues at play in that snippet (which I'll try to file as
separate bugs), but first and foremost, the profitability heuristic seems way
off here. Purely going by dynamic instruction count (and ignoring uop counts
and macro-fusion), the non-vectorized version spends around 12 instructions to
process every 2 items, whereas the vectorized version spends 90 for 8. The
scalar version as-is (without CMOVs) might run into frequent mispredicted
branches depending on "control", but purely going by the amount of code-size
blow-up, this seems questionable. (If nothing else, *both* vectorizing 4-wide
and unrolling the result 2x seems a tad much.)
It feels like the vectorizer isn't accounting for the fact that vectorizing
per-lane variable shifts turns into quite a production on pre-AVX2 x86.