<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - Loop vectorization generates poor code for simple integer loop"

   href="https://bugs.llvm.org/show_bug.cgi?id=37426">37426</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>Loop vectorization generates poor code for simple integer loop

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>6.0

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Windows NT

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Loop Optimizer

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>fabiang@radgametools.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>clang 6.0 with "-O2" on x86-64 produces an enormous amount of code for the

simple C function below, much of it dubious (encountered while investigating a

different bug):

void fancierRotate2(unsigned int *arr, const bool *control, int count, int rot0, int rot1)

{

    for (int i = 0; i < count; ++i)

    {

        int rot = control[i] ? rot1 : rot0;

        arr[i] = (arr[i] << (rot & 31)) | (arr[i] >> (-rot & 31));

    }

}

I won't post the (long) disassembly here, but here's a Compiler Explorer link:

<a href="https://godbolt.org/g/ss4PXM">https://godbolt.org/g/ss4PXM</a>

By contrast, with "-fno-vectorize", the inner loop gets unrolled 2x but is

still short enough to paste (nitpick: why no CMOVs? Otherwise it's OK):

.LBB0_9: # =>This Inner Loop Header: Depth=1

  cmpb $0, (%rsi,%rdx)

  movl %eax, %ecx

  je .LBB0_11

  movl %r8d, %ecx

.LBB0_11: # in Loop: Header=BB0_9 Depth=1

  roll %cl, (%rdi,%rdx,4)

  cmpb $0, 1(%rsi,%rdx)

  movl %eax, %ecx

  je .LBB0_13

  movl %r8d, %ecx

.LBB0_13: # in Loop: Header=BB0_9 Depth=1

  roll %cl, 4(%rdi,%rdx,4)

  addq $2, %rdx

  cmpq %rdx, %r10

  jne .LBB0_9

There are several issues at play in that snippet (which I'll try to file as

separate bugs), but first and foremost, the profitability heuristic seems way

off here. Purely going by dynamic instruction count (and ignoring uop counts

and macro-fusion), the non-vectorized version spends around 12 instructions to

process every 2 items (about 6 per item), whereas the vectorized version spends

90 for 8 (about 11 per item). The scalar version as-is (without CMOVs) might

run into frequent mispredicted branches depending on "control", but given the

amount of code-size blow-up, vectorizing here seems questionable. (If nothing

else, *both* vectorizing 4-wide and unrolling the result 2x seems a tad much.)
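
For reference, the branch-free scalar form mentioned above (roughly what a
CMOV-based lowering would compute) can be sketched in C. This is only an
illustration: the function name and the mask-based select are assumptions,
not taken from any compiler output, and _Bool stands in for bool so the
snippet needs no headers.

```c
/* Hypothetical branch-free variant of fancierRotate2 (illustrative only). */
void fancierRotate2_branchless(unsigned int *arr, const _Bool *control,
                               int count, int rot0, int rot1)
{
    for (int i = 0; i < count; ++i)
    {
        /* All ones when control[i] is true, all zeros otherwise. */
        unsigned int mask = 0u - (unsigned int)control[i];

        /* Select rot1 or rot0 without a branch. */
        unsigned int rot = ((unsigned int)rot1 & mask) |
                           ((unsigned int)rot0 & ~mask);

        /* Same rotate idiom as the original loop body. */
        arr[i] = (arr[i] << (rot & 31u)) | (arr[i] >> ((0u - rot) & 31u));
    }
}
```

With rot0 = 4 and rot1 = 1, an element holding 1 becomes 16 when its control
entry is false, and 0x80000000 becomes 1 when its control entry is true.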

It feels like the vectorizer isn't accounting for the fact that vectorizing

per-lane variable shifts turns into quite a production on pre-AVX2 x86.</pre>
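
        <pre>To illustrate why per-lane variable shifts are the sticking point: rot only
ever takes the two loop-invariant values rot0 and rot1, so one possible
source-level rewrite rotates every element by both amounts and then selects
per element. Each shift then uses a single count for all lanes, which SSE2
supports directly (e.g. PSLLD with a shared count), whereas per-lane counts
need AVX2 (VPSLLVD). This is a sketch under that assumption, not something
the vectorizer is known to do; the names rotl32 and fancierRotate2_select are
hypothetical, and _Bool stands in for bool so the snippet needs no headers.

```c
/* Hypothetical helper: rotate left by a loop-invariant amount. */
static unsigned int rotl32(unsigned int x, int rot)
{
    return (x << (rot & 31)) | (x >> (-rot & 31));
}

/* Hypothetical variant: two uniform rotates plus a per-element select. */
void fancierRotate2_select(unsigned int *arr, const _Bool *control,
                           int count, int rot0, int rot1)
{
    for (int i = 0; i < count; ++i)
    {
        unsigned int r0 = rotl32(arr[i], rot0); /* rotate amount shared by all i */
        unsigned int r1 = rotl32(arr[i], rot1); /* rotate amount shared by all i */
        arr[i] = control[i] ? r1 : r0;          /* per-element select */
    }
}
```

This does twice the shift work per element, so whether it is actually a win
would need to be measured.</pre>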

        </div>

      </p>


    </body>

</html>