<html>
    <head>
      <base href="https://bugs.llvm.org/">
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW - Performance regression 4.0 to 6.0 due to unrolling the first trip through an SSE2 ASCII validation loop"
   href="https://bugs.llvm.org/show_bug.cgi?id=37266">37266</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>Performance regression 4.0 to 6.0 due to unrolling the first trip through an SSE2 ASCII validation loop
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>new-bugs
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>6.0
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>PC
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>All
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>normal
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>P
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>new bugs
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>unassignedbugs@nondot.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>hsivonen@hsivonen.fi
          </td>
        </tr>

        <tr>
          <th>CC</th>
          <td>llvm-bugs@lists.llvm.org
          </td>
        </tr></table>
      <p>
        <div>
        <pre>Minimized test case: <a href="https://github.com/hsivonen/llvm_ascii_validation">https://github.com/hsivonen/llvm_ascii_validation</a> (this
assumes rustup, <a href="https://rustup.rs/">https://rustup.rs/</a>, is installed).

When rustc switched from LLVM 4.0 to LLVM 6.0, Firefox's Rust-based, SSE2-using
UTF-8 validation function regressed in performance by up to 12.5% (measured in
CI on EC2).

What changed is that LLVM 6.0 unrolls the first trip through the innermost
SSE2 ASCII validation loop (the first iteration is peeled off ahead of the
loop), whereas LLVM 4.0 did not unroll it and compiled the loop to its most
obvious form.

The basic block of the loop as produced by LLVM 4.0:

.LBB0_6:
        movdqu  (%rdi,%rax), %xmm0
        pmovmskb        %xmm0, %edx
        testl   %edx, %edx
        jne     .LBB0_7
        addq    $16, %rax
        cmpq    %rcx, %rax
        jbe     .LBB0_6
        jmp     .LBB0_2
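
A source-level loop of roughly the following shape corresponds to the LLVM 4.0
block: one unaligned 16-byte load (movdqu), pmovmskb to gather the high bit of
each byte, an early exit when any bit is set, and a bounds check at the bottom.
This is a hypothetical sketch using the std::arch::x86_64 SSE2 intrinsics, not
the code from the test-case repository; the name ascii_valid_up_to and the
scalar tail are assumptions.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8};

/// Hypothetical sketch: returns the index of the first non-ASCII byte,
/// or `bytes.len()` if every byte is ASCII.
#[cfg(target_arch = "x86_64")]
pub fn ascii_valid_up_to(bytes: &[u8]) -> usize {
    let len = bytes.len();
    let mut offset = 0usize;
    if len >= 16 {
        // Innermost SSE2 loop: one unaligned 16-byte load per trip.
        while offset <= len - 16 {
            // SAFETY: offset + 16 <= len, so the 16-byte read stays in
            // bounds; SSE2 is part of the x86_64 baseline.
            let mask = unsafe {
                let chunk = _mm_loadu_si128(bytes.as_ptr().add(offset) as *const __m128i);
                // pmovmskb: bit i is the high bit of byte i of the chunk.
                _mm_movemask_epi8(chunk)
            };
            if mask != 0 {
                // The lowest set bit marks the first non-ASCII byte.
                return offset + mask.trailing_zeros() as usize;
            }
            offset += 16;
        }
    }
    // Scalar tail for the final bytes that do not fill a 16-byte chunk.
    while offset < len {
        if bytes[offset] >= 0x80 {
            return offset;
        }
        offset += 1;
    }
    len
}

/// Portable fallback so the sketch also compiles on non-x86_64 targets.
#[cfg(not(target_arch = "x86_64"))]
pub fn ascii_valid_up_to(bytes: &[u8]) -> usize {
    bytes.iter().position(|&b| b >= 0x80).unwrap_or(bytes.len())
}
```

SSE2 is always available on x86_64, so no #[target_feature] attribute is
needed; the cfg(not(...)) fallback exists only so the sketch builds elsewhere.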

The unrolled first trip and the actual loop, as produced by LLVM 6.0:

        .cfi_startproc
        cmpq    $16, %rsi
        jb      .LBB0_1
        movdqu  (%rdi), %xmm0
        pmovmskb        %xmm0, %ecx
        testl   %ecx, %ecx
        je      .LBB0_10
        xorl    %esi, %esi
        testl   %ecx, %ecx
        je      .LBB0_7
.LBB0_8:
        bsfl    %ecx, %eax
        jmp     .LBB0_9
.LBB0_1:
        xorl    %eax, %eax
        cmpq    %rsi, %rax
        jb      .LBB0_13
.LBB0_15:
        movq    %rsi, %rax
        retq
.LBB0_10:
        leaq    -16(%rsi), %rdx
        movl    $16, %eax
        .p2align        4, 0x90
.LBB0_11:
        cmpq    %rdx, %rax
        ja      .LBB0_12
        movdqu  (%rdi,%rax), %xmm0
        pmovmskb        %xmm0, %ecx
        addq    $16, %rax
        testl   %ecx, %ecx
        je      .LBB0_11
        addq    $-16, %rax
        movq    %rax, %rsi
        testl   %ecx, %ecx
        jne     .LBB0_8
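
To make the LLVM 6.0 control flow easier to follow, here is a rough
source-level rendering of it. It is an interpretation of the listing above, not
code from the test case: the function names are hypothetical, and chunk_mask is
a portable stand-in for an SSE2 load followed by _mm_movemask_epi8, used only
so the sketch runs on any target.

```rust
/// Portable stand-in for pmovmskb on one 16-byte chunk:
/// bit i of the result is the high bit of byte `offset + i`.
fn chunk_mask(bytes: &[u8], offset: usize) -> u32 {
    let mut mask = 0u32;
    for (i, &b) in bytes[offset..offset + 16].iter().enumerate() {
        mask |= u32::from(b >> 7) << i;
    }
    mask
}

/// Hypothetical rendering of the peeled control flow. Returns the index of
/// the first non-ASCII byte, or `bytes.len()` if every byte is ASCII.
pub fn ascii_valid_up_to_peeled(bytes: &[u8]) -> usize {
    let len = bytes.len();
    let mut offset = 0;
    if len >= 16 {
        // Peeled first trip: no loop bounds check before this load,
        // because len >= 16 was just tested.
        let mut mask = chunk_mask(bytes, 0);
        if mask == 0 {
            // Remaining trips (.LBB0_11): bounds check first, then the
            // load and the mask test.
            offset = 16;
            while offset <= len - 16 {
                mask = chunk_mask(bytes, offset);
                if mask != 0 {
                    break;
                }
                offset += 16;
            }
        }
        if mask != 0 {
            // bsf: the lowest set bit marks the first non-ASCII byte.
            return offset + mask.trailing_zeros() as usize;
        }
    }
    // Scalar tail for the last bytes (not shown in the listing).
    offset + bytes[offset..].iter().position(|&b| b >= 0x80).unwrap_or(len - offset)
}
```

The result is the same as for a plain loop; only the code layout differs, with
the first 16-byte check duplicated ahead of the loop.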

The unrolling is already visible at the LLVM IR level. Both builds used
opt-level 2, which is what Firefox ships with.

# Benchmark results for the minimized test case

Rust 1.24.0 uses LLVM 4.0 and Rust 1.25.0 uses LLVM 6.0.

x86_64 code running on Intel(R) Xeon(R) CPU E5-2637 v4 @ 3.50GHz (Broadwell-EP)

Results are similar with both the powersave and performance CPU frequency
governors.

$ rustup default 1.24.0
$ ./bench.sh
[...]
test bench ... bench:   1,539,341 ns/iter (+/- 216,985)

$ rustup default 1.25.0
$ ./bench.sh
[...]
test bench ... bench:   1,865,801 ns/iter (+/- 22,297)

x86_64 code running on Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz (Haswell-DT)

$ rustup default 1.24.0
$ ./bench.sh
[...]
test bench ... bench:   1,491,560 ns/iter (+/- 65,163)

$ rustup default 1.25.0
$ ./bench.sh
[...]
test bench ... bench:   1,673,239 ns/iter (+/- 15,355)
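
The relative slowdown between the two toolchains follows from the mean ns/iter
figures above; the sketch below does that arithmetic (the struct and function
names are illustrative). It gives about 21.2% on the Broadwell machine and
about 12.2% on the Haswell machine. The Broadwell Rust 1.24.0 run has a spread
of +/- 216,985 ns, so that figure is the noisier of the two.

```rust
/// Mean ns/iter from the benchmark runs above.
struct Run {
    cpu: &'static str,
    rust_1_24_ns: f64, // LLVM 4.0
    rust_1_25_ns: f64, // LLVM 6.0
}

const RUNS: [Run; 2] = [
    Run { cpu: "Xeon E5-2637 v4 (Broadwell-EP)", rust_1_24_ns: 1_539_341.0, rust_1_25_ns: 1_865_801.0 },
    Run { cpu: "Core i7-4770 (Haswell-DT)", rust_1_24_ns: 1_491_560.0, rust_1_25_ns: 1_673_239.0 },
];

/// Percentage by which `new_ns` is slower than `old_ns`.
fn slowdown_percent(old_ns: f64, new_ns: f64) -> f64 {
    (new_ns - old_ns) / old_ns * 100.0
}

/// Prints one line per machine, e.g. the CPU name and its slowdown.
fn print_slowdowns() {
    for run in RUNS.iter() {
        let pct = slowdown_percent(run.rust_1_24_ns, run.rust_1_25_ns);
        println!("{}: {:.1}% slower with Rust 1.25.0", run.cpu, pct);
    }
}
```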

# Links to other bug databases

Rust: <a href="https://github.com/rust-lang/rust/issues/49873">https://github.com/rust-lang/rust/issues/49873</a>
Firefox: <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1451703">https://bugzilla.mozilla.org/show_bug.cgi?id=1451703</a></pre>
        </div>
      </p>


    </body>
</html>