<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - [X86_64] Excess movs in XXH64 loop"

   href="https://bugs.llvm.org/show_bug.cgi?id=42545">42545</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>[X86_64] Excess movs in XXH64 loop

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>libraries

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>All

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Backend: X86

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>husseydevin@gmail.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>craig.topper@gmail.com, llvm-bugs@lists.llvm.org, llvm-dev@redking.me.uk, spatel+llvm@rotateright.com

          </td>

        </tr></table>

      <p>

        <div>

        <pre>Clang seems to generate extra mov instructions in the XXH64 loop, significantly

hampering performance.

#include &lt;stdint.h&gt;

#include &lt;stddef.h&gt;

static const uint64_t PRIME64_1 = 11400714785074694791ULL;

uint64_t XXH64_mini(const uint64_t* input, size_t len)

{

    const uint64_t *limit = input + len;

    uint64_t v1 = 0;

    do {

        v1 += *input++ * PRIME64_1;

        v1  = (v1 << 31) | (v1 >> (64 - 31)); // rotate

        v1 *= PRIME64_1;

    } while (input<=limit);

    return v1;

}

This is a simplified version of XXH64's loop. 
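
For reference, the rotate line can be factored into a small helper. This is a
minimal sketch: the name rotl64 is hypothetical and not part of the reproducer,
and it assumes unsigned long long is 64 bits wide.

```c
/* Hypothetical helper, not part of the original reproducer.
 * Assumes unsigned long long is exactly 64 bits wide. */
static unsigned long long rotl64(unsigned long long x, unsigned r)
{
    /* Masking keeps both shift counts in 0..63, so r == 0 stays well defined. */
    return (x << (r & 63)) | (x >> ((64 - r) & 63));
}
```

With r fixed at 31 this is the same expression as the line marked "rotate"
above, which the listings below show compiled to a single rol.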

With -O3, I would expect something like this:

        uint64_t val;

        val  = *input;

        val *= PRIME64_1;

        acc += val;

        acc  = (acc << 31) | (acc >> (64-31));

        acc *= PRIME64_1;

XXH64_mini:

        movabs  rcx, -7046029288634856825

        lea     rdx, [rdi + 8*rsi]

        xor     eax, eax

.LBB0_1:

        mov     rsi, qword ptr [rdi]

        imul    rsi, rcx

        add     rdi, 8

        add     rax, rsi

        rol     rax, 31

        imul    rax, rcx

        cmp     rdi, rdx

        jbe     .LBB0_1

        ret

However, Clang swaps the add instruction's operands, resulting in an extra mov:

    uint64_t val;

    val  = *input;

    val *= PRIME64_1;

    val += acc;

    val  = (val << 31) | (val >> (64-31));

    acc  = val;

    acc *= PRIME64_1;

XXH64_mini:

        movabs  rcx, -7046029288634856825

        lea     rdx, [rdi + 8*rsi]

        xor     eax, eax

.LBB0_1:

        mov     rsi, qword ptr [rdi]

        imul    rsi, rcx

        add     rdi, 8

        add     rsi, rax   # <<<

        rol     rsi, 31    # <<<

        mov     rax, rsi   # <<<

        imul    rax, rcx

        cmp     rdi, rdx

        jbe     .LBB0_1

        ret
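
As a sanity check, the two per-iteration sequences can be written as plain C
functions and compared. This is a minimal sketch: the names u64, step_expected
and step_observed are hypothetical and not part of the reproducer, and it
assumes unsigned long long is 64 bits wide.

```c
/* Hypothetical names; each function restates one of the two
 * per-iteration sequences shown in this report. */
typedef unsigned long long u64;  /* assumed to be 64 bits wide */

static const u64 PRIME64_1 = 11400714785074694791ULL;

/* Expected form: the sum accumulates in acc, so no copy is needed. */
static u64 step_expected(u64 acc, u64 val)
{
    val *= PRIME64_1;
    acc += val;
    acc  = (acc << 31) | (acc >> (64 - 31));
    acc *= PRIME64_1;
    return acc;
}

/* Form matching Clang's output: the sum lands in val and is
 * then copied back into acc before the final multiply. */
static u64 step_observed(u64 acc, u64 val)
{
    val *= PRIME64_1;
    val += acc;
    val  = (val << 31) | (val >> (64 - 31));
    acc  = val;
    acc *= PRIME64_1;
    return acc;
}
```

Both functions should return the same value for any inputs, since addition is
commutative. The only difference is which register holds the sum, and that is
where the extra mov comes from.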

GCC and MSVC both seem to emit the proper code, as seen here:

<a href="https://godbolt.org/z/wu7jDY">https://godbolt.org/z/wu7jDY</a>

It appears that all versions of LLVM do this.

I can't seem to figure out how to simplify this test case any further.</pre>

        </div>

      </p>


    </body>

</html>