<html>
    <head>
      <base href="https://bugs.llvm.org/">
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW - Performance regression for sqrt, sqrtf, sqrtl in for loop"
   href="https://bugs.llvm.org/show_bug.cgi?id=46773">46773</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>Performance regression for sqrt, sqrtf, sqrtl in for loop
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>new-bugs
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>10.0
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>PC
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>Linux
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>enhancement
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>P
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>new bugs
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>unassignedbugs@nondot.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>joshstockin@gmail.com
          </td>
        </tr>

        <tr>
          <th>CC</th>
          <td>htmldeveloper@gmail.com, llvm-bugs@lists.llvm.org
          </td>
        </tr></table>
      <p>
        <div>
        <pre>clang version 10.0.0-4ubuntu1 
    Target: x86_64-pc-linux-gnu
    Thread model: posix

This might actually be a twofer.

Consider the following function, which calculates the number of primes up to
10,000,000:

    void primes() {
        uint32_t NUM = 10000000;
        uint32_t total = 1;
        for (uint32_t N = 3; N < NUM; N += 2) {
            uint32_t prime = 1;
            for (uint32_t f = 3; f <= (uint32_t)sqrtf(N); f += 2) {
                if (N % f == 0) {
                    prime = 0;
                    break;
                }
            }
            total += prime;
        }
        printf("Total primes: %d\n", total);
    }

Note the for loop with sqrt (this also applies to sqrtf and sqrtl):

    for (uint32_t f = 3; f <= (uint32_t)sqrt(N); f += 2)

First optimization bug: LLVM doesn't seem to recognize sqrt* as pure functions,
and so doesn't pull them out of the condition--instead, the sqrt operation is
called repeatedly for every comparison, tremendously slowing down the program.
(This was compiled with -O3.)

; This source compiled including the sqrt() function
; There are similar results for other sqrt functions
    .L19:
            mov     eax, ebx
            xor     edx, edx
            div     r12d
            test    edx, edx                ; N % f == 0
            je      .L4                     ; true, jump
            movsd   xmm1, QWORD PTR [rsp+8] ; false, continue loop
            add     r12d, 2                 ; f += 2
    .L5:
            movapd  xmm0, xmm1              ; Why do this block again?
            movsd   QWORD PTR [rsp+8], xmm1 ; N hasn't changed, and sqrt*
            call    sqrt                    ; are pure functions.
            pxor    xmm2, xmm2              ; This makes the executable
            cvttsd2si       rax, xmm0       ; ~2x slower than gcc -O3's.
            cmp     r12d, eax
            jbe     .L19

You may have noticed the second optimization bug: the sqrt* call here isn't
optimized down to the processor's sqrt instruction. (These are sqrtsd for
sqrt() and sqrtss for sqrtf(). It doesn't look like there's an instruction for
sqrtl(), but if there is, LLVM should optimize to it.) Also note that there are
extraneous sqrt* calls/instructions in the full compiled output, for whatever
reasons LLVM's optimization engine has, but they're inconsistent. In the same
compiled output there are mixed function calls and instructions for the same
operation.

Some speed benchmarks for emphasis:

    $ time ./primes-clangO3  => 0m5.298s real
    Total primes: 664579

    $ time ./primes-clangO0  => 0m5.701s "
    Total primes: 644579

    $ time ./primes-gccO3    => 0m2.872s "
    Total primes: 664579

    $ time ./primes-gccO0    => 0m5.703s "
    Total primes: 664579

I hope there's an easy fix here.</pre>
        </div>
      </p>


      <hr>
      <span>You are receiving this mail because:</span>

      <ul>
          <li>You are on the CC list for the bug.</li>
      </ul>
    </body>
</html>