<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - Performance regression for sqrt, sqrtf, sqrtl in for loop"

   href="https://bugs.llvm.org/show_bug.cgi?id=46773">46773</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>Performance regression for sqrt, sqrtf, sqrtl in for loop

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>new-bugs

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>10.0

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Linux

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>new bugs

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>joshstockin@gmail.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>htmldeveloper@gmail.com, llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>clang version 10.0.0-4ubuntu1 

    Target: x86_64-pc-linux-gnu

    Thread model: posix

This might actually be a twofer.

Consider the following function, which calculates the number of primes up to

10,000,000:

    void primes() {

        uint32_t NUM = 10000000;

        uint32_t total = 1;

        for (uint32_t N = 3; N < NUM; N += 2) {

            uint32_t prime = 1;

            for (uint32_t f = 3; f <= (uint32_t)sqrtf(N); f += 2) {

                if (N % f == 0) {

                    prime = 0;

                    break;

                }

            }

            total += prime;

        }

        printf("Total primes: %d\n", total);

    }

Note the for loop with sqrt (this also applies to sqrtf and sqrtl):

    for (uint32_t f = 3; f <= (uint32_t)sqrt(N); f += 2)

First optimization bug: LLVM doesn't seem to recognize sqrt* as pure functions,

and so doesn't pull them out of the condition--instead, the sqrt operation is

called repeatedly for every comparison, tremendously slowing down the program.

(This was compiled with -O3.)

; This source compiled including the sqrt() function

; There are similar results for other sqrt functions

    .L19:

            mov     eax, ebx

            xor     edx, edx

            div     r12d

            test    edx, edx                ; N % f == 0

            je      .L4                     ; true, jump

            movsd   xmm1, QWORD PTR [rsp+8] ; false, continue loop

            add     r12d, 2                 ; f += 2

    .L5:

            movapd  xmm0, xmm1              ; Why do this block again?

            movsd   QWORD PTR [rsp+8], xmm1 ; N hasn't changed, and sqrt*

            call    sqrt                    ; are pure functions.

            pxor    xmm2, xmm2              ; This makes the executable

            cvttsd2si       rax, xmm0       ; ~2x slower than gcc -O3's.

            cmp     r12d, eax

            jbe     .L19

You may have noticed the second optimization bug: the sqrt* call here isn't

optimized down to the processor's sqrt instruction. (These are sqrtsd for

sqrt() and sqrtss for sqrtf(). It doesn't look like there's an instruction for

sqrtl(), but if there is, LLVM should optimize to it.) Also note that there are

extraneous sqrt* calls/instructions in the full compiled output, for whatever

reasons LLVM's optimization engine has, but they're inconsistent. In the same

compiled output there are mixed function calls and instructions for the same

operation.

Some speed benchmarks for emphasis:

    $ time ./primes-clangO3  => 0m5.298s real

    Total primes: 664579

    $ time ./primes-clangO0  => 0m5.701s "

    Total primes: 644579

    $ time ./primes-gccO3    => 0m2.872s "

    Total primes: 664579

    $ time ./primes-gccO0    => 0m5.703s "

    Total primes: 664579

I hope there's an easy fix here.</pre>

        </div>

      </p>

      <hr>

      <span>You are receiving this mail because:</span>

      <ul>

          <li>You are on the CC list for the bug.</li>

      </ul>

    </body>

</html>