[llvm-bugs] [Bug 46773] New: Performance regression for sqrt, sqrtf, sqrtl in for loop

Sat Jul 18 09:41:59 PDT 2020

https://bugs.llvm.org/show_bug.cgi?id=46773

            Bug ID: 46773
           Summary: Performance regression for sqrt, sqrtf, sqrtl in for
                    loop
           Product: new-bugs
           Version: 10.0
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: new bugs
          Assignee: unassignedbugs at nondot.org
          Reporter: joshstockin at gmail.com
                CC: htmldeveloper at gmail.com, llvm-bugs at lists.llvm.org

clang version 10.0.0-4ubuntu1 
    Target: x86_64-pc-linux-gnu
    Thread model: posix

This might actually be a twofer.

Consider the following function, which calculates the number of primes up to
10,000,000:

    void primes() {
        uint32_t NUM = 10000000;
        uint32_t total = 1;
        for (uint32_t N = 3; N < NUM; N += 2) {
            uint32_t prime = 1;
            for (uint32_t f = 3; f <= (uint32_t)sqrtf(N); f += 2) {
                if (N % f == 0) {
                    prime = 0;
                    break;
                }
            }
            total += prime;
        }
        printf("Total primes: %d\n", total);
    }

Note the for loop with sqrt (this also applies to sqrtf and sqrtl):

    for (uint32_t f = 3; f <= (uint32_t)sqrt(N); f += 2)

First optimization bug: LLVM doesn't seem to recognize sqrt* as pure functions,
and so doesn't pull them out of the condition--instead, the sqrt operation is
called repeatedly for every comparison, tremendously slowing down the program.
(This was compiled with -O3.)

; This source compiled including the sqrt() function
; There are similar results for other sqrt functions
    .L19:
            mov     eax, ebx
            xor     edx, edx
            div     r12d
            test    edx, edx                ; N % f == 0
            je      .L4                     ; true, jump
            movsd   xmm1, QWORD PTR [rsp+8] ; false, continue loop
            add     r12d, 2                 ; f += 2
    .L5:
            movapd  xmm0, xmm1              ; Why do this block again?
            movsd   QWORD PTR [rsp+8], xmm1 ; N hasn't changed, and sqrt*
            call    sqrt                    ; are pure functions.
            pxor    xmm2, xmm2              ; This makes the executable
            cvttsd2si       rax, xmm0       ; ~2x slower than gcc -O3's.
            cmp     r12d, eax
            jbe     .L19

You may have noticed the second optimization bug: the sqrt* call here isn't
optimized down to the processor's sqrt instruction. (These are sqrtsd for
sqrt() and sqrtss for sqrtf(). It doesn't look like there's an instruction for
sqrtl(), but if there is, LLVM should optimize to it.) Also note that there are
extraneous sqrt* calls/instructions in the full compiled output, for whatever
reasons LLVM's optimization engine has, but they're inconsistent. In the same
compiled output there are mixed function calls and instructions for the same
operation.

Some speed benchmarks for emphasis:

    $ time ./primes-clangO3  => 0m5.298s real
    Total primes: 664579

    $ time ./primes-clangO0  => 0m5.701s "
    Total primes: 644579

    $ time ./primes-gccO3    => 0m2.872s "
    Total primes: 664579

    $ time ./primes-gccO0    => 0m5.703s "
    Total primes: 664579

I hope there's an easy fix here.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-bugs/attachments/20200718/69a53277/attachment.html>