<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - [x86] Failure to use both div and mod results of one IDIV in a prime-factor loop while(n%i==0) { n/=i; }"

   href="https://bugs.llvm.org/show_bug.cgi?id=37101">37101</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>[x86] Failure to use both div and mod results of one IDIV in a prime-factor loop while(n%i==0) { n/=i; }

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>new-bugs

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>trunk

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Linux

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Keywords</th>

          <td>performance

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>new bugs

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>peter@cordes.ca

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>Near-identical code-gen to <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85366">https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85366</a>

for this loop.  From

<a href="https://codereview.stackexchange.com/questions/191792/find-prime-factors-in-c/191801#191801">https://codereview.stackexchange.com/questions/191792/find-prime-factors-in-c/191801#191801</a>,

simplified to use a pointer instead of returning std::vector<int>:

void find_prime_factors_ptr(int n, int *p)

{

    // inefficient to test even numbers > 2, but that's a separate missed

optimization.

    for (int i = 2; i <= n; i++) {

        while (n % i == 0) {

            *p++ = i;

            n /= i;   // reordering the loop body doesn't help

        }

    }

}

<a href="https://godbolt.org/g/GLycMM">https://godbolt.org/g/GLycMM</a>  

clang version 7.0.0 (trunk 329780) 

find_prime_factors_ptr(int, int*):

        movl    %edi, %ecx

        cmpl    $2, %edi

        jl      .LBB0_6

        movl    $2, %edi             # why not keep n in EDI, and use ECX for

i?

.LBB0_2:                                # =>This Loop Header: Depth=1

        movl    %ecx, %eax

        cltd

        idivl   %edi

        testl   %edx, %edx

        jne     .LBB0_5

.LBB0_3:                                #   Parent Loop BB0_2 Depth=1

        movl    %edi, (%rsi)

        movl    %ecx, %eax

        cltd

        idivl   %edi

        movl    %eax, %ecx

        cltd

        idivl   %edi

        addq    $4, %rsi

        testl   %edx, %edx

        je      .LBB0_3

.LBB0_5:                                #   in Loop: Header=BB0_2 Depth=1

        leal    1(%rdi), %eax    # this compare-latency optimization 

        cmpl    %ecx, %edi       # doesn't seem worth it

        movl    %eax, %edi       # 4 uops instead of 2 from defeating

macro-fusion

        jl      .LBB0_2          # vs. inc %rdi / cmp %ecx,%edi / jle

.LBB0_6:

        retq

Using idiv twice inside the loop body is totally pointless.  At either way of

reaching .LBB0_3 (fall through or loop), n/i is in EAX, but LLVM ignores it.

The inner loop can be optimized by dropping movl %ecx, %eax / cltd / idiv with

no other changes.

.LBB0_3:

        movl    %edi, (%rsi)

        addq    $4, %rsi

        movl    %eax, %ecx    # n = tmp

        cltd

        idivl   %edi          # eax = tmp = n/i

        testl   %edx, %edx

        je      .LBB0_3

Unlike for gcc, changing the inner loop like this didn't help:

        int tmp;

        while (tmp = n/i, n % i == 0) {

            *p++ = i;

            n = tmp;

        }

(see the linked gcc bug report for an interesting version that jumps into the

inner loop instead of hoisting the run-zero-times check out of the loop before

the first iteration.)

I think one of the versions on the godbolt link did manage to optimize better

with clang, but I haven't investigated further.  (I think it was one of the

versions using `vector.push_back` inside the loop.)</pre>

        </div>

      </p>

      <hr>

      <span>You are receiving this mail because:</span>

      <ul>

          <li>You are on the CC list for the bug.</li>

      </ul>

    </body>

</html>