<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - Missed optimization : Loop unrolling causes inefficient use of `adc` as compared to loop rolling"

   href="https://bugs.llvm.org/show_bug.cgi?id=44460">44460</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>Missed optimization : Loop unrolling causes inefficient use of `adc` as compared to loop rolling

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>new-bugs

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>9.0

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Linux

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>new bugs

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>madhur4127@gmail.com

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>htmldeveloper@gmail.com, llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>Consider emulating 192-bit integer using a 128-bit integer and a 64-bit

integer. In the code sample this emulated integer is used to compute dot

product of two uint64_t vectors of length N. 

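    #include <cstdint>

    // compute the dot product of two vectors in an emulated 192-bit integer

    using u128 = unsigned __int128;

    const int N = 2048;

    uint64_t a[N], b[N];

    u128 sum = 0;            // low 128 bits of the accumulator

    uint64_t overflow = 0;   // high 64 bits: counts carries out of `sum`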

    for(int i=0;i<N;++i){

        u128 prod = (u128) a[i] * (u128) b[i];

        sum += prod;

        // gcc branches, clang just uses: adc overflow, 0

        overflow += sum<prod;

    }

To detect overflow of the 128-bit sum and propagate the carry into

`overflow`, a single `adc` is enough. Clang emits this idiom when the loop

is kept rolled (-fno-unroll-loops):

clang++ -O3 -Wall -Wextra -march=broadwell -fno-unroll-loops

.LBB0_1:                                # =>This Inner Loop Header: Depth=1

        mov     rax, qword ptr [rsi + 8*rcx]

        mul     qword ptr [rdi + 8*rcx]

        add     r10, rax

        adc     r9, rdx

        adc     r11, 0                  # This is the efficient form

        inc     rcx

        cmp     rcx, 2048

        jne     .LBB0_1

        mov     qword ptr [r8], r11

        mov     rax, r10

        mov     rdx, r9

        ret
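
For reference, the source loop above can be wrapped in a self-contained
function. This is only a sketch: the struct `U192` and the function name
`dot192` are hypothetical and do not come from the godbolt source. It assumes
a compiler that provides `unsigned __int128` (GCC or Clang on a 64-bit
target).

```cpp
#include <cstddef>
#include <cstdint>

using u128 = unsigned __int128;

// Hypothetical 192-bit accumulator: low 128 bits in `lo`, top 64 bits in `hi`.
struct U192 {
    u128 lo;
    std::uint64_t hi;
};

// Dot product of two uint64_t arrays, accumulated in 192 bits.
U192 dot192(const std::uint64_t* a, const std::uint64_t* b, std::size_t n) {
    U192 acc{0, 0};
    for (std::size_t i = 0; i < n; ++i) {
        const u128 prod = static_cast<u128>(a[i]) * b[i];
        acc.lo += prod;
        // An unsigned addition wrapped exactly when the result is smaller
        // than one of its operands; that comparison is the carry-out.
        acc.hi += (acc.lo < prod);
    }
    return acc;
}
```

The comparison `acc.lo < prod` is the carry-out of the 128-bit addition, which
is what the compiler can fold into `adc reg, 0` in the rolled loop.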

------

But when the loop is unrolled, this efficient sequence degrades to

`mov; setb; movzx; add;` instead of just `adc reg, 0`.

clang++ -O3 -Wall -Wextra -march=broadwell  # -fno-unroll-loops is absent

.LBB0_1:                                # =>This Inner Loop Header: Depth=1

        mov     rax, qword ptr [rsi + 8*rbx]

        mov     r10, qword ptr [rsi + 8*rbx + 8]

        mul     qword ptr [rdi + 8*rbx]

        mov     r11, rdx

        mov     r14, rax

        add     r14, r9

        adc     r11, rcx

        setb    bpl

        mov     rax, r10

        mul     qword ptr [rdi + 8*rbx + 8]

        mov     rcx, rax

        mov     r9, rdx

        movzx   ebp, bpl

        add     rcx, r14

        adc     r9, r11

        adc     rbp, r15

        mov     rax, qword ptr [rsi + 8*rbx + 16]

        mul     qword ptr [rdi + 8*rbx + 16]

        mov     r10, rdx

        mov     r11, rax

        add     r11, rcx

        adc     r10, r9

        setb    cl

        mov     rax, qword ptr [rsi + 8*rbx + 24]

        mul     qword ptr [rdi + 8*rbx + 24]

        movzx   r15d, cl

        mov     r9, rax

        add     r9, r11

        mov     rcx, rdx

        adc     rcx, r10

        adc     r15, rbp

        add     rbx, 4

        cmp     rbx, 2048

        jne     .LBB0_1
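
One way to experiment with this is to spell out the carry chain limb by limb
instead of relying on the `sum<prod` comparison. The sketch below uses the
GCC/Clang builtin `__builtin_add_overflow`. The names `Limbs192` and
`dot192_limbs` are hypothetical. Whether this form avoids the `setb`/`movzx`
sequence under unrolling has not been verified here; it only shows the same
arithmetic with explicit carries.

```cpp
#include <cstddef>
#include <cstdint>

using u128 = unsigned __int128;

// Hypothetical three-limb accumulator, least significant limb first.
struct Limbs192 {
    std::uint64_t w0, w1, w2;
};

Limbs192 dot192_limbs(const std::uint64_t* a, const std::uint64_t* b,
                      std::size_t n) {
    Limbs192 acc{0, 0, 0};
    for (std::size_t i = 0; i < n; ++i) {
        const u128 prod = static_cast<u128>(a[i]) * b[i];
        const std::uint64_t lo = static_cast<std::uint64_t>(prod);
        // The high limb of a 64x64-bit product is at most 2^64 - 2, so
        // adding a single carry bit to it below cannot wrap.
        const std::uint64_t hi = static_cast<std::uint64_t>(prod >> 64);

        // Add the low limb and record the carry-out.
        const bool c0 = __builtin_add_overflow(acc.w0, lo, &acc.w0);
        // Add the high limb plus the incoming carry; record the carry-out.
        const bool c1 = __builtin_add_overflow(acc.w1, hi + c0, &acc.w1);
        // Propagate the final carry into the top limb.
        acc.w2 += c1;
    }
    return acc;
}
```

Here `w0` and `w1` together play the role of `sum`, and `w2` plays the role of
`overflow` in the original snippet.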

The complete source code is available at this godbolt link:

<a href="https://godbolt.org/z/tT7Z2H">https://godbolt.org/z/tT7Z2H</a>

The source of this discussion is the Stack Overflow Q&A:

<a href="https://stackoverflow.com/q/59575408/8199790">https://stackoverflow.com/q/59575408/8199790</a></pre>

        </div>

      </p>


    </body>

</html>