<table border="1" cellspacing="0" cellpadding="8">

    <tr>

        <th>Issue</th>

        <td>

            <a href=https://github.com/llvm/llvm-project/issues/83030>83030</a>

        </td>

    </tr>

    <tr>

        <th>Summary</th>

        <td>

            Multi-precision addition using `__int128`/`i128` doesn't produce optimal code

        </td>

    </tr>

    <tr>

      <th>Labels</th>

      <td>

            new issue

      </td>

    </tr>

    <tr>

      <th>Assignees</th>

      <td>

      </td>

    </tr>

    <tr>

      <th>Reporter</th>

      <td>

          Eisenwave

      </td>

    </tr>

</table>

<pre>

    https://godbolt.org/z/vovWPxexE

## Expected output

Firstly, the expected "canonical" output which we got both for `_BitInt(512)` additions and `__builtin_addcll` looks as follows:

```cpp

using u128 = unsigned __int128;

using u64 = unsigned long long;

void add_adc(u64* __restrict dest, const u64* a, const u64* b)

{

    u64 carry = 0;

    for (int i = 0; i < 8; ++i) {

 dest[i] = __builtin_addcll(a[i], b[i], carry, &carry);

 }

}

```

```asm

add_adc(unsigned long long*, unsigned long long const*, unsigned long long const*): # @add_adc(unsigned long long*, unsigned long long const*, unsigned long long const*)

  mov rax, qword ptr [rsi]

  add rax, qword ptr [rdx]

  mov qword ptr [rdi], rax

  mov rax, qword ptr [rsi + 8]

  adc rax, qword ptr [rdx + 8]

  mov qword ptr [rdi + 8], rax

  mov rax, qword ptr [rsi + 16]

  adc rax, qword ptr [rdx + 16]

  mov qword ptr [rdi + 16], rax

  mov rax, qword ptr [rsi + 24]

  adc rax, qword ptr [rdx + 24]

  mov qword ptr [rdi + 24], rax

 mov rax, qword ptr [rsi + 32]

  adc rax, qword ptr [rdx + 32]

  mov qword ptr [rdi + 32], rax

  mov rax, qword ptr [rsi + 40]

  adc rax, qword ptr [rdx + 40]

  mov qword ptr [rdi + 40], rax

  mov rax, qword ptr [rsi + 48]

  adc rax, qword ptr [rdx + 48]

  mov qword ptr [rdi + 48], rax

  mov rax, qword ptr [rsi + 56]

  adc rax, qword ptr [rdx + 56]

  mov qword ptr [rdi + 56], rax

  ret

```

We get one `adc` per loop iteration, which is the theoretical optimum here.

## Actual output for 128-bit integers

When attempting to get the same codegen with 128-bit addition, we don't get the optimum:

```cpp

void add_128(u64* __restrict dest, const u64* a, const u64* b)

{

    u64 carry = 0;

    for (int i = 0; i < 8; ++i) {

        auto sum = u128(carry) + a[i] + b[i];

        dest[i] = sum;

        carry = sum >> 64;

    }

}

```

This compiles to:

```asm

add_128(unsigned long long*, unsigned long long const*, unsigned long long const*):

  mov rax, qword ptr [rsi]

  add rax, qword ptr [rdx]

  mov qword ptr [rdi], rax

  mov rax, qword ptr [rsi + 8]

 adc rax, 0

  setb cl

  movzx ecx, cl

  add rax, qword ptr [rdx + 8]

  mov qword ptr [rdi + 8], rax

  adc rcx, qword ptr [rsi + 16]

 setb al

  movzx eax, al

  add rcx, qword ptr [rdx + 16]

  mov qword ptr [rdi + 16], rcx

  adc rax, qword ptr [rsi + 24]

  setb cl

 movzx ecx, cl

  add rax, qword ptr [rdx + 24]

  mov qword ptr [rdi + 24], rax

  adc rcx, qword ptr [rsi + 32]

  setb al

  movzx eax, al

  add rcx, qword ptr [rdx + 32]

  mov qword ptr [rdi + 32], rcx

 adc rax, qword ptr [rsi + 40]

  setb cl

  movzx ecx, cl

  add rax, qword ptr [rdx + 40]

  mov qword ptr [rdi + 40], rax

  adc rcx, qword ptr [rsi + 48]

  setb al

  movzx eax, al

  add rcx, qword ptr [rdx + 48]

  mov qword ptr [rdi + 48], rcx

  adc rax, qword ptr [rsi + 56]

 add rax, qword ptr [rdx + 56]

  mov qword ptr [rdi + 56], rax

 ret

```

LLVM emits `adc` here, however, two per loop iteration.

Basically, `u128(carry) + a[i] + b[i]` turns into:

```asm

  adc rcx, qword ptr [rsi + 16] ; TMP = u128(carry) + a[i]

  setb al                       ; carry = CARRY_FLAG

  movzx eax, al                 ; carry = ZERO_EXTEND(carry)

  add rcx, qword ptr [rdx + 16] ; SUM = TMP + b[i]

  mov qword ptr [rdi + 16], rcx ; dest[i] = SUM

  adc rax, qword ptr [rsi + 24] ; TMP = u128(carry) + a[i]

; ...

```

Well, it doesn't quite translate into that and it's a bit interleaved, but overall, we need two additions per iteration and instead of keeping the carry in the carry CPU flag at all times, it's being "spilled" into a register.

## Relevant passes

Prior to x86 instruction selection, the `adc` variant still uses two additions per iteration:

```llvm

  %8 = tail call { i64, i1 } @llvm.uadd.with.overflow.i64(i64 %6, i64 %7)

  %9 = extractvalue { i64, i1 } %8, 1

  %10 = extractvalue { i64, i1 } %8, 0

  %11 = tail call { i64, i1 } @llvm.uadd.with.overflow.i64(i64 %10, i64 %5)

  %12 = extractvalue { i64, i1 } %11, 1

  %13 = extractvalue { i64, i1 } %11, 0

```

x86-isel combines these into

```llvm

  %4:gr64 = ADD64rm %3:gr64(tied-def 0), %2:gr64, 1, $noreg, 0, $noreg, implicit-def $eflags :: (load (s64) from %ir.b); example.cpp:18:19

```

The 128-bit variant prior to instruction selection looks like this, per iteration:

```llvm

 %conv.1 = zext i1 %ov to i128

  %arrayidx.1 = getelementptr inbounds i8, ptr %a, i64 8

  %3 = load i64, ptr %arrayidx.1, align 8

  %conv1.1 = zext i64 %3 to i128

  %add.1 = add nuw nsw i128 %conv1.1, %conv.1

 %arrayidx3.1 = getelementptr inbounds i8, ptr %b, i64 8

  %4 = load i64, ptr %arrayidx3.1, align 8

  %conv4.1 = zext i64 %4 to i128

```

Presumably, the `i128` addition would already need to be broken up into separate `uadd.with.overflow` calls prior to instruction selection to get the optimal codegen.

</pre>

<img width="1px" height="1px" alt="" src="http://email.email.llvm.org/o/eJzUWU9z4joS_zTi0hXKlv9gDhzIJGxt1czu1LyZnd29ULLVgHaE5SfJQObTb0k2YBMS4rx5h5dKgZG6W79utfRrycwYsS4RZyS5J8nDiNV2o_TsURgs92yHo1zxp9nG2sqQaE7ogtDFWvFcSTtWek3o4iehi53aff98wMMjCR5IMG8_aURoBI-HCguLHFRtq9p2JRZCGyufCP0AdoOAR0lCacFKVYqCSUJpqwn7jSg2sEdYKwu5shtYKQ0kDZb3wv69tIRmSUgJnZI0AMa5sEKVBljJvdAyr4W0olwyzgspnZBU6ocBZmClpFR772KDLw2a_6KqmpbaiHINdUgzINED1KWPG4flUpQ2pBmJ7nuCadyXk6pc-4-TYPO5U4I7sEvGC0KzOo0JncNyqdFYLQoLHI11ESpUaSy0_exZS-78buxO2gEAwOMomNZPHk1wGtz1-ejRTJQWxKnbP34A5w8Qek_ovSB0CmejHk9yL0jy4JWexZVmrO13IPPOswfiHghN2-fpGRGZPBwdeLiYhYufzGyblk7cnseZzt1QzzuauL2le0qiObgcJnHwpw_VTstW7UCzg1P4fa80h8pqIMm9Nj6KrRTj_LoUP3SknK2L7uNUOOXbI7oMgKw3bvHSuM9kr41-FhqGIUwHgOgJv4iikRoGg8YDYPSEX4TRSPVg3EIR0QEoesIvomikhgUjDgbA6Am_CKORGghjSH7Gb0rQ-B0ZmgzJ0ORNGZpcyVCN9ure-B1hjRZUiY7r3EaVBlChdhRXgbComWNDZ62hUWE859oNKo3WMS2oyoptvYUNahxfYfJ5YWsn1rCxY4-QZne5sCBKi2vUpqv0fYMlMGtxW1nHiFZ5hG5Qw7YIheK4xhL2wm5Oho6s7XEicPc0sSfFFuErNH2iU0fJfwU6bf9YbRWYetsUDQ34I0X6ZGAnzqX3Z1KNLsxckrNxwbqQOeNvxnsk0SOkcU_uJhd_3QgDhdpWQqIBq55PSY-k29n4s0j6L0GenU0hOKobtDkUsmPt5wGw8ELn5lcA_yHW9YiKt7GuR8oukTagWB_pNYvvpObicHNLvUbN_bC-M6rv5u-bce1x8q8J7FCaPwX2Vlx7xP1r0vX9tcCtwPbo_dcEdmjF8PaM7RYBt0L27oLhpXrh48d_fQLcCms69YIjfqe-UXvcofbn8r26Ukb0qoN7Zlz50JzjSRq8nb3SAGytS-Pqh9cY5I07FTiS_frp800OvcgQuP7nrJ258sP8y5f_LBcf5397IaFu6P_38cs_l4___vr4j4cOrEH7prf427dP3p73sxvMIZuqt3RZK_z27dOgzXZguJ30eDx-oXqV0g0mLHCFpin6fq-FRbCalUYyiz5JwG6Y9Tc6whI6McDgWH9qiWyH3N851BbUDjVrrO4RSkTuc_l8LeSy-pTQjcnSWGQc1Ap-IFa-aN1gO4ei7Pz48PkbrCRbgwMjJVixRdM44FHl6JQJpaYSUjpQtIHPQONaGIv6WoH9BSXuWGmhYsZgr5r-rIXSroY-ZKkHquvCAzcosTiWzA7heTnvmBbOmrFCSqiNK9ReDsHz9Sfl7rQACU2aiy_LhAS33F0NC8JVyx9AhK5iBBJ7nXHNOB-7un7sZmEl1X7sBTORxs5S6nWa50lnFRCaTP0geLCaFXbHZI3XxqFJ5n6GHcUwGKIZdDXDX-ZYGHQ8S_qehfSNAMPwmW_RINXL67Lm5yFL74RB6ar2XJToz3-mWVavT3xMovlatxea84eHNNZb1x617YRmViC_47iCwDntb_gSeup23vi2uFQa1w3GiwaxraQohPVWCI3RrS8DLiv9JVwmFePu2ziLU1hp5UEIPc6ba0TAA9tWEsfuMBjNw8x9TK_fJLZnGDwdPI9rpTqus6trrL0wluKHOzkLv-TfvogITQpV7sZNvv3Eg_VTRxO18yO6LfQcdaY1exL80Iqv0aLELZbW7cWizFVdcgPCp7PfnmnCjtnXtdPkjo9emy9H6dMADYGJddlTdFjDHtgmr6NrYDlvJR2TlfUeSrP3Qh1LbV40MTjH5IgjGuBpfs3T-Jan0WuuxldcjfuuXmTRZ42m3rL8_AaDpIGX7rx_gL2qJQcmNTL-1BKRghwh1-oHllBXDTUYrJh2POdKqGf7jDPpNidzK0M71yz-toTJ403LeMRnEZ9GUzbCWTgJsmmaTGgy2syydBUEeRJjPo15lE94nE1xyqZpGPKYBvlIzGhA44DSNExjGiTjSRRNshWdJClfZVjkJA5wy4Qc-21S6fVIGFPjLIuCKBhJlqM0_hUTpSXuwXcS6k4jIz1zOnd5vTZumxXGmrMVK6zE2adaWnFXaSyE8WR9jG3zrsW_3GnfwqQBoYvONJzLiUorXhf9qIxqLS9fbgm7qfNxobaELvzibb7uKq3-h4UldOHRG0IX3rv_BwAA__-p1de4">