<div dir="ltr">I'm trying to get the Multiprecision Arithmetic Builtins to produce code as efficient as that generated for wider integer types.<div><br></div><div>Firstly, I've defined some typedefs:</div><div><br></div><div><div>typedef unsigned long long unsigned_word;</div><div>typedef __uint128_t unsigned_128;</div><div><br></div><div>And a result type that carries two words:</div><div><br></div><div>struct Result</div><div>{</div><div>  unsigned_word lo;</div><div>  unsigned_word hi;</div><div>};</div></div><div><br></div><div>Then I've defined two functions that should be functionally equivalent. Each takes four words, the low and high halves of two 128-bit values, adds the two values and returns the result. Here's the first:</div><div><br></div><div><div>Result f (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)</div><div>{</div><div>  Result x;</div><div>  unsigned_128 n1 = lo1 + (static_cast<unsigned_128>(hi1) << 64);</div><div>  unsigned_128 n2 = lo2 + (static_cast<unsigned_128>(hi2) << 64);</div><div>  unsigned_128 r1 = n1 + n2;</div><div>  x.lo = r1 & ((static_cast<unsigned_128>(1) << 64) - 1);</div><div>  x.hi = r1 >> 64;</div><div>  return x;</div><div>}</div></div><div><br></div><div>This inlines nicely at a high optimisation level and produces the following very nice assembly on x86-64:</div><div><br></div><div><div>movq<span class="" style="white-space:pre">      </span>8(%rsp), %rsi</div><div>movq<span class="" style="white-space:pre">  </span>(%rsp), %rbx</div><div>addq<span class="" style="white-space:pre">   </span>24(%rsp), %rsi</div><div>adcq<span class="" style="white-space:pre"> </span>16(%rsp), %rbx</div></div><div><br></div><div>But then I attempted to do the same thing with the multiprecision builtins:</div><div><br></div><div><div>Result g (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2, unsigned_word hi2)</div><div>{</div><div>  Result x;</div><div>  unsigned_word carryout;</div><div>  x.lo = __builtin_addcll(lo1, lo2, 0, 
&carryout);</div><div>  x.hi = __builtin_addcll(hi1, hi2, carryout, &carryout); // the final carry out is discarded, as in f</div><div>  return x;</div><div>}</div></div><div><br></div><div>The code above is simpler, but produces worse assembly:</div><div><br></div><div><div>movq    24(%rsp), %rsi</div><div>movq    (%rsp), %rbx</div><div>addq    16(%rsp), %rbx // Line 1</div><div>addq    8(%rsp), %rsi</div><div>adcq    $0, %rbx // Line 2</div></div><div><br></div><div>Notice the additional adc of 0: Line 1 could be removed and Line 2 replaced with:</div><div><br></div><div><div>adcq    16(%rsp), %rbx</div></div><div><br></div><div>The code generated for the multiprecision builtins gets even worse above 128 bits: instead of compiling to a chain of "adc"s, there's a mix of "or"s etc. to save and pass on the carries.</div><div><br></div><div>So it seems that the "multiprecision builtins" are worse at multiprecision arithmetic than complex bit-fiddling into larger types. However, the complex bit-fiddling doesn't generalise well, so I'm wondering if someone can show me how to use these builtins to produce an efficient "adc" chain (as I presume they are intended to).</div></div>
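<div><br></div><div>For reference, here is a minimal sketch of the kind of generalised chain I have in mind for more than two words. The names add_with_carry and add_limbs are hypothetical, made up for this illustration, and the fallback branch is an assumption that only exists so the sketch still builds on compilers without __builtin_addcll. Whether the loop actually compiles to a chain of "adc"s will depend on the compiler and version; this only shows the intended semantics.</div><div><br></div>

```cpp
#include <cstddef>

typedef unsigned long long unsigned_word;

// Compilers without __has_builtin are treated as lacking the builtin.
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

// Adds one pair of words plus an incoming carry (0 or 1).
// Hypothetical helper, not part of any library.
static inline unsigned_word add_with_carry(unsigned_word a, unsigned_word b,
                                           unsigned_word carry_in,
                                           unsigned_word *carry_out)
{
#if __has_builtin(__builtin_addcll)
  return __builtin_addcll(a, b, carry_in, carry_out);
#else
  // Assumed portable fallback: detect unsigned wrap-around by comparison.
  unsigned_word sum = a + b;
  unsigned_word c1 = sum < a;          // carry from a + b
  unsigned_word res = sum + carry_in;
  unsigned_word c2 = res < sum;        // carry from adding carry_in
  *carry_out = c1 | c2;                // at most one of the two can be set
  return res;
#endif
}

// Adds two n-word numbers stored least significant word first.
// Returns the carry out of the most significant word (0 or 1).
static inline unsigned_word add_limbs(const unsigned_word *a,
                                      const unsigned_word *b,
                                      unsigned_word *out, std::size_t n)
{
  unsigned_word carry = 0;
  for (std::size_t i = 0; i < n; ++i)
    out[i] = add_with_carry(a[i], b[i], carry, &carry);
  return carry;
}
```

<div><br></div><div>With n = 2 this computes the same low and high words as f and g above, and the returned value is the carry that both of those functions discard.</div>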