[cfe-users] Using Multiprecision Arithmetic Builtins effectively

Fri Nov 13 03:00:26 PST 2015

I'm trying to get the Multiprecision Arithmetic Builtins producing code as
effective as longer integer types.

Firstly, I've defined some typedefs:

typedef unsigned long long unsigned_word;
typedef __uint128_t unsigned_128;

And a result type, that carries two words:

struct Result
{
  unsigned_word lo;
  unsigned_word hi;
};

Then I've defined two functions, both that should be functionally the same.
They both take 4 words, the low and high bits of a 128bit word, and add
them and return the result. Here's the first:

Result f (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2,
unsigned_word hi2)
{
  Result x;
  unsigned_128 n1 = lo1 + (static_cast<unsigned_128>(hi1) << 64);
  unsigned_128 n2 = lo2 + (static_cast<unsigned_128>(hi2) << 64);
  unsigned_128 r1 = n1 + n2;
  x.lo = r1 & ((static_cast<unsigned_128>(1) << 64) - 1);
  x.hi = r1 >> 64;
  return x;
}

Which inlines nicely at high optimisation level and produces the following
very nice assembly on x86:

movq 8(%rsp), %rsi
movq (%rsp), %rbx
addq 24(%rsp), %rsi
adcq 16(%rsp), %rbx

But then I've attempted to do the same thing with the multi-precision
primitives:

Result g (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2,
unsigned_word hi2)
{
  Result x;
  unsigned_word carryout;
  x.lo = __builtin_addcll(lo1, lo2, 0, &carryout);
  x.hi = __builtin_addcll(hi1, hi2, carryout, &x.carry);
  return x;
}

The code above is simpler, but produces worse assembly

movq    24(%rsp), %rsi
movq    (%rsp), %rbx
addq    16(%rsp), %rbx // Line 1
addq    8(%rsp), %rsi
adcq    $0, %rbx // Line 2

Notice the additional adc of 0, where instead line 1 could be removed and
line 2 replaced with:

adcq    16(%rsp), %rbx

This worse code for the mulitprecision builtins actually gets worse above
128bits, because instead of compiling to a chain of "adc"s there's a mix of
"ors" etc to save and pass on the carries.

So it seems that the "multiprecision builtins" are worse at multiprecision
than complex bit-fiddling into larger times. However the complex
bit-fiddling doesn't generalise well, so I'm wondering if someone can show
me how to use these to produce an efficient "addc" chain (as I presume they
are intended to).
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-users/attachments/20151113/8fb849fc/attachment.html>