[cfe-users] Using Multiprecision Arithmetic Builtins effectively
Clinton Mead via cfe-users
cfe-users at lists.llvm.org
Fri Nov 13 03:00:26 PST 2015
I'm trying to get the Multiprecision Arithmetic Builtins producing code as
effective as longer integer types.
Firstly, I've defined some typedefs:
typedef unsigned long long unsigned_word;
typedef __uint128_t unsigned_128;
And a result type, that carries two words:
struct Result
{
unsigned_word lo;
unsigned_word hi;
};
Then I've defined two functions, both that should be functionally the same.
They both take 4 words, the low and high bits of a 128bit word, and add
them and return the result. Here's the first:
Result f (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2,
unsigned_word hi2)
{
Result x;
unsigned_128 n1 = lo1 + (static_cast<unsigned_128>(hi1) << 64);
unsigned_128 n2 = lo2 + (static_cast<unsigned_128>(hi2) << 64);
unsigned_128 r1 = n1 + n2;
x.lo = r1 & ((static_cast<unsigned_128>(1) << 64) - 1);
x.hi = r1 >> 64;
return x;
}
Which inlines nicely at high optimisation level and produces the following
very nice assembly on x86:
movq 8(%rsp), %rsi
movq (%rsp), %rbx
addq 24(%rsp), %rsi
adcq 16(%rsp), %rbx
But then I've attempted to do the same thing with the multi-precision
primitives:
Result g (unsigned_word lo1, unsigned_word hi1, unsigned_word lo2,
unsigned_word hi2)
{
Result x;
unsigned_word carryout;
x.lo = __builtin_addcll(lo1, lo2, 0, &carryout);
x.hi = __builtin_addcll(hi1, hi2, carryout, &x.carry);
return x;
}
The code above is simpler, but produces worse assembly
movq 24(%rsp), %rsi
movq (%rsp), %rbx
addq 16(%rsp), %rbx // Line 1
addq 8(%rsp), %rsi
adcq $0, %rbx // Line 2
Notice the additional adc of 0, where instead line 1 could be removed and
line 2 replaced with:
adcq 16(%rsp), %rbx
This worse code for the mulitprecision builtins actually gets worse above
128bits, because instead of compiling to a chain of "adc"s there's a mix of
"ors" etc to save and pass on the carries.
So it seems that the "multiprecision builtins" are worse at multiprecision
than complex bit-fiddling into larger times. However the complex
bit-fiddling doesn't generalise well, so I'm wondering if someone can show
me how to use these to produce an efficient "addc" chain (as I presume they
are intended to).
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-users/attachments/20151113/8fb849fc/attachment.html>
More information about the cfe-users
mailing list