[LLVMdev] Bignum development

Bill Hart goodwillhart at googlemail.com
Sun Jun 13 08:16:59 PDT 2010


I think from the C compiler's point of view, it is going to want it to
work for any size above an i64, i.e. all the way up to an i128 so that
if the user of the C compiler does this computation with __uint128_t's
then it will Do The Right Thing TM.

Basically, you want

unsigned long a, b, c, d;

....

const __uint128_t u = (__uint128_t) a + b;
const unsigned long v = u >> 64;
const __uint128_t w = (__uint128_t) c + d + v;

to do the right thing, namely do the first addition with an ADD, the
carry flag holding the shift value v, and do the third addition with a
single ADC if available  (assuming there are no instructions between
which kill the carry flag, in which case a byte would have to be
used).

In IR, this is going to look like:

  %0 = load i64* %a, align 8                      ; <i64> [#uses=1]
  %1 = zext i64 %0 to i128                        ; <i128> [#uses=1]
  %2 = load i64* %b, align 8                      ; <i64> [#uses=1]
  %3 = zext i64 %2 to i128                        ; <i128> [#uses=1]
  %4 = add i128 %1, %3                            ; <i128> [#uses=1]

  %5 = lshr i128 %4, 64                           ; <i128> [#uses=1]
  %6 = trunc i128 %5 to i64                       ; <i64> [#uses=1]

  %7 = load i64* %c, align 8                      ; <i64> [#uses=1]
  %8 = zext i64 %7 to i128                        ; <i128> [#uses=1]
  %9 = load i64* %d, align 8                     ; <i64> [#uses=1]
  %10 = zext i64 %9 to i128                      ; <i128> [#uses=1]
  %11 = add i128 %8, %10                          ; <i128> [#uses=1]

  %12 = zext i64 %6 to i128                      ; <i128> [#uses=1]
  %13 = add i128 %11, %12                         ; <i128> [#uses=1]
 .....

But as I mentioned, some care needs to be taken with loop counters so
that comparing them to their limit does not destroy the carry flag.
And of course the whole optimisation needs to work for the similar
thing involving __uint64_t's on 32 bit machines.

It's one hell of an optimisation. But also a radical speedup (33% in a
loop without unrolling, better than 100% with unrolling).

Bill.

On 13 June 2010 13:22, Duncan Sands <baldrick at free.fr> wrote:
>> Yeah I had a think about it, and I think intrinsics are the wrong way
>> to do it. So I'd say you are likely right.
>
> For this to work well, the way the code generators handle flags will need
> to be improved: currently it is suboptimal, in fact kind of a hack.
>
> Ciao,
>
> Duncan.
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>




More information about the llvm-dev mailing list