[LLVMdev] Bignums

Mike Hamburg mike at shiftleft.org
Fri Apr 1 14:24:40 PDT 2011


With some more tweaking, LLVM comes within ~10% of GCC.  Still, there's 
a difference.  Here's a test case, which does a 128-bit x 128-bit to 
256-bit multiply.

define private i192 @mac(i192 %accum, i64 %x, i64 %y) alwaysinline {
entry:
   %xx = sext i64 %x to i128
   %yy = sext i64 %y to i128
   %zz = mul i128 %xx, %yy
   %zzz = zext i128 %zz to i192
   %out = add i192 %accum, %zzz
   ret i192 %out
}

define [4 x i64] @fullmul([2 x i64] %x, [2 x i64] %y) {
entry:
   %xlo = extractvalue [2 x i64] %x, 0
   %xhi = extractvalue [2 x i64] %x, 1
   %ylo = extractvalue [2 x i64] %y, 0
   %yhi = extractvalue [2 x i64] %y, 1
   %m0 = call i192 @mac(i192 0, i64 %xlo, i64 %ylo)
   %o0 = trunc i192 %m0 to i64
   %out0 = insertvalue [4 x i64] undef, i64 %o0, 0
   %m1 = lshr i192 %m0, 64
   %m10 = call i192 @mac(i192 %m1,  i64 %xhi, i64 %ylo)
   %m11 = call i192 @mac(i192 %m10, i64 %xlo, i64 %yhi)
   %o1 = trunc i192 %m11 to i64
   %out1 = insertvalue [4 x i64] %out0, i64 %o1, 1
   %m2 = lshr i192 %m11, 64
   %m20 = call i192 @mac(i192 %m2, i64 %xhi, i64 %yhi)
   %o2 = trunc i192 %m20 to i64
   %out2 = insertvalue [4 x i64] %out1, i64 %o2, 2
   %m3 = lshr i192 %m20, 64
   %o3 = trunc i192 %m3 to i64
   %out3 = insertvalue [4 x i64] %out2, i64 %o3, 3
   ret [4 x i64] %out3
}

#include <sys/types.h>
extern void fullmul(u_int64_t out[4], u_int64_t a, u_int64_t b, 
u_int64_t c, u_int64_t d);
int main(int argc, char **argv) {
   u_int64_t x[4]={1,2,3,4}, n;
   for (n=0; n<100000000; n++)
     fullmul(x,x[0],x[1],x[2],x[3]);
}
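(For reference, and relevant to question (1) in my earlier mail below: the same computation can be written in plain C with Clang/GCC's unsigned __int128 extension. This is just a sketch of my own mirroring the IR above, not the code I benchmarked; `fullmul_c` is a name I made up.)

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Sketch of the same 2x2-limb schoolbook multiply: each column is
 * accumulated in a 128-bit value, with the possible third-limb carry
 * in column 1 tracked by hand (the IR uses an i192 accumulator). */
static void fullmul_c(uint64_t out[4],
                      const uint64_t x[2], const uint64_t y[2]) {
    u128 acc = (u128)x[0] * y[0];        /* column 0 */
    out[0] = (uint64_t)acc;
    acc >>= 64;

    uint64_t hi = 0;                     /* third limb of the accumulator */
    u128 p = (u128)x[1] * y[0];          /* column 1: two products */
    acc += p;
    if (acc < p) hi++;                   /* detect 128-bit wraparound */
    p = (u128)x[0] * y[1];
    acc += p;
    if (acc < p) hi++;
    out[1] = (uint64_t)acc;

    acc = (acc >> 64) | ((u128)hi << 64);
    acc += (u128)x[1] * y[1];            /* column 2: cannot overflow, since
                                            the result is the product's top
                                            128 bits, which is < 2^128 - 1 */
    out[2] = (uint64_t)acc;
    out[3] = (uint64_t)(acc >> 64);
}
```

Clang compiles this to much the same mul/adc chain, so it's a convenient way to see what the backend does with the carries.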

The assembly is:
     movq    %rdx, %r9
     movq    %r9, %rax
     imulq    %rcx
     movq    %rax, %r10
     movq    %rdx, %r11
     movq    %rsi, %rax
     imulq    %rcx
     movq    %rax, (%rdi)
     movq    %rdx, %rcx   ;;; extraneous
     addq    %r10, %rcx
     adcq    $0, %r11
     sbbq    %r10, %r10
     andq    $1, %r10
     movq    %rsi, %rax
     imulq    %r8
     addq    %rcx, %rax
     movq    %rax, 8(%rdi)
     movq    %rdx, %rcx   ;;; extraneous
     adcq    %r11, %rcx
     adcq    $0, %r10
     movq    %r9, %rax
     imulq    %r8
     addq    %rcx, %rax
     movq    %rax, 16(%rdi)
     adcq    %r10, %rdx
     movq    %rdx, 24(%rdi)
     ret

... which has 2 extraneous mov instructions.  On my laptop (1.66GHz 
Conroe), removing the extra movs and renaming registers drops the time 
from 1.95s to 1.75s, of which 0.4s is loop overhead.  The difference is 
less significant on newer hardware (Nehalem).

It's hard to say what impact such extraneous movs have on a real 
program.  My current testing code uses 4x4 Montgomery multiplies, which 
are longer but have many more extraneous movs.  So far I've only tested 
the GCC version against the LLVM version; the GCC version is 5-10% 
faster, and this time the difference is more pronounced on newer 
hardware, perhaps because the muls and adcs are less of a bottleneck.  
The comparison is further skewed by the fact that the GCC and LLVM 
versions use slightly different algorithms, and the LLVM code seems to 
optimize slightly better in other ways... so the extra movs may be 
hurting it by more than 5-10%.
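(For anyone unfamiliar with the Montgomery step in question: it's the standard REDC reduction. Here's a minimal one-limb sketch of my own in C with unsigned __int128 — my real code works on 4 limbs with inline asm, so the names `inv64` and `redc` and the single-limb restriction are just for illustration.)

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* n^-1 mod 2^64 for odd n, by Newton iteration: each step doubles the
 * number of correct low bits, and 5 steps reach >= 64 bits. */
static uint64_t inv64(uint64_t n) {
    uint64_t x = n;                      /* correct to 3 bits for odd n */
    for (int i = 0; i < 5; i++)
        x *= 2 - n * x;
    return x;
}

/* One-limb Montgomery reduction with R = 2^64: given T < R*N, returns
 * T * R^-1 mod N.  nprime = -N^-1 mod R.  Assumes N < 2^63 so that
 * T + m*N fits in 128 bits without an extra carry limb. */
static uint64_t redc(u128 T, uint64_t N, uint64_t nprime) {
    uint64_t m = (uint64_t)T * nprime;   /* m = T * (-N^-1) mod R */
    u128 t = (T + (u128)m * N) >> 64;    /* T + m*N is divisible by R */
    uint64_t r = (uint64_t)t;
    return r >= N ? r - N : r;           /* t < 2N, one subtract suffices */
}
```

The multi-limb version interleaves this reduction with the schoolbook multiply, which is where all the adc chains (and the extra movs) come from.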

Cheers,
Mike

On 03/30/2011 11:24 AM, Mike Hamburg wrote:
> Hello all!
>
> I'm working on a library with bignum support, and I wanted to try LLVM
> as an apparently simpler and more portable alternative to my current
> design (a Haskell script which spits out mixed C and assembly).  Porting
> the script to use the LLVM bindings instead of the current hack was
> pretty easy.  But I have a few remaining questions:
>
> (1) Are bignums exposed to any higher-level language?  It would be nice
> to be able to write all the code in C and compile it with Clang... but I
> need 256-bit integer support.
>
> (2) Is there a way to convince LLVM's register allocator to do the right
> thing on x86?  I'm getting swaths of code like:
>           movq    88(%rsp), %rax
>           mulq    112(%rsp)
>           movq    %rax, %r15
>           addq    %r11, %r15
>           movq    %rdx, %r14
>           adcq    %rcx, %r14
>           adcq    $0, %r9
> (that's a 64x64 -> 128-bit multiply with a 192-bit accumulate.)  The
> problem is, %r11 and %rcx are dead here.  It should have just added %rax
> and %rdx into them.  This results in more movs, more spills, more code
> and less performance.
>
> (3) Is there a way to avoid this whole mess?  I'm using a script to spit
> out ugly code in large part because GCC and Clang both refuse to unroll
> loops that look like
>
> int i,j;
> // accumulator uses inline asm to do the 64x64->128->192-bit accumulate
> above
> accumulator acc(a[0], b[0]);
> tmp[0] = acc.shift();
> for (j=1; j<n; j++) {
>     for (i=0; i<=j; i++)
>       acc.mac(a[i], b[j-i]);
>     tmp[j] = acc.shift();
> }
>
> where n is a template parameter, and thus known at compile time.  Is
> there some clang pass which will unroll this properly?  -O3
> -funroll-loops doesn't seem to (in fact, -funroll-loops makes it
> worse... it tries to unroll the loops by a factor of 4 instead of
> completely unwinding them).  Is there some opt pass which can fix it?
> This is more relevant if clang can do sane things with the registers
> (its performance is half that of GCC right now), but it'd be nice to know.
>
> Thanks for your time!
> -- Mike Hamburg
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
