[LLVMdev] Bignum development
Bill Hart
goodwillhart at googlemail.com
Sat Jun 12 11:37:12 PDT 2010
On 12 June 2010 00:51, Eli Friedman <eli.friedman at gmail.com> wrote:
> On Fri, Jun 11, 2010 at 3:28 PM, Bill Hart <goodwillhart at googlemail.com> wrote:
>> Hi Eli,
>>
>> On 11 June 2010 22:44, Eli Friedman <eli.friedman at gmail.com> wrote:
>>> On Fri, Jun 11, 2010 at 10:37 AM, Bill Hart <goodwillhart at googlemail.com> wrote:
>>>> a) What plans are there to support addition, subtraction,
>>>> multiplication, division, shifting, logical not and other logic ops
>>>> for >= 64/128 bits (machine word size, whatever)? The documentation
>>>> mentions integers of millions of bits....
>>>
>>> Pretty much everything besides multiplication and division should work
>>> at any width; the result is generally not especially efficient,
>>> though.
>>
>> OK, I only tried multiplication last night when I first had a decent
>> fiddle (been marking exams, so haven't had much time to really get
>> dirty hands, sorry).
>>
>> Of course efficiency is imperative for bignum stuff as it underpins
>> absolutely behemoth projects like Sage (http://www.sagemath.org).
>> Every cycle counts, so to speak. One cycle per limb difference in
>> multiplication makes a nearly 40% difference in absolutely everything
>> for literally many thousands of developers!
>
> Right; improvements are definitely welcome here; you might want to
> take a look at what we currently generate here, though, to see what
> improvements are most necessary.
>
I think I have an example of why it is important to be able to
retrieve the carry and do an add-with-carry. Consider this short C
program:
#include <stdlib.h>
#include <stdio.h>

#define BITS 64

/****************************************
   Types
****************************************/

typedef unsigned long ul;
typedef __uint128_t ull;

typedef ul mp_size;
typedef const ul * mp_src;
typedef ul * mp_dst;
typedef ul * mp_ptr;

/****************************************
   Random routines
****************************************/

ull __randval = (ull) 13993185049168412078UL;
const ull __randprime = (ull) 9223372036854775814UL * 2 + 1;
const ull __randmult = 18148508189596611927UL;

ul ul_randlimb(void)
{
   __randval = (__randval * __randmult) % __randprime;
   return (ul) __randval;
}

/****************************************
   Unsigned multiple precision routines
****************************************/

mp_ptr mp_init(mp_size n)
{
   return malloc(n*sizeof(ul));
}

static inline
ul mp_add_nc(mp_dst r, mp_src a, mp_src b, mp_size n)
{
   long i;
   ul cy;

   const __uint128_t v = (__uint128_t) a[0] + b[0];
   r[0] = (ul) v;
   cy = v >> BITS;

   for (i = 1; i < n; i++) {
      __uint128_t u = (__uint128_t) a[i] + b[i] + cy;
      r[i] = (ul) u;
      cy = u >> BITS;
   }

   return cy;
}

void mp_rand_n(mp_dst r, mp_size n)
{
   mp_size i;

   for (i = 0; i < n; i++)
      r[i] = ul_randlimb();
}

int main(void)
{
   mp_ptr a, b, c;
   ul i;

   a = mp_init(1000);
   b = mp_init(1000);
   c = mp_init(1000);

   mp_rand_n(a, 1000);
   mp_rand_n(b, 1000);

   for (i = 0; i < 2400000; i++)
      mp_add_nc(c, a, b, 1000);

   return 0;
}
Ignore all of it except the mp_add_nc function. As written, this runs at 4
cycles per 64-bit limb of addition on AMD K10. If I fiddle a bit and unroll
the loop, I get 2.5 cycles. But optimal is actually 1.6 cycles.
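The kind of unrolling I mean is along these lines (just a sketch, reusing
the typedefs from the program above; a real version would also handle n not
divisible by 4):

static inline
ul mp_add_nc_unrolled(mp_dst r, mp_src a, mp_src b, mp_size n)
{
   long i;
   ul cy = 0;

   /* process four limbs per iteration; assumes n is a multiple of 4 */
   for (i = 0; i < n; i += 4) {
      __uint128_t u;
      u = (__uint128_t) a[i]     + b[i]     + cy; r[i]     = (ul) u; cy = u >> BITS;
      u = (__uint128_t) a[i + 1] + b[i + 1] + cy; r[i + 1] = (ul) u; cy = u >> BITS;
      u = (__uint128_t) a[i + 2] + b[i + 2] + cy; r[i + 2] = (ul) u; cy = u >> BITS;
      u = (__uint128_t) a[i + 3] + b[i + 3] + cy; r[i + 3] = (ul) u; cy = u >> BITS;
   }

   return cy;
}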
The relevant part of the original (non-unrolled) loop becomes:
%tmp.i = add i64 %indvar.i, 1 ; <i64> [#uses=2]
%22 = load i64* %scevgep.i, align 8 ; <i64> [#uses=1]
%23 = zext i64 %22 to i128 ; <i128> [#uses=1]
%24 = load i64* %scevgep3.i, align 8 ; <i64> [#uses=1]
%25 = zext i64 %24 to i128 ; <i128> [#uses=1]
%26 = zext i64 %cy.02.i to i128 ; <i128> [#uses=1]
%27 = add i128 %23, %26 ; <i128> [#uses=1]
%28 = add i128 %27, %25 ; <i128> [#uses=2]
%29 = trunc i128 %28 to i64 ; <i64> [#uses=1]
store i64 %29, i64* %scevgep4.i, align 8
%30 = lshr i128 %28, 64 ; <i128> [#uses=1]
%31 = trunc i128 %30 to i64 ; <i64> [#uses=1]
%exitcond = icmp eq i64 %tmp.i, 999 ; <i1> [#uses=1]
In other words, it just extends everything to an i128 and adds.
There's no way to tell it that it can add a[i], b[i] and cy with a
single adc. (Well, it could if the loop iteration code weren't messing with
the carry flag.)
Indeed, this compiles to:
.LBB1_7: # %bb.i
# Parent Loop BB1_6 Depth=1
# => This Inner Loop Header: Depth=2
addq (%rbx,%rsi,8), %rdi
movl $0, %r8d
adcq $0, %r8
addq (%r14,%rsi,8), %rdi
adcq $0, %r8
movq %rdi, (%r15,%rsi,8)
incq %rsi
cmpq $1000, %rsi # imm = 0x3E8
movq %r8, %rdi
jne .LBB1_7
So it basically tries to keep track of the carry in %r8 instead of in
the carry flag.
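What I really want is to be able to express that chain as one adc per limb.
At the C level it would look something like the sketch below, assuming a
compiler that provides the x86 _addcarry_u64 intrinsic from <immintrin.h>
(used purely for illustration; it isn't something the IR above gives me):

#include <immintrin.h>

/* Sketch only: one adc per limb, written with the x86 add-with-carry
   intrinsic (assumes _addcarry_u64 is available). */
static inline
unsigned long long mp_add_nc_adc(unsigned long long * r,
      const unsigned long long * a, const unsigned long long * b, long n)
{
   unsigned char cy = 0;
   long i;

   for (i = 0; i < n; i++)
      cy = _addcarry_u64(cy, a[i], b[i], &r[i]);

   return cy;
}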
As hinted, the other optimisation missed here is that instead of comparing
with $1000 it could start %rsi at $-1000 (indexing from the ends of the
arrays), increment it each iteration, and simply do a jnz .LBB1_7, doing
away with the cmpq. I've tried tricking it into doing this in every way I
can think of, but without success so far.
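The kind of rewrite I've been attempting to get that shape is along these
lines (a sketch only; it assumes n > 0 and reuses the typedefs from the
program above):

static inline
ul mp_add_nc_neg(mp_dst r, mp_src a, mp_src b, mp_size n)
{
   long i = -(long) n;
   ul cy = 0;

   /* point just past the end of each operand and run a negative
      index up to zero, so the increment itself can set the flags */
   a += n; b += n; r += n;

   do {
      __uint128_t u = (__uint128_t) a[i] + b[i] + cy;
      r[i] = (ul) u;
      cy = u >> BITS;
      i++;
   } while (i != 0);

   return cy;
}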
Bill.