[cfe-dev] Loop optimised into div (?)

Mon Oct 28 17:30:24 PDT 2019

On Mon, Oct 28, 2019 at 4:26 PM Robinson, Paul via cfe-dev
<cfe-dev at lists.llvm.org> wrote:
>
>
>
> > -----Original Message-----
> > From: cfe-dev <cfe-dev-bounces at lists.llvm.org> On Behalf Of Joerg
> > Sonnenberger via cfe-dev
> > Sent: Monday, October 28, 2019 6:54 PM
> > To: cfe-dev at lists.llvm.org
> > Subject: Re: [cfe-dev] Loop optimised into div (?)
> >
> > On Mon, Oct 28, 2019 at 03:15:37PM -0700, Bruce Hoult via cfe-dev wrote:
> > > On Mon, Oct 28, 2019 at 2:28 PM Joerg Sonnenberger via cfe-dev
> > > <cfe-dev at lists.llvm.org> wrote:
> > > >
> > > > On Mon, Oct 28, 2019 at 02:17:50PM -0700, Bruce Hoult via cfe-dev
> > wrote:
> > > > > > I found this when compiling a function that will convert a binary
> > number to a sequence of decimal digits expressed in ascii. Every digit is
> > determined by counting the number of subtractions required for power-or-
> > ten factors . For that algorithm, only up to 9 subtractions are required
> > at every step (or factor). Replacing that by divisions is a lot more
> > expensive for the affected targets.
> > > > >
> > > > > Aha. I see, yes.
> > > > >
> > > > > I think the divisions are not as expensive as you might be assuming.
> > > > > The key is to do the same as for the subtact version and divide by
> > > > > 10000, then when the remainder is less than 10000 divide by 1000,
> > and
> > > > > so on. Don't repeatedly divide by 10 as in the recursive algorithm -
> > -
> > > > > that *will* be slow.
> > > >
> > > > It is also important to note that on any somewhat sane architecture,
> > no
> > > > division will actually be created here. Now if your architecture has
> > > > neither a division nor a multiplication in hardware... Even then,
> > > > division by 10 needs ~10 adds and shift in total.
> > >
> > > How do you do that?
> > >
> > > The only ways I know of, division by 10 needs up to wordSize-3
> > > iterations of a loop, with each loop containing two
> > > compare-and-branches, a subtract and an add (or OR) controlled by one
> > > of the compares, and two shifts. That's eight instructions on most
> > > ISAs.
> >
> > Division by ten is essentially multiplication by 0xcccd for 16bit ops
> > and shifting the result by 15 or so. That's multiplying by 0xc (shift
> > and add), 0x1111 = 0x101 * 0x11 (two adds and two shifts).
> > and adding the original value. Roughly at least. Double the number if
> > there is no good way to express 32bit numbers.
> >
> > Joerg
>
> The book _Hacker's Delight_ has a whole chapter on integer division by
> a constant, with proofs, turning into mostly multiplies and shifts.

Here are functions based on multiplying by 0xcccd and probably the
best Hacker's delight algorithm:

uint16_t div10(uint16_t n){
  uint32_t p = n * (uint32_t)0xcccd;
  return p>>19;
}

uint16_t divu10(uint16_t n) {
    uint16_t q, r;
    q = (n >> 1) + (n >> 2);
    q = q + (q >> 4);
    q = q + (q >> 8);
    //q = q + (q >> 16);
    q = q >> 3;
    r = n - (((q << 2) + q) << 1);
    return q + (r > 9);
}

I've verified they both work for all values of n by compiling for
x86_64 and risc-v. I can compile for msp430, but I don't have runtime
libraries or an emulator for it.

The MSP430 code generated for the above functions:

00000018 <div10>:
  18: 0b 12        push r11
  1a: 0a 12        push r10
  1c: 09 12        push r9
  1e: 08 12        push r8
  20: 08 4f        mov r15, r8
  22: 09 43        clr r9
  24: 0c 48        mov r8, r12
  26: 0d 49        mov r9, r13
  28: 0c 5c        rla r12
  2a: 0d 6d        rlc r13
  2c: 0c 58        add r8, r12
  2e: 0d 69        addc r9, r13
  30: 0c 5c        rla r12
  32: 0d 6d        rlc r13
  34: 0c 5c        rla r12
  36: 0d 6d        rlc r13
  38: 0e 4c        mov r12, r14
  3a: 0f 4d        mov r13, r15
  3c: 0e 5e        rla r14
  3e: 0f 6f        rlc r15
  40: 0e 5e        rla r14
  42: 0f 6f        rlc r15
  44: 0e 5e        rla r14
  46: 0f 6f        rlc r15
  48: 0e 5e        rla r14
  4a: 0f 6f        rlc r15
  4c: 0e 5c        add r12, r14
  4e: 0f 6d        addc r13, r15
  50: 0a 4e        mov r14, r10
  52: 0b 4f        mov r15, r11
  54: 4b ee        xor.b r14, r11
  56: 0b ee        xor r14, r11
  58: 8b 10        swpb r11
  5a: 4a 4a        mov.b r10, r10
  5c: 8a 10        swpb r10
  5e: 0e 5a        add r10, r14
  60: 0f 6b        addc r11, r15
  62: 0e 58        add r8, r14
  64: 0f 69        addc r9, r15
  66: 0d 4f        mov r15, r13
  68: 0e 4d        mov r13, r14
  6a: 0f 43        clr r15
  6c: 12 c3        clrc
  6e: 0f 10        rrc r15
  70: 0e 10        rrc r14
  72: 12 c3        clrc
  74: 0f 10        rrc r15
  76: 0e 10        rrc r14
  78: 12 c3        clrc
  7a: 0f 10        rrc r15
  7c: 0e 10        rrc r14
  7e: 0f 4e        mov r14, r15
  80: 38 41        pop r8
  82: 39 41        pop r9
  84: 3a 41        pop r10
  86: 3b 41        pop r11
  88: 30 41        ret

0000008a <divu10>:
  8a: 0e 4f        mov r15, r14
  8c: 12 c3        clrc
  8e: 0e 10        rrc r14
  90: 12 c3        clrc
  92: 0e 10        rrc r14
  94: 0d 4f        mov r15, r13
  96: 12 c3        clrc
  98: 0d 10        rrc r13
  9a: 0d 5e        add r14, r13
  9c: 0e 4d        mov r13, r14
  9e: 12 c3        clrc
  a0: 0e 10        rrc r14
  a2: 12 c3        clrc
  a4: 0e 10        rrc r14
  a6: 12 c3        clrc
  a8: 0e 10        rrc r14
  aa: 12 c3        clrc
  ac: 0e 10        rrc r14
  ae: 0d 5e        add r14, r13
  b0: 0e 4d        mov r13, r14
  b2: 8e 10        swpb r14
  b4: 4e 4e        mov.b r14, r14
  b6: 0e 5d        add r13, r14
  b8: 12 c3        clrc
  ba: 0e 10        rrc r14
  bc: 12 c3        clrc
  be: 0e 10        rrc r14
  c0: 12 c3        clrc
  c2: 0e 10        rrc r14
  c4: 0d 4e        mov r14, r13
  c6: 0d 5d        rla r13
  c8: 0d 5d        rla r13
  ca: 0d 5e        add r14, r13
  cc: 0d 5d        rla r13
  ce: 0f 8d        sub r13, r15
  d0: 0d 4f        mov r15, r13
  d2: 1f 43        mov #1, r15 ;r3 As==01
  d4: 3d 90 0a 00 cmp #10, r13 ;#0x000a
  d8: 01 2c        jc $+4      ;abs 0xdc
  da: 0f 43        clr r15
  dc: 0f 5e        add r14, r15
  de: 30 41        ret

Not even counting the pushes, pops, and returns, those have 48 and 41
instructions, independent of the input data (except one conditionally
executed clr r15 in the 2nd one).

You will recall that the simple loop that subtracts a power of ten
between 0 and 9 times uses on average 19.5 instructions on MSP430.
Calling the risc-v general purpose udiv library function uses on
average 40 instructions. I don't have the MSP430 library routine, but
it will be similar.

Both of the above methods are just as bad or worse than a generic
division function known to return a value between 0 and 9.