[cfe-dev] Loop optimised into div (?)
Bruce Hoult via cfe-dev
cfe-dev at lists.llvm.org
Mon Oct 28 17:30:24 PDT 2019
On Mon, Oct 28, 2019 at 4:26 PM Robinson, Paul via cfe-dev
<cfe-dev at lists.llvm.org> wrote:
>
>
>
> > -----Original Message-----
> > From: cfe-dev <cfe-dev-bounces at lists.llvm.org> On Behalf Of Joerg
> > Sonnenberger via cfe-dev
> > Sent: Monday, October 28, 2019 6:54 PM
> > To: cfe-dev at lists.llvm.org
> > Subject: Re: [cfe-dev] Loop optimised into div (?)
> >
> > On Mon, Oct 28, 2019 at 03:15:37PM -0700, Bruce Hoult via cfe-dev wrote:
> > > On Mon, Oct 28, 2019 at 2:28 PM Joerg Sonnenberger via cfe-dev
> > > <cfe-dev at lists.llvm.org> wrote:
> > > >
> > > > On Mon, Oct 28, 2019 at 02:17:50PM -0700, Bruce Hoult via cfe-dev
> > wrote:
> > > > > > I found this when compiling a function that will convert a binary
> > number to a sequence of decimal digits expressed in ascii. Every digit is
> > determined by counting the number of subtractions required for power-or-
> > ten factors . For that algorithm, only up to 9 subtractions are required
> > at every step (or factor). Replacing that by divisions is a lot more
> > expensive for the affected targets.
> > > > >
> > > > > Aha. I see, yes.
> > > > >
> > > > > I think the divisions are not as expensive as you might be assuming.
> > > > > The key is to do the same as for the subtact version and divide by
> > > > > 10000, then when the remainder is less than 10000 divide by 1000,
> > and
> > > > > so on. Don't repeatedly divide by 10 as in the recursive algorithm -
> > -
> > > > > that *will* be slow.
> > > >
> > > > It is also important to note that on any somewhat sane architecture,
> > no
> > > > division will actually be created here. Now if your architecture has
> > > > neither a division nor a multiplication in hardware... Even then,
> > > > division by 10 needs ~10 adds and shift in total.
> > >
> > > How do you do that?
> > >
> > > The only ways I know of, division by 10 needs up to wordSize-3
> > > iterations of a loop, with each loop containing two
> > > compare-and-branches, a subtract and an add (or OR) controlled by one
> > > of the compares, and two shifts. That's eight instructions on most
> > > ISAs.
> >
> > Division by ten is essentially multiplication by 0xcccd for 16bit ops
> > and shifting the result by 15 or so. That's multiplying by 0xc (shift
> > and add), 0x1111 = 0x101 * 0x11 (two adds and two shifts).
> > and adding the original value. Roughly at least. Double the number if
> > there is no good way to express 32bit numbers.
> >
> > Joerg
>
> The book _Hacker's Delight_ has a whole chapter on integer division by
> a constant, with proofs, turning into mostly multiplies and shifts.
Here are functions based on multiplying by 0xcccd and probably the
best Hacker's delight algorithm:
uint16_t div10(uint16_t n){
uint32_t p = n * (uint32_t)0xcccd;
return p>>19;
}
uint16_t divu10(uint16_t n) {
uint16_t q, r;
q = (n >> 1) + (n >> 2);
q = q + (q >> 4);
q = q + (q >> 8);
//q = q + (q >> 16);
q = q >> 3;
r = n - (((q << 2) + q) << 1);
return q + (r > 9);
}
I've verified they both work for all values of n by compiling for
x86_64 and risc-v. I can compile for msp430, but I don't have runtime
libraries or an emulator for it.
The MSP430 code generated for the above functions:
00000018 <div10>:
18: 0b 12 push r11
1a: 0a 12 push r10
1c: 09 12 push r9
1e: 08 12 push r8
20: 08 4f mov r15, r8
22: 09 43 clr r9
24: 0c 48 mov r8, r12
26: 0d 49 mov r9, r13
28: 0c 5c rla r12
2a: 0d 6d rlc r13
2c: 0c 58 add r8, r12
2e: 0d 69 addc r9, r13
30: 0c 5c rla r12
32: 0d 6d rlc r13
34: 0c 5c rla r12
36: 0d 6d rlc r13
38: 0e 4c mov r12, r14
3a: 0f 4d mov r13, r15
3c: 0e 5e rla r14
3e: 0f 6f rlc r15
40: 0e 5e rla r14
42: 0f 6f rlc r15
44: 0e 5e rla r14
46: 0f 6f rlc r15
48: 0e 5e rla r14
4a: 0f 6f rlc r15
4c: 0e 5c add r12, r14
4e: 0f 6d addc r13, r15
50: 0a 4e mov r14, r10
52: 0b 4f mov r15, r11
54: 4b ee xor.b r14, r11
56: 0b ee xor r14, r11
58: 8b 10 swpb r11
5a: 4a 4a mov.b r10, r10
5c: 8a 10 swpb r10
5e: 0e 5a add r10, r14
60: 0f 6b addc r11, r15
62: 0e 58 add r8, r14
64: 0f 69 addc r9, r15
66: 0d 4f mov r15, r13
68: 0e 4d mov r13, r14
6a: 0f 43 clr r15
6c: 12 c3 clrc
6e: 0f 10 rrc r15
70: 0e 10 rrc r14
72: 12 c3 clrc
74: 0f 10 rrc r15
76: 0e 10 rrc r14
78: 12 c3 clrc
7a: 0f 10 rrc r15
7c: 0e 10 rrc r14
7e: 0f 4e mov r14, r15
80: 38 41 pop r8
82: 39 41 pop r9
84: 3a 41 pop r10
86: 3b 41 pop r11
88: 30 41 ret
0000008a <divu10>:
8a: 0e 4f mov r15, r14
8c: 12 c3 clrc
8e: 0e 10 rrc r14
90: 12 c3 clrc
92: 0e 10 rrc r14
94: 0d 4f mov r15, r13
96: 12 c3 clrc
98: 0d 10 rrc r13
9a: 0d 5e add r14, r13
9c: 0e 4d mov r13, r14
9e: 12 c3 clrc
a0: 0e 10 rrc r14
a2: 12 c3 clrc
a4: 0e 10 rrc r14
a6: 12 c3 clrc
a8: 0e 10 rrc r14
aa: 12 c3 clrc
ac: 0e 10 rrc r14
ae: 0d 5e add r14, r13
b0: 0e 4d mov r13, r14
b2: 8e 10 swpb r14
b4: 4e 4e mov.b r14, r14
b6: 0e 5d add r13, r14
b8: 12 c3 clrc
ba: 0e 10 rrc r14
bc: 12 c3 clrc
be: 0e 10 rrc r14
c0: 12 c3 clrc
c2: 0e 10 rrc r14
c4: 0d 4e mov r14, r13
c6: 0d 5d rla r13
c8: 0d 5d rla r13
ca: 0d 5e add r14, r13
cc: 0d 5d rla r13
ce: 0f 8d sub r13, r15
d0: 0d 4f mov r15, r13
d2: 1f 43 mov #1, r15 ;r3 As==01
d4: 3d 90 0a 00 cmp #10, r13 ;#0x000a
d8: 01 2c jc $+4 ;abs 0xdc
da: 0f 43 clr r15
dc: 0f 5e add r14, r15
de: 30 41 ret
Not even counting the pushes, pops, and returns, those have 48 and 41
instructions, independent of the input data (except one conditionally
executed clr r15 in the 2nd one).
You will recall that the simple loop that subtracts a power of ten
between 0 and 9 times uses on average 19.5 instructions on MSP430.
Calling the risc-v general purpose udiv library function uses on
average 40 instructions. I don't have the MSP430 library routine, but
it will be similar.
Both of the above methods are just as bad or worse than a generic
division function known to return a value between 0 and 9.
More information about the cfe-dev
mailing list