[cfe-dev] Loop optimised into div (?)

Bruce Hoult via cfe-dev cfe-dev at lists.llvm.org
Mon Oct 28 14:17:50 PDT 2019


On Mon, Oct 28, 2019 at 2:27 AM Joan Lluch <joan.lluch at icloud.com> wrote:
>
> Hi Bruce,
>
> I understand what you say, but the optimisation you suggest requires the compiler to generate run-time checks and to split the code into several execution blocks, some of which may never get executed or may prevent further optimisation, and it adds some overhead even in the best case. In my opinion that can be perceived as overengineering that may not be required.

No, no -- I was not suggesting the compiler add runtime checks. I was
suggesting that if the compiler doesn't know the input value then it
should use a divide, and keep the loop only if it knows for sure the
input number is very small -- for example because the programmer
guarded that code with an if() statement, as I showed. Or possibly
an assert().
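
To make that concrete, here is a minimal sketch (a made-up
digit-counting example of mine, not code from this thread) of the
kind of guard I mean:

    /* Hypothetical example: the if() proves num < 100 on the loop
       path, so the compiler can safely keep a subtraction loop
       there instead of emitting a divide. */
    unsigned tens(unsigned num) {
        unsigned count = 0;
        if (num < 100) {            /* guard: num is known small */
            while (num >= 10) {     /* at most 9 iterations */
                num -= 10;
                count++;
            }
        } else {
            count = num / 10;       /* general case: one divide */
        }
        return count;
    }

With the guard in place, a pass that turns such loops into divides
can see the small trip-count bound and leave the loop alone.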

> My preferred approach is to actually give some trust to the user's source code and just 'relax', for some targets, the optimisations that are known to be uncertain on those targets.

I have sympathy for this position, but at the same time a lot of code
is generated mechanically or using macros, and it's good for the
compiler to clean it up.

> I found this when compiling a function that converts a binary number to a sequence of decimal digits expressed in ASCII. Every digit is determined by counting the number of subtractions required for power-of-ten factors. For that algorithm, only up to 9 subtractions are required at every step (or factor). Replacing that with divisions is a lot more expensive for the affected targets.

Aha. I see, yes.

I think the divisions are not as expensive as you might be assuming.
The key is to do the same as the subtract version does: divide by
10000, then when the remainder is less than 10000 divide by 1000, and
so on. Don't repeatedly divide by 10 as in the recursive algorithm --
that *will* be slow.
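
Something along these lines (an untested sketch of mine, assuming a
16-bit unsigned input so five digits suffice; leading zeros are kept
for simplicity):

    /* Convert num to ASCII decimal with one divide per digit,
       dividing by descending powers of ten rather than repeatedly
       dividing the running value by 10. */
    void to_decimal(unsigned num, char *out) {
        static const unsigned pow10[5] = {10000, 1000, 100, 10, 1};
        for (int i = 0; i < 5; i++) {
            unsigned q = num / pow10[i];   /* this digit, 0..9 */
            num -= q * pow10[i];           /* remainder carries on */
            *out++ = (char)('0' + q);
        }
        *out = '\0';
    }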

I don't have an MSP430 and its libraries handy, but I more or less
understand the instruction set, and I can tell LLVM to generate
assembly language for it.

I had a look at the actual code for RISC-V rv32i. If I force compiling
to a subtraction loop then the complete countHundred(num) function
executes 5 + 3 * floor(num/100) instructions. On MSP430 or ARM it
would be 5 + 4 * floor(num/100), as compare-and-branch needs two
instructions on those targets.

With the udiv the complete countHundred(num) function executes 20 + 10
* ceil(log2(num/100)) instructions on rv32i using Newlib.
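
For reference, the two shapes of countHundred() I'm comparing look
roughly like this (my reconstruction from the discussion; Joan's
actual source may differ):

    /* Subtraction-loop shape: about 3 instructions per iteration
       on rv32i, i.e. 5 + 3 * floor(num/100) in total. */
    unsigned countHundred_loop(unsigned num) {
        unsigned count = 0;
        while (num >= 100) {
            num -= 100;
            count++;
        }
        return count;
    }

    /* udiv shape: one call into Newlib's software divide, about
       20 + 10 * ceil(log2(num/100)) instructions in total. */
    unsigned countHundred_div(unsigned num) {
        return num / 100;
    }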

Averaged over all values of num from 0 to 999, the iterative version
uses 18.5 instructions per call to countHundred() (floor(num/100)
averages 4.5 over that range, so 5 + 3 * 4.5 = 18.5), while the udiv
version uses an average of 39.8 instructions. The udiv version is
faster for all values of num >= 2200 (which I know won't happen in
your case, but the compiler doesn't know that).

So in this use-case the iterative subtraction is a little over twice
as fast. That's *something*, for sure, but it's not an order of
magnitude difference or anything like that. Unless the code is very
performance-critical you're unlikely to notice.

I expect results would be roughly similar for the MSP430 version of udiv.


