[cfe-dev] Loop optimised into div (?)

Joan Lluch via cfe-dev cfe-dev at lists.llvm.org
Mon Oct 28 16:54:38 PDT 2019


Hi Bruce,

I appreciate your testing of this for the rv32i target. I am doing some work on 8-bit and 16-bit targets. For these targets, I do not agree that it is best to replace loops with divisions even when the compiler doesn’t know the number of iterations. I’m not really aware of “automatically generated” code for these targets in particular, but if such code causes a problem, then maybe we must blame the “automator”, not the compiler. In any case, I’m pretty sure that most code for these targets is written by humans.

So my preferred approach is still to trust the user source code and to avoid the introduction of expensive target operations if the user has not explicitly used them (division is very expensive for 8-bit targets with no hardware divide). If the user implemented a loop that could obviously be turned into a division, then there is a high chance that the loop is better than the division in THAT particular scenario. Any compiler attempt to reverse this will most probably result in worse performance. A compiler is just a tool; it should not attempt to replace human-designed algorithms, but just generate the best possible assembly for the original source code.

Please also understand that I only used the ‘countHundred’ function to illustrate the subject. I strongly doubt that a normal programmer would use that function to implement a divide-by-one-hundred as opposed to an actual division, so the point remains that if a programmer wanted to use a loop then there must be a good reason for it, which the compiler should respect. I’m talking at all times about targets with no hardware divide; of course, for the remaining targets the current transformation is not a problem.

John


> On 28 Oct 2019, at 22:17, Bruce Hoult <brucehoult at sifive.com> wrote:
> 
> On Mon, Oct 28, 2019 at 2:27 AM Joan Lluch <joan.lluch at icloud.com> wrote:
>> 
>> Hi Bruce,
>> 
>> I understand what you say, but the optimisation you suggest requires the compiler to generate run-time checks and to split the code into several execution blocks, some of which may never get executed or may prevent further optimisation, and it adds some overhead in the best case. In my opinion that can be perceived as overengineering that may not be required.
> 
> No no -- I was not suggesting the compiler add runtime checks. I was
> suggesting that if the compiler didn't know the input value then it
> should use a divide, and keep the loop only if it knows for sure the
> input number is very small -- for example because that code is guarded
> by the programmer writing an if() statement as I showed. Or possibly
> an assert().
> 
>> My preferred approach is to actually give some trust to the user source code and just 'relax' for some targets the optimisations that are known to be uncertain for such targets.
> 
> I have sympathy for this position, but at the same time a lot of code
> is generated mechanically or using macros, and it's good for the
> compiler to clean it up.
> 
>> I found this when compiling a function that converts a binary number to a sequence of decimal digits expressed in ASCII. Every digit is determined by counting the number of subtractions required for power-of-ten factors. For that algorithm, only up to 9 subtractions are required at every step (or factor). Replacing that with divisions is a lot more expensive for the affected targets.
> 
> Aha. I see, yes.
> 
> I think the divisions are not as expensive as you might be assuming.
> The key is to do the same as for the subtract version and divide by
> 10000, then when the remainder is less than 10000 divide by 1000, and
> so on. Don't repeatedly divide by 10 as in the recursive algorithm --
> that *will* be slow.
> 
> I don't have an MSP430 and libraries handy, but more or less
> understand the instruction set, and I can tell LLVM to generate
> assembly language for it.
> 
> I had a look at the actual code for RISC-V rv32i. If I force compiling
> to a subtraction loop then the complete countHundred(num) function
> executes 5 + 3 * floor(num/100) instructions. On MSP430 or ARM it
> would be 5 + 4 * floor(num/100) as compare-and-branch needs two
> instructions on those.
> 
> With the udiv the complete countHundred(num) function executes 20 + 10
> * ceil(log2(num/100)) instructions on rv32i using Newlib.
> 
> Averaged over all values of num from 0 to 999, the iterative version
> uses 18.5 instructions per call to countHundred() while the udiv
> version uses an average of 39.8 instructions. The udiv version is
> faster for all values of num >= 2200 (which I know won't happen in
> your case, but the compiler doesn't know that).
> 
> So in this use-case the iterative subtraction is a little over twice
> faster. That's *something*, for sure, but it's not like there's an
> order of magnitude difference or anything like that. Unless it's very
> critical you're unlikely to notice.
> 
> I expect results would be roughly similar for the MSP430 version of udiv.



