[cfe-dev] Loop optimised into div (?)

Mon Oct 28 02:27:11 PDT 2019

Hi Bruce,

I understand what you say, but the optimisation you suggest requires the compiler to generate run-time checks and to split code into several execution blocks, some of them may not get ever executed, or prevent further optimisation, and it adds some overhead on the best case. In my opinion that can be perceived as some overengineering that may not be required. 

My preferred approach is to actually give some trust to the user source code and just 'relax' for some targets the optimisations that are known to be uncertain for such targets.

I found this when compiling a function that will convert a binary number to a sequence of decimal digits expressed in ascii. Every digit is determined by counting the number of subtractions required for power-or-ten factors . For that algorithm, only up to 9 subtractions are required at every step (or factor). Replacing that by divisions is a lot more expensive for the affected targets. 

My point is that the user programer must be assumed to know what she/he is doing, and if she/he decided to place a loop instead of a division on a target with expensive division, then we must just keep the loop.

Based on a previous answer, I figured out that this optimisation is made in "IndVarSimplify::rewriteLoopExitValues”. This is part of the "target-independent” optimisations, and therefore there’s no target specific hook that can be implemented to prevent it. There’s however a hidden "-mllvm -replexitval=never” clang command line option that will prevent ALL exit loop values to be replaced by linear expressions, which will solve this particular issue, but it will also stop other desired optimisations, so it’s not a perfect solution.

In relation to (supposed) “target-independent” optimisations, I presented related issues under the subject "[llvm-dev] [AVR] [MSP430] Code gen improvements for 8 bit and 16 bit targets”. One of my posts under that subject is a long introduction to the whole issue, starting from the eventuality that some LLVM target-independent passes make assumptions that are not really applicable for all targets.

John

> On 28 Oct 2019, at 04:30, Bruce Hoult <brucehoult at sifive.com> wrote:
> 
> On Sun, Oct 27, 2019 at 7:38 PM Bruce Hoult <brucehoult at sifive.com> wrote:
>> 
>> On Wed, Oct 23, 2019 at 10:29 AM Joan Lluch via cfe-dev
>> <cfe-dev at lists.llvm.org> wrote:
>>> 
>>> Hi all,
>>> 
>>> For the following code
>>> 
>>> int countHundred( int num )
>>> {
>>>  int count = 0;
>>>  while ( num >= 100) { count++ ; num = num - 100; }
>>>  return count;
>>> }
>>> 
>>> the loop shown above is optimised away and replaced by ‘udiv’ instruction.
>>> 
>>> This is undesirable for targets requiring libcalls for integer divisions, such as the MSP430.
>> 
>> That's not necessarily the case, unless you know in advance that "num"
>> is a relatively small positive number.
>> 
>> Even on a 16 bit machine, large positive numbers could have loop trip
>> counts up to 327 which is going to be much slower than a library call
>> using even a simple shift-and-subtract division algorithm.
>> 
>> I hope the udiv isn't used until after an initial test to make sure
>> "num" is >100 or at least positive, otherwise using udiv is an actual
>> bug.
>> 
>> I'd suggest doing some testing to find out at which point a library
>> call is actually faster than this loop (probably anything above 2000
>> or 4000 or so), modifying the compiler to not do the transform if the
>> trip count is *known* to be less than this (i.e. continue to use the
>> library routine if the trip count is unknown), and also rewrite the
>> code to:
>> 
>> #define ITER_LIMIT 24 // for example -- determine by experiment and
>> modify compiler appropriately
>> int countHundred( int num )
>> {
>>  if (num > (ITER_LIMIT * 100)) return num / 100;
>>  int count = 0;
>>  while ( num >= 100) { count++ ; num = num - 100; }
>>  return count;
>> }
> 
> I'm seeing the same behaviour with RISC-V rv32i (no multiply or divide
> instructions), even with -O1.
> 
> So this is something that should be done in general for ISAs without
> hardware divide.