Ping/RFC: Idea for optimization (test for remainder)

Sun Mar 30 11:33:16 PDT 2014

Hello Benjamin, hello folks!

May I ping the message below (2014-03-22, same title) and
http://llvm.org/bugs/show_bug.cgi?id=19206 ?
The patch I posted to this list on 2014-03-08 seems to work correctly for
unsigned expressions and can be extended to signed ones as well.
I am not sure whether the patch is in the right place.

===

>>>> Consider the expression (x % d) == c where d and c are constants.
>>>> For simplicity let us assume that x is unsigned and 0 <= c < d.
>>>> Let us further assume that d = a * (1 << b) and a is odd.
>>>> Then our expression can be transformed to
>>>> rotate_right(x-c, b) * inverse_mul(a) <= (high_value(x) - c) / d .
>>>> Example [(x % 250) == 3]:
>>>>    sub eax,3
>>>>    ror eax,1
>>>>    imul eax,eax,0x26e978d5  // multiplicative inverse of 125
>>>>    cmp eax,17179869  // (0xffffffff-3) / 250
>>>>    jbe OK
>>>> [...]
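For illustration, here is a minimal C sketch of the transformation above for
32-bit unsigned x. It assumes d != 0 and 0 <= c < d. The helper names
rotate_right and inverse_mul follow the notation of the formula. rem_equals
is a hypothetical name. A compiler would fold b, inverse_mul(a) and the bound
into constants. Here they are computed at run time only so that the sketch
is self-contained.

```c
#include <stdbool.h>
#include <stdint.h>

/* Rotate a 32-bit value right by b bits (0 <= b < 32). */
static uint32_t rotate_right(uint32_t v, unsigned b)
{
    return b == 0 ? v : (v >> b) | (v << (32 - b));
}

/* Multiplicative inverse of an odd a modulo 2^32 via Newton's iteration.
   a * a == 1 (mod 8) for odd a, so the start value is correct to 3 bits;
   each step doubles that (3 -> 6 -> 12 -> 24 -> 48). */
static uint32_t inverse_mul(uint32_t a)
{
    uint32_t inv = a;
    for (int i = 0; i < 4; ++i)
        inv *= 2u - a * inv;
    return inv;
}

/* Tests x % d == c for unsigned 32-bit x, assuming d != 0 and c < d. */
static bool rem_equals(uint32_t x, uint32_t d, uint32_t c)
{
    unsigned b = 0;
    uint32_t a = d;
    while ((a & 1u) == 0) {          /* split d = a * 2^b with a odd */
        a >>= 1;
        ++b;
    }
    uint32_t bound = (UINT32_MAX - c) / d;   /* (high_value(x) - c) / d */
    return rotate_right(x - c, b) * inverse_mul(a) <= bound;
}
```

With d = 250 and c = 3, this gives b = 1, inverse_mul(125) = 0x26e978d5 and a
bound of 17179869. These match the sub/ror/imul/cmp sequence above. The
rotate folds the test for b low zero bits into the same comparison. Any set
bit rotated into the top of the word makes the product exceed the bound,
because the only values that map to 0..bound are the multiples q*a, and
these all lie below 2^(32-b).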

>>> Yep, this is a long-standing issue in the peephole optimizer.
>>> It's not easily fixed because
>>> 1. We don't want to do it early (i.e. before codegen) because
>>>     the resulting expression is harder to analyze wrt. value range.
>>> 2. We can't do it late (in DAGCombiner) because it works top-down
>>>     and has already expanded the operation into the code you posted
>>>     above by the time it sees the compare.

>> Well, I tried a solution using the instruction combiner,
>> and it turned out well.
>> I attach a working patch for unsigned values.
>> The signed version will come later if this patch is accepted.

The signed case is much more complicated, but it nevertheless generates
the same kind of code.

> The problem I see with this approach is that InstCombine is
> primarily a canonicalization pass.

My approach is very similar to the existing function FoldICmpDivCst,
which converts x/d == c and similar comparisons into a range check, so
InstCombine ought to be the right place.
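The kind of range check referred to here can be illustrated with a small C
sketch for unsigned 32-bit operands. This is not the code of FoldICmpDivCst.
It is only an assumed equivalent of the idea: x / d == c holds exactly when
c*d <= x <= c*d + d - 1, and a two-sided test lo <= x <= hi collapses into a
single unsigned comparison. div_equals is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch: x / d == c as one unsigned range check lo <= x <= hi,
   with lo = c*d and hi = min(c*d + d - 1, UINT32_MAX). Requires d != 0. */
static bool div_equals(uint32_t x, uint32_t d, uint32_t c)
{
    uint64_t lo = (uint64_t)c * d;
    if (lo > UINT32_MAX)
        return false;                 /* no 32-bit x has x / d == c */
    uint64_t hi = lo + d - 1;
    if (hi > UINT32_MAX)
        hi = UINT32_MAX;
    /* lo <= x <= hi  <=>  (x - lo) <= (hi - lo) in wrapping unsigned math */
    return (uint32_t)(x - (uint32_t)lo) <= (uint32_t)(hi - lo);
}
```

Clamping hi matters. With d = 3 and c = UINT32_MAX / 3, the shortcut
(x - lo) < d would wrongly accept x = 1.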

> Your change lets it generate a ton of code,
> making life harder for our analysis passes
> (think of what happens when your transformation changes
> a loop condition).

Do you really consider four operations and four constants on a single
expression, in the worst case, to be a "ton of code"?

> We should do this kind of thing later.

Why?
I have seen the compiler perform further optimizations on the generated
code, such as reordering multiplication and addition or reusing common
subexpressions.

> We can't do it in the DAG combining stage where we
> expand other divides because it works top-down.
> Maybe we can put this in CodeGenPrepare, which works on
> IR and runs just before DAG lowering.
> It also has the ability to insert branching in case you
> need it for range checks.

As far as I can see, this optimization needs no branching for range
checks; the code shown above is all that is ever required.

> The pass is a bit of hack though,
> maybe someone else has a better idea.

I hope that others join the discussion.

To make things easier to track, I have created a bug entry:
http://llvm.org/bugs/show_bug.cgi?id=19206

>> How could I detect and include an additional range check
>> which is possible with the same amount of generated code?

Ping.

>> By the way: Is there something like a floored
>> or Euclidean remainder/modulo operation
>> (see http://en.wikipedia.org/wiki/Modulo_operation)?
>> How is it realized?

Ping.
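For reference, both variants can be built on top of a truncating remainder
such as C's % (and LLVM's srem), whose result has the sign of the dividend.
This is a minimal C sketch and says nothing about what LLVM itself provides.
mod_floored and mod_euclid are hypothetical names. The case x = INT32_MIN
with d = -1 is excluded because x % d is undefined there in C.

```c
#include <stdint.h>

/* Floored remainder: result has the sign of the divisor d (d != 0). */
static int32_t mod_floored(int32_t x, int32_t d)
{
    int32_t r = x % d;               /* truncated remainder, sign of x */
    if (r != 0 && ((r < 0) != (d < 0)))
        r += d;                      /* move r to the sign of d */
    return r;
}

/* Euclidean remainder: result is always in [0, |d|) (d != 0). */
static int32_t mod_euclid(int32_t x, int32_t d)
{
    int32_t r = x % d;
    if (r < 0)
        r = (d < 0) ? r - d : r + d; /* add |d| without computing -d */
    return r;
}
```

The correction fires only when the truncated remainder is nonzero and
negative, or has the sign opposite to d. Both functions therefore agree
with % when both operands are non-negative.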

Best regards
Jasper