[LLVMdev] Efficient Pattern matching in Instruction Combine

Wed Aug 13 15:39:09 PDT 2014

Even if you can't implement such an algorithm sanely, ISTM that
auto-generating this code from a table (or whatever), and choosing
canonical results (to avoid a fixpoint issue), rather than what seems
to be hand-additions of every possible set of minimizations on three
variables, is still a better solution, no?

At least then you wouldn't have human errors, and a growing file that
makes any of the non-trivial transforms easy to miss in the noise.

On Tue, Aug 12, 2014 at 2:00 PM, Sean Silva <chisophugis at gmail.com> wrote:
> Re-adding the mailing list (remember to hit "reply all")
>
>
> On Tue, Aug 12, 2014 at 9:36 AM, suyog sarda <sardask01 at gmail.com> wrote:
>>
>> Thanks Sean for the reply.
>>
>>
>> On Mon, Aug 11, 2014 at 11:49 PM, Sean Silva <chisophugis at gmail.com>
>> wrote:
>>>
>>>
>>>
>>>
>>> On Fri, Aug 8, 2014 at 1:34 PM, suyog sarda <sardask01 at gmail.com> wrote:
>>>>
>>>> Hi Duncan, David, Sean.
>>>>
>>>> Thanks for your reply.
>>>>
>>>>
>>>> > It'd be interesting if you could find a design that also treated these
>>>> > the same:
>>>> >
>>>> >   (B ^ A) | ((A ^ B) ^ C) -> (A ^ B) | C
>>>>
>>>> >   (B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C
>>>> >   (B ^ A) | ((C ^ A) ^ B) -> (A ^ B) | C
>>>> >
>>>> > I.e., `^` is also associative.
>>>>
>>>> Agree with Duncan on including associative operation too.
>>>>
>>>> > Can we handle this by just having a canonical ordering? Or is that too
>>>> > difficult to maintain through various instcombines?
>>>>
>>>> Yes, its the easiest way to do that. If i am not wrong, what Sean is
>>>> suggesting is that if we get
>>>>
>>>> something like
>>>>
>>>> (B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C
>>>>
>>>> and we have written pass for pattern
>>>>
>>>> (A ^ B) | ((B ^ C) ^ A) -> (A ^ B) | C
>>>>
>>>> we could just swap 'B' and 'A' before matching the pattern i.e. we do
>>>> canonical re-ordering
>>>> (Sean, Correct me if i am wrong on my understanding of what you meant by
>>>> canonical ordering).
>>>
>>>
>>> You understood correctly.
>>>
>>>>
>>>>
>>>> I have seen in the "instcombine" module that many times std::swap was
>>>> used for the same.
>>>>
>>>> But it becomes too much repetitive and somewhat irritating to keep doing
>>>> this for every pattern.
>>>> Shouldn't we do this in more generic way where we do not have to worry
>>>> about swapping operands?
>>>> (How to do it - i don't know for now. I looked at the matchers
>>>> implementation for modification/improvement,
>>>> but didn't find any clue on it).
>>>>
>>>>
>>>> > It seems to me that we could rejigger Reassociate to reduce the number
>>>> > of possibilities that InstCombine could expect.
>>>>
>>>>
>>>> Agree with David that infact it should be reassociate pass which should
>>>> handle this.
>>>>
>>>> But following is something interesting and rather worrying things i have
>>>> found :
>>>> (I have omitted unimportant code and highlighted important one in
>>>> following example)
>>>> e.x.
>>>>
>>>> ((~A & B) | A)  -->  (A | B) ; Code is implemented for this already
>>>>
>>>> C code :
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ cat 1.c
>>>> #include<stdio.h>
>>>> int cal(int a, int b) {
>>>> return ((~a & b) | a);
>>>> }
>>>>
>>>> int main(){
>>>> int a, b;
>>>> scanf("%d %d", &a, &b);
>>>> return cal(a,b);
>>>> }
>>>>
>>>> LLVM IR at O0 :
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O0 -emit-llvm
>>>> 1.c
>>>>
>>>> ; Function Attrs: nounwind
>>>> define i32 @cal(i32 %a, i32 %b) #0 {
>>>>   %1 = xor i32 %a, -1
>>>>   %2 = and i32 %1, %b
>>>>   %3 = or i32 %2, %a
>>>>   ret i32 %3
>>>> }
>>>>
>>>> ; Function Attrs: nounwind
>>>> define i32 @main() #0 {
>>>>   ..
>>>>   ..
>>>>   %4 = call i32 @cal(i32 %2, i32 %3)
>>>>   ret i32 %4
>>>> }
>>>>
>>>> LLVM IR at instcombine :
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/opt -S -instcombine 1.ll
>>>>
>>>> ; Function Attrs: nounwind
>>>> define i32 @cal(i32 %a, i32 %b) #0 {
>>>>   %1 = or i32 %a, %b
>>>>   ret i32 %1
>>>> }
>>>> ..
>>>> ..
>>>> ..
>>>>
>>>> which is OK as per expected transform.
>>>>
>>>> LLVM IR at reassociate and instcombine :
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/opt -S -reassociate
>>>> -instcombine 2.ll
>>>>
>>>> ; Function Attrs: nounwind
>>>> define i32 @cal(i32 %a, i32 %b) #0 {
>>>>   %1 = xor i32 %a, -1
>>>>   %2 = and i32 %b, %1
>>>>   %3 = or i32 %2, %a
>>>>   ret i32 %3
>>>> }
>>>>
>>>> ..
>>>> ..
>>>>
>>>> Reassociate converted (~a&b) to (b&~a) and instruction combine failed to
>>>> optimize.
>>>>
>>>> As evident, sometimes, reassociate pass actually harms the optimization
>>>> opportunity !!
>>>>
>>>> Some more interesting findings on how can this affect the final machine
>>>> code:
>>>>
>>>> Test case :
>>>> 1.c :
>>>>
>>>> #include<stdio.h>
>>>> int cal(int a, int b) {
>>>> return ((~a & b) | a);
>>>> }
>>>>
>>>> int main(){
>>>> int a, b;
>>>> scanf("%d %d", &a, &b);
>>>> return cal(a,b);
>>>> }
>>>>
>>>> X86 .s file with clang at O2 for above program :
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c
>>>>
>>>> main:                                   # @main
>>>> # BB#0:
>>>>     subl    $28, %esp
>>>>     leal    20(%esp), %eax
>>>>     movl    %eax, 8(%esp)
>>>>     leal    24(%esp), %eax
>>>>     movl    %eax, 4(%esp)
>>>>     movl    $.L.str, (%esp)
>>>>     calll    __isoc99_scanf
>>>>     movl    20(%esp), %eax
>>>>     orl    24(%esp), %eax
>>>>     addl    $28, %esp
>>>>     retl
>>>>
>>>> As seen, optimization happened at IR level itself reflected in .s file.
>>>>
>>>> GCC output for the same:
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ gcc -S -O2 1.c
>>>>
>>>> main:
>>>> .LFB23:
>>>>     .cfi_startproc
>>>>     pushl    %ebp
>>>>     .cfi_def_cfa_offset 8
>>>>     .cfi_offset 5, -8
>>>>     movl    %esp, %ebp
>>>>     .cfi_def_cfa_register 5
>>>>     andl    $-16, %esp
>>>>     subl    $32, %esp
>>>>     leal    28(%esp), %eax
>>>>     movl    %eax, 8(%esp)
>>>>     leal    24(%esp), %eax
>>>>     movl    %eax, 4(%esp)
>>>>     movl    $.LC0, (%esp)
>>>>     call    __isoc99_scanf
>>>>     movl    24(%esp), %eax
>>>>     orl    28(%esp), %eax
>>>>     leave
>>>>     .cfi_restore 5
>>>>     .cfi_def_cfa 4, 4
>>>>     ret
>>>>     .cfi_endproc
>>>>
>>>> GCC also did the optimization.
>>>>
>>>> Now we just slightly flip the test case :
>>>>
>>>> 1.c Test case:
>>>>
>>>> #include<stdio.h>
>>>> int cal(int a, int b) {
>>>> return ((b & ~a) | a);
>>>> }
>>>>
>>>> int main(){
>>>> int a, b;
>>>> scanf("%d %d", &a, &b);
>>>> return cal(a,b);
>>>> }
>>>>
>>>> X86 .s file with clang at O2 for above program :
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c
>>>>
>>>> main:                                   # @main
>>>> # BB#0:
>>>>     subl    $28, %esp
>>>>     leal    20(%esp), %eax
>>>>     movl    %eax, 8(%esp)
>>>>     leal    24(%esp), %eax
>>>>     movl    %eax, 4(%esp)
>>>>     movl    $.L.str, (%esp)
>>>>     calll    __isoc99_scanf
>>>>     movl    24(%esp), %ecx
>>>>     movl    %ecx, %eax
>>>>     notl    %eax
>>>>     andl    20(%esp), %eax
>>>>     orl    %ecx, %eax
>>>>     addl    $28, %esp
>>>>     retl
>>>>
>>>> We lost the Optimization opportunity here !!
>>>>
>>>> GCC output for the same:
>>>>
>>>> suyog at suyog-Inspiron-N5010:~$ gcc -S -O2 1.c
>>>>
>>>> main:
>>>> .LFB23:
>>>>     .cfi_startproc
>>>>     pushl    %ebp
>>>>     .cfi_def_cfa_offset 8
>>>>     .cfi_offset 5, -8
>>>>     movl    %esp, %ebp
>>>>     .cfi_def_cfa_register 5
>>>>     andl    $-16, %esp
>>>>     subl    $32, %esp
>>>>     leal    28(%esp), %eax
>>>>     movl    %eax, 8(%esp)
>>>>     leal    24(%esp), %eax
>>>>     movl    %eax, 4(%esp)
>>>>     movl    $.LC0, (%esp)
>>>>     call    __isoc99_scanf
>>>>     movl    24(%esp), %eax
>>>>     orl    28(%esp), %eax
>>>>     leave
>>>>     .cfi_restore 5
>>>>     .cfi_def_cfa 4, 4
>>>>     ret
>>>>     .cfi_endproc
>>>>
>>>> GCC still optimized it !!
>>>>
>>>> Clearly evident that llvm is losing because of naive approach of pattern
>>>> matching.
>>>>
>>>> How can we improve this so that LLVM also recognizes
>>>> commutative/associative
>>>> pattern  efficiently with single 'matchers' call? I am not having any
>>>> idea so far.
>>>> I will think more on it. Any help would be really helpful.
>>>>
>>>> Any suggestions/comments/rectification are most awaited :)
>>>
>>>
>>>
>>> I'm actually sort of wondering if instead of hardcoding all these bitwise
>>> simplifications, we can just use a general boolean minimization algorithm.
>>>
>>> Is there a reason we don't? Maybe it's just compile time? But eventually
>>> (maybe already?) adding enough special case combines will be slower than
>>> using the general optimization.
>>
>>
>> Can you please elaborate more on "general boolean minimization algorithm"
>> ? Is it a specific algorithm? I couldn't find any proper
>> reference to it. If you can provide a reference for it, i will try to see
>> if i can implement it.
>>
>>
>> Can K-Map be an option to generalize the logic?
>>  (I could think of K-Map for now for simplification from my college days
>> :))
>
>
> Yeah, K-Maps are one way to do it, however K-Maps are really most useful for
> when your target operations are all NAND/NOR, which isn't the case here (and
> you can also do Reed-Muller for XOR/XNOR); instcombine has a very different
> set of "primitive operations" and their relative costs (i.e. XOR, AND, OR,
> and NOT are usually the base operations and cost the same amount of
> resources to execute).
>
> Since optimization of binary expressions is of such enormous importance to
> the electronics industry, this topic is extremely well researched, so there
> should be a lot to draw from. There may be stuff targeted specifically at
> software compilers, but I haven't looked in depth (maybe try Muchnick
> section 12.3 for references? I don't have the book handy).
>
> For starters, you may want to check out Knuth volume 4A, section 7.1. I'm
> working from memory here, but IIRC there is a whole part about
> minimization/optimization of binary expressions. Knuth as usual provides
> tons of references to the literature in case the text there is not enough.
> You can see preprint copies of volume 4A here:
> http://www.cs.utsa.edu/~wagner/knuth/ (although Knuth is the sort of book
> that any programmer should have on their bookshelf; in fact, you can
> probably justify it as a work expense).
>
>
>>
>>
>> Also few more things i will like to understand.
>> From above example i have shown,
>>
>> when we have test case as:
>>
>> #include<stdio.h>
>> int cal(int a, int b) {
>> return ((~a & b) | a);
>> }
>>
>> int main(){
>> int a, b;
>> scanf("%d %d", &a, &b);
>> return cal(a,b);
>> }
>>
>> X86 .s file with clang at O2 for above program :
>>
>> main:                                   # @main
>> # BB#0:
>>     subl    $28, %esp
>>     leal    20(%esp), %eax
>>     movl    %eax, 8(%esp)
>>     leal    24(%esp), %eax
>>     movl    %eax, 4(%esp)
>>     movl    $.L.str, (%esp)
>>     calll    __isoc99_scanf
>>     movl    20(%esp), %eax
>>     orl    24(%esp), %eax
>>     addl    $28, %esp
>>     retl
>>
>> As seen, we optimize the pattern.
>>
>> but when we have test case as :
>>
>> #include<stdio.h>
>> int cal(int a, int b) {
>> return ((b & ~a) | a);
>> }
>>
>> int main(){
>> int a, b;
>> scanf("%d %d", &a, &b);
>> return cal(a,b);
>> }
>>
>> X86 .s file with clang at O2 for above program :
>>
>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c
>>
>> main:                                   # @main
>> # BB#0:
>>     subl    $28, %esp
>>     leal    20(%esp), %eax
>>     movl    %eax, 8(%esp)
>>     leal    24(%esp), %eax
>>     movl    %eax, 4(%esp)
>>     movl    $.L.str, (%esp)
>>     calll    __isoc99_scanf
>>     movl    24(%esp), %ecx
>>     movl    %ecx, %eax
>>     notl    %eax
>>     andl    20(%esp), %eax
>>     orl    %ecx, %eax
>>     addl    $28, %esp
>>     retl
>>
>>  we fail  to optimize.
>>
>> Does, 'reassociate' pass always run before 'instcombine' ?
>> Theoretically, it should in my opinion. But from the output in first case,
>> it seems it doesn't.
>> If 'reassociate' would have run before 'instcombine', it would have
>> converted (~a&b) to (b&~a)
>> and the pattern would have not been optimized.
>>
>> Is there any logic which determines when to run reassociate and when not
>> to run?
>
>
> I don't know off the top of my head. Try running opt with
> -debug-pass=Arguments to see what the order is (or I think you should be
> able to pass -debug-pass=Arguments to clang's -mllvm option).
>
> -- Sean Silva
>
>>
>>
>> Duncan, David, Sean - any comment on this are most welcomed :)
>>
>> Regards,
>> Suyog
>>
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>