[LLVMdev] Efficient Pattern matching in Instruction Combine

Sean Silva chisophugis at gmail.com
Wed Aug 13 15:23:03 PDT 2014


On Wed, Aug 13, 2014 at 9:37 AM, suyog sarda <sardask01 at gmail.com> wrote:

> Thanks Sean for the reference.
>
> I will go through it and see if i can implement it for generic boolean
> expression minimization.
>

Hi Suyog,

One thing to keep in mind though is what you are trying to get out of all
this work, as my suggestion might be a lot of work and there could be
simpler solutions that meet your immediate needs.

What do you see as the end result for instcombine's capability with boolean
expression simplification? Is there a particular goal you are working
towards? The same question goes for some of the other folks (coworkers of
yours?) who have been contributing similar instcombines lately.


Also remember that basically anything that can be called a boolean
minimization algorithm is going to have a worst-case exponential run time
in the number of variables. An easy way to see this is that if your
algorithm can decide "this expression simplifies to just False", then you
can use it to solve SAT: if the expression simplifies to just False, then
it is unsatisfiable, otherwise it is satisfiable.
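
As a toy illustration (this is just a sketch I'm writing out here, nothing
from the tree): the only fully general way to prove that an n-variable
expression is identically false is to check it against all 2^n assignments,
or do something equivalent to solving SAT:

#include <cstdint>
#include <functional>

// Returns true iff Expr evaluates to false for every assignment of the
// NumVars boolean inputs -- an O(2^NumVars) loop, i.e. exponential in the
// number of variables.
static bool isAlwaysFalse(unsigned NumVars,
                          const std::function<bool(std::uint64_t)> &Expr) {
  for (std::uint64_t Assign = 0; Assign < (1ULL << NumVars); ++Assign)
    if (Expr(Assign))   // bit i of Assign is the value of variable i
      return false;
  return true;
}

// Example: A & ~A "simplifies to false", i.e. it is unsatisfiable:
//   isAlwaysFalse(1, [](std::uint64_t V) { return (V & 1) && !(V & 1); });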

-- Sean Silva


>
> Regards,
> Suyog
>
>
> On Wed, Aug 13, 2014 at 2:30 AM, Sean Silva <chisophugis at gmail.com> wrote:
>
>> Re-adding the mailing list (remember to hit "reply all")
>>
>>
>> On Tue, Aug 12, 2014 at 9:36 AM, suyog sarda <sardask01 at gmail.com> wrote:
>>
>>> Thanks Sean for the reply.
>>>
>>>
>>> On Mon, Aug 11, 2014 at 11:49 PM, Sean Silva <chisophugis at gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>>
>>>> On Fri, Aug 8, 2014 at 1:34 PM, suyog sarda <sardask01 at gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Duncan, David, Sean.
>>>>>
>>>>> Thanks for your reply.
>>>>>
>>>>>
>>>>> > It'd be interesting if you could find a design that also treated
>>>>> > these the same:
>>>>> >
>>>>> >   (B ^ A) | ((A ^ B) ^ C) -> (A ^ B) | C
>>>>> >   (B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C
>>>>> >   (B ^ A) | ((C ^ A) ^ B) -> (A ^ B) | C
>>>>> >
>>>>> > I.e., `^` is also associative.
>>>>>
>>>>> Agree with Duncan on including associative operations too.
>>>>>
>>>>> > Can we handle this by just having a canonical ordering? Or is that
>>>>> > too difficult to maintain through various instcombines?
>>>>>
>>>>> Yes, it's the easiest way to do that. If I am not wrong, what Sean is
>>>>> suggesting is that if we get
>>>>>
>>>>> something like
>>>>>
>>>>> (B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C
>>>>>
>>>>> and we have written the pass for the pattern
>>>>>
>>>>> (A ^ B) | ((B ^ C) ^ A) -> (A ^ B) | C
>>>>>
>>>>> we could just swap 'B' and 'A' before matching the pattern, i.e., we do
>>>>> canonical re-ordering.
>>>>> (Sean, correct me if I am wrong on my understanding of what you meant
>>>>> by canonical ordering.)
>>>>>
>>>>
>>>> You understood correctly.
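>>>>
>>>> Roughly what I have in mind (just a sketch; the rank() function here is
>>>> made up and not an existing LLVM API) is to pick a deterministic order
>>>> for the operands of commutative instructions and enforce it before the
>>>> pattern matching runs, so each pattern only needs one spelling:
>>>>
>>>> #include <cstdint>
>>>> #include "llvm/IR/Constant.h"
>>>> #include "llvm/IR/InstrTypes.h"
>>>> using namespace llvm;
>>>>
>>>> // Placeholder rank: constants sort low; everything else falls back to
>>>> // pointer identity.  Illustration only -- a real scheme would need a
>>>> // stable, meaningful ordering (e.g. by "complexity" of the expression).
>>>> static uintptr_t rank(Value *V) {
>>>>   return isa<Constant>(V) ? 0 : reinterpret_cast<uintptr_t>(V);
>>>> }
>>>>
>>>> // Put the higher-ranked operand first, so (B ^ A) and (A ^ B) always
>>>> // end up spelled the same way before any matcher looks at them.
>>>> static void canonicalizeOperands(BinaryOperator &I) {
>>>>   if (!I.isCommutative())
>>>>     return;
>>>>   Value *LHS = I.getOperand(0), *RHS = I.getOperand(1);
>>>>   if (rank(LHS) < rank(RHS)) {
>>>>     I.setOperand(0, RHS);
>>>>     I.setOperand(1, LHS);
>>>>   }
>>>> }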
>>>>
>>>>
>>>>>
>>>>> I have seen in the "instcombine" module that std::swap is used many
>>>>> times for exactly this.
>>>>>
>>>>> But it becomes very repetitive and somewhat irritating to keep doing
>>>>> this for every pattern.
>>>>> Shouldn't we do this in a more generic way, where we do not have to
>>>>> worry about swapping operands?
>>>>> (How to do it, I don't know yet. I looked at the matchers'
>>>>> implementation for possible modification/improvement,
>>>>> but didn't find any clue on it.)
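>>>>>
>>>>> To make the repetition concrete, this is roughly the shape such code
>>>>> takes today (an illustrative fragment written from memory, not a quote
>>>>> of the actual instcombine source):
>>>>>
>>>>> #include "llvm/IR/InstrTypes.h"
>>>>> #include "llvm/IR/PatternMatch.h"
>>>>> using namespace llvm;
>>>>> using namespace llvm::PatternMatch;
>>>>>
>>>>> // (~A & B) | A -> A | B.  m_And() matches its operands positionally,
>>>>> // so the swapped spelling (B & ~A) needs a second, nearly identical
>>>>> // match call -- and every such pattern ends up duplicating it (or
>>>>> // doing an explicit std::swap of the operands and retrying).
>>>>> static Instruction *foldOrOfNotAnd(BinaryOperator &Or) {
>>>>>   Value *A, *B;
>>>>>   Value *Op0 = Or.getOperand(0), *Op1 = Or.getOperand(1);
>>>>>   if (match(Op0, m_And(m_Not(m_Value(A)), m_Value(B))) ||
>>>>>       match(Op0, m_And(m_Value(B), m_Not(m_Value(A)))))
>>>>>     if (match(Op1, m_Specific(A)))
>>>>>       return BinaryOperator::CreateOr(A, B);
>>>>>   return nullptr;
>>>>> }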
>>>>>
>>>>>
>>>>> > It seems to me that we could rejigger Reassociate to reduce the
>>>>> > number of possibilities that InstCombine could expect.
>>>>>
>>>>> Agree with David that, in fact, it should be the reassociate pass that
>>>>> handles this.
>>>>>
>>>>> But the following is something interesting and rather worrying that I
>>>>> have found:
>>>>> (I have omitted unimportant code and highlighted the important parts in
>>>>> the following example.)
>>>>> *e.g.*
>>>>>
>>>>> ((~A & B) | A)  -->  (A | B) ; Code is implemented for this already
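>>>>>
>>>>> (For reference, the identity itself is just distribution plus
>>>>> complement: (~A & B) | A = (A | ~A) & (A | B) = 1 & (A | B) = A | B.)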
>>>>>
>>>>> *C code :*
>>>>>
>>>>> suyog at suyog-Inspiron-N5010:~$ cat 1.c
>>>>> #include<stdio.h>
>>>>> int cal(int a, int b) {
>>>>> *return ((~a & b) | a);*
>>>>> }
>>>>>
>>>>> int main(){
>>>>> int a, b;
>>>>> scanf("%d %d", &a, &b);
>>>>> return cal(a,b);
>>>>> }
>>>>>
>>>>> LLVM IR at O0 :
>>>>>
>>>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O0 -emit-llvm
>>>>> 1.c
>>>>>
>>>>> ; Function Attrs: nounwind
>>>>> define i32 @cal(i32 %a, i32 %b) #0 {
>>>>>   %1 = xor i32 %a, -1
>>>>>   %2 = and i32 %1, %b
>>>>>   %3 = or i32 %2, %a
>>>>>   ret i32 %3
>>>>> }
>>>>>
>>>>> ; Function Attrs: nounwind
>>>>> define i32 @main() #0 {
>>>>>   ..
>>>>>   ..
>>>>>   %4 = call i32 @cal(i32 %2, i32 %3)
>>>>>   ret i32 %4
>>>>> }
>>>>>
>>>>> *LLVM IR at instcombine :*
>>>>>
>>>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/opt -S -instcombine 1.ll
>>>>>
>>>>> ; Function Attrs: nounwind
>>>>> define i32 @cal(i32 %a, i32 %b) #0 {
>>>>>   %1 = or i32 %a, %b
>>>>>   ret i32 %1
>>>>> }
>>>>> ..
>>>>> ..
>>>>> ..
>>>>>
>>>>> which is OK, as per the expected transform.
>>>>>
>>>>>
>>>>>
>>>>> *LLVM IR at reassociate and instcombine:*
>>>>>
>>>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/opt -S -reassociate -instcombine 2.ll
>>>>>
>>>>> ; Function Attrs: nounwind
>>>>> define i32 @cal(i32 %a, i32 %b) #0 {
>>>>>   %1 = xor i32 %a, -1
>>>>>   %2 = and i32 %b, %1
>>>>>   %3 = or i32 %2, %a
>>>>>   ret i32 %3
>>>>> }
>>>>>
>>>>> ..
>>>>> ..
>>>>>
>>>>> Reassociate converted (~a & b) to (b & ~a), and instruction combine
>>>>> then failed to optimize.
>>>>>
>>>>> As is evident, the reassociate pass sometimes actually harms the
>>>>> optimization opportunity!!
>>>>>
>>>>> Some more interesting findings on how this can affect the final
>>>>> machine code:
>>>>>
>>>>> *Test case :*
>>>>> 1.c :
>>>>>
>>>>> #include<stdio.h>
>>>>> int cal(int a, int b) {
>>>>> *return ((~a & b) | a);*
>>>>> }
>>>>>
>>>>> int main(){
>>>>> int a, b;
>>>>> scanf("%d %d", &a, &b);
>>>>> return cal(a,b);
>>>>> }
>>>>>
>>>>> *X86 .s file with clang at O2 for the above program:*
>>>>>
>>>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c
>>>>>
>>>>> main:                                   # @main
>>>>> # BB#0:
>>>>>     subl    $28, %esp
>>>>>     leal    20(%esp), %eax
>>>>>     movl    %eax, 8(%esp)
>>>>>     leal    24(%esp), %eax
>>>>>     movl    %eax, 4(%esp)
>>>>>     movl    $.L.str, (%esp)
>>>>>     calll    __isoc99_scanf
>>>>>     movl    20(%esp), %eax
>>>>> *    orl    24(%esp), %eax*
>>>>>     addl    $28, %esp
>>>>>     retl
>>>>>
>>>>> As seen, the optimization happened at the IR level itself and is
>>>>> reflected in the .s file.
>>>>>
>>>>> *GCC output for the same:*
>>>>>
>>>>> suyog at suyog-Inspiron-N5010:~$ gcc -S -O2 1.c
>>>>>
>>>>> main:
>>>>> .LFB23:
>>>>>     .cfi_startproc
>>>>>     pushl    %ebp
>>>>>     .cfi_def_cfa_offset 8
>>>>>     .cfi_offset 5, -8
>>>>>     movl    %esp, %ebp
>>>>>     .cfi_def_cfa_register 5
>>>>>     andl    $-16, %esp
>>>>>     subl    $32, %esp
>>>>>     leal    28(%esp), %eax
>>>>>     movl    %eax, 8(%esp)
>>>>>     leal    24(%esp), %eax
>>>>>     movl    %eax, 4(%esp)
>>>>>     movl    $.LC0, (%esp)
>>>>>     call    __isoc99_scanf
>>>>>     movl    24(%esp), %eax
>>>>> *    orl    28(%esp), %eax*
>>>>>     leave
>>>>>     .cfi_restore 5
>>>>>     .cfi_def_cfa 4, 4
>>>>>     ret
>>>>>     .cfi_endproc
>>>>>
>>>>> GCC also did the optimization.
>>>>>
>>>>> Now we just *slightly flip* the test case :
>>>>>
>>>>>
>>>>>
>>>>> *1.c Test case:*
>>>>>
>>>>> #include<stdio.h>
>>>>> int cal(int a, int b) {
>>>>> *return ((b & ~a) | a);*
>>>>> }
>>>>>
>>>>> int main(){
>>>>> int a, b;
>>>>> scanf("%d %d", &a, &b);
>>>>> return cal(a,b);
>>>>> }
>>>>>
>>>>>
>>>>>
>>>>> *X86 .s file with clang at O2 for the above program:*
>>>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c
>>>>>
>>>>> main:                                   # @main
>>>>> # BB#0:
>>>>>     subl    $28, %esp
>>>>>     leal    20(%esp), %eax
>>>>>     movl    %eax, 8(%esp)
>>>>>     leal    24(%esp), %eax
>>>>>     movl    %eax, 4(%esp)
>>>>>     movl    $.L.str, (%esp)
>>>>>     calll    __isoc99_scanf
>>>>>     movl    24(%esp), %ecx
>>>>>     movl    %ecx, %eax
>>>>>     notl    %eax
>>>>>     andl    20(%esp), %eax
>>>>>     orl     %ecx, %eax
>>>>>     addl    $28, %esp
>>>>>     retl
>>>>>
>>>>> We lost the optimization opportunity here!!
>>>>>
>>>>> *GCC output for the same:*
>>>>>
>>>>> suyog at suyog-Inspiron-N5010:~$ gcc -S -O2 1.c
>>>>>
>>>>> main:
>>>>> .LFB23:
>>>>>     .cfi_startproc
>>>>>     pushl    %ebp
>>>>>     .cfi_def_cfa_offset 8
>>>>>     .cfi_offset 5, -8
>>>>>     movl    %esp, %ebp
>>>>>     .cfi_def_cfa_register 5
>>>>>     andl    $-16, %esp
>>>>>     subl    $32, %esp
>>>>>     leal    28(%esp), %eax
>>>>>     movl    %eax, 8(%esp)
>>>>>     leal    24(%esp), %eax
>>>>>     movl    %eax, 4(%esp)
>>>>>     movl    $.LC0, (%esp)
>>>>>     call    __isoc99_scanf
>>>>>     movl    24(%esp), %eax
>>>>> *    orl    28(%esp), %eax*
>>>>>     leave
>>>>>     .cfi_restore 5
>>>>>     .cfi_def_cfa 4, 4
>>>>>     ret
>>>>>     .cfi_endproc
>>>>>
>>>>> GCC still optimized it!!
>>>>>
>>>>> It is clearly evident that LLVM is losing out because of this naive
>>>>> approach to pattern matching.
>>>>>
>>>>> How can we improve this so that LLVM also recognizes
>>>>> commutative/associative patterns efficiently with a single 'matchers'
>>>>> call? I don't have any idea so far.
>>>>> I will think more on it. Any help would be really helpful.
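>>>>>
>>>>> One rough idea (purely a sketch -- as far as I know no such helper
>>>>> exists in PatternMatch.h today): wrap the binary matchers so that a
>>>>> single call tries both operand orders:
>>>>>
>>>>> #include "llvm/IR/PatternMatch.h"
>>>>> using namespace llvm;
>>>>> using namespace llvm::PatternMatch;
>>>>>
>>>>> // Hypothetical "commutative and": one call accepts both (L & R) and
>>>>> // (R & L).  The same wrapper could be written for Or/Xor, so the
>>>>> // individual combines would no longer need the swap-and-retry dance.
>>>>> template <typename LTy, typename RTy>
>>>>> static bool matchCommutativeAnd(Value *V, const LTy &L, const RTy &R) {
>>>>>   return match(V, m_And(L, R)) || match(V, m_And(R, L));
>>>>> }
>>>>>
>>>>> // Usage: matches both (~A & B) and (B & ~A) with a single call.
>>>>> //   Value *A, *B;
>>>>> //   if (matchCommutativeAnd(Op0, m_Not(m_Value(A)), m_Value(B))) ...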
>>>>>
>>>>> Any suggestions/comments/corrections are most welcome :)
>>>>>
>>>>
>>>>
>>>> I'm actually sort of wondering if instead of hardcoding all these
>>>> bitwise simplifications, we can just use a general boolean minimization
>>>> algorithm.
>>>>
>>>> Is there a reason we don't? Maybe it's just compile time? But
>>>> eventually (maybe already?) adding enough special case combines will be
>>>> slower than using the general optimization.
>>>>
>>>
>>> Can you please elaborate more on "general boolean minimization
>>> algorithm"? Is it a specific algorithm? I couldn't find any proper
>>> reference to it. If you can provide a reference for it, I will try to
>>> see if I can implement it.
>>>
>>
>>> Can a K-Map be an option to generalize the logic?
>>> (K-Maps are all I could think of for simplification, from my college days
>>> :))
>>>
>>
>> Yeah, K-Maps are one way to do it; however, K-Maps are really most useful
>> when your target operations are all NAND/NOR, which isn't the case here
>> (and you can also do Reed-Muller for XOR/XNOR); instcombine has a very
>> different set of "primitive operations" and relative costs (i.e., XOR,
>> AND, OR, and NOT are usually the base operations and all cost about the
>> same amount of resources to execute).
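>>
>> (Concretely: a K-map hands you a sum-of-products form, e.g. it would
>> rewrite a ^ b as (a & ~b) | (~a & b), while a Reed-Muller style form
>> keeps the XOR directly -- which one is "smaller" depends entirely on
>> which operations you treat as primitive and what they cost.)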
>>
>> Since optimization of binary expressions is of such enormous importance
>> to the electronics industry, this topic is extremely well researched, so
>> there should be a lot to draw from. There may be stuff targeted
>> specifically at software compilers, but I haven't looked in depth (maybe
>> try Muchnick section 12.3 for references? I don't have the book handy).
>>
>> For starters, you may want to check out Knuth volume 4A, section 7.1. I'm
>> working from memory here, but IIRC there is a whole part about
>> minimization/optimization of binary expressions. Knuth as usual provides
>> tons of references to the literature in case the text there is not enough.
>> You can see preprint copies of volume 4A here:
>> http://www.cs.utsa.edu/~wagner/knuth/ (although Knuth is the sort of
>> book that any programmer should have on their bookshelf; in fact, you can
>> probably justify it as a work expense).
>>
>>
>>
>>>
>>> Also, a few more things I would like to understand,
>>> from the above example I have shown.
>>>
>>> When we have the test case:
>>>
>>> #include<stdio.h>
>>> int cal(int a, int b) {
>>> *return ((~a & b) | a);*
>>> }
>>>
>>> int main(){
>>> int a, b;
>>> scanf("%d %d", &a, &b);
>>> return cal(a,b);
>>> }
>>>
>>> *X86 .s file with clang at O2 for the above program:*
>>>
>>>  main:                                   # @main
>>> # BB#0:
>>>     subl    $28, %esp
>>>     leal    20(%esp), %eax
>>>     movl    %eax, 8(%esp)
>>>     leal    24(%esp), %eax
>>>     movl    %eax, 4(%esp)
>>>     movl    $.L.str, (%esp)
>>>     calll    __isoc99_scanf
>>>     movl    20(%esp), %eax
>>> *    orl    24(%esp), %eax*
>>>     addl    $28, %esp
>>>     retl
>>>
>>> As seen, we optimize the pattern.
>>>
>>> But when we have the test case:
>>>
>>> #include<stdio.h>
>>> int cal(int a, int b) {
>>> *return ((b & ~a) | a);*
>>> }
>>>
>>> int main(){
>>> int a, b;
>>> scanf("%d %d", &a, &b);
>>> return cal(a,b);
>>> }
>>>
>>>
>>>
>>> *X86 .s file with clang at O2 for the above program:*
>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c
>>>
>>> main:                                   # @main
>>> # BB#0:
>>>     subl    $28, %esp
>>>     leal    20(%esp), %eax
>>>     movl    %eax, 8(%esp)
>>>     leal    24(%esp), %eax
>>>     movl    %eax, 4(%esp)
>>>     movl    $.L.str, (%esp)
>>>     calll    __isoc99_scanf
>>>     movl    24(%esp), %ecx
>>>     movl    %ecx, %eax
>>>     notl    %eax
>>>     andl    20(%esp), %eax
>>>     orl     %ecx, %eax
>>>     addl    $28, %esp
>>>     retl
>>>
>>> We fail to optimize.
>>>
>>> Does the 'reassociate' pass always run before 'instcombine'?
>>> Theoretically, it should, in my opinion. But from the output in the first
>>> case, it seems it doesn't.
>>> If 'reassociate' had run before 'instcombine', it would have
>>> converted (~a & b) to (b & ~a),
>>> and the pattern would not have been optimized.
>>>
>>> Is there any logic which determines when to run reassociate and when not
>>> to run it?
>>>
>>
>> I don't know off the top of my head. Try running opt with
>> -debug-pass=Arguments to see what the order is (or I think you should be
>> able to pass -debug-pass=Arguments to clang's -mllvm option).
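>>
>> Something along these lines should print the pass order (file names here
>> are just placeholders):
>>
>>   $ opt -O2 -debug-pass=Arguments -disable-output 1.ll
>>   $ clang -O2 -mllvm -debug-pass=Arguments -c 1.c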
>>
>> -- Sean Silva
>>
>>
>>>
>>> Duncan, David, Sean - any comments on this are most welcome :)
>>>
>>> Regards,
>>> Suyog
>>>
>>>
>>
>
>
> --
> With regards,
> Suyog Sarda
>