[LLVMdev] Efficient Pattern matching in Instruction Combine

Mon Aug 11 11:19:17 PDT 2014

On Fri, Aug 8, 2014 at 1:34 PM, suyog sarda <sardask01 at gmail.com> wrote:

> Hi Duncan, David, Sean.
>
> Thanks for your reply.
>
>
> > It'd be interesting if you could find a design that also treated these
> > the same:
> >
> >   (B ^ A) | ((A ^ B) ^ C) -> (A ^ B) | C
>
> >   (B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C
> >   (B ^ A) | ((C ^ A) ^ B) -> (A ^ B) | C
> >
> > I.e., `^` is also associative.
>
> Agree with Duncan on including associative operation too.
>
> > Can we handle this by just having a canonical ordering? Or is that too
> difficult to maintain through various instcombines?
>
> Yes, its the easiest way to do that. If i am not wrong, what Sean is
> suggesting is that if we get
>
> something like
>
> (B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C
>
> and we have written pass for pattern
>
> (A ^ B) | ((B ^ C) ^ A) -> (A ^ B) | C
>
> we could just swap 'B' and 'A' before matching the pattern i.e. we do
> canonical re-ordering
> (Sean, Correct me if i am wrong on my understanding of what you meant by
> canonical ordering).
>

You understood correctly.

>
> I have seen in the "instcombine" module that many times std::swap was used
> for the same.
>
> But it becomes too much repetitive and somewhat irritating to keep doing
> this for every pattern.
> Shouldn't we do this in more generic way where we do not have to worry
> about swapping operands?
> (How to do it - i don't know for now. I looked at the matchers
> implementation for modification/improvement,
> but didn't find any clue on it).
>
>
> > It seems to me that we could rejigger Reassociate to reduce the number
> of possibilities that InstCombine could expect.
>
>>
> Agree with David that infact it should be reassociate pass which should
> handle this.
>
> But following is something interesting and rather worrying things i have
> found :
> (I have omitted unimportant code and highlighted important one in
> following example)
> *e.x. *
>
> ((~A & B) | A)  -->  (A | B) ; Code is implemented for this already
>
> *C code :*
>
> suyog at suyog-Inspiron-N5010:~$ cat 1.c
> #include<stdio.h>
> int cal(int a, int b) {
> *return ((~a & b) | a);*
> }
>
> int main(){
> int a, b;
> scanf("%d %d", &a, &b);
> return cal(a,b);
> }
>
> LLVM IR at O0 :
>
> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O0 -emit-llvm 1.c
>
> ; Function Attrs: nounwind
>
>
>
>
> *define i32 @cal(i32 %a, i32 %b) #0 {  %1 = xor i32 %a, -1  %2 = and i32
> %1, %b  %3 = or i32 %2, %a  ret i32 %3*
> }
>
> ; Function Attrs: nounwind
> define i32 @main() #0 {
>   ..
>   ..
>   %4 = call i32 @cal(i32 %2, i32 %3)
>   ret i32 %4
> }
>
> *LLVM IR at instcombine :*
>
> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/opt -S -instcombine 1.ll
>
>
>
>
>
> *; Function Attrs: nounwinddefine i32 @cal(i32 %a, i32 %b) #0 {   %1 = or
> i32 %a, %b  ret i32 %1}*
> ..
> ..
> ..
>
> which is OK as per expected transform.
>
>
>
> *LLVM IR at reassociate and instcombine : *suyog at suyog-Inspiron-N5010:~$
> Open/rbuild/bin/opt -S -reassociate -instcombine 2.ll
>
>
>
>
>
>
>
> *; Function Attrs: nounwinddefine i32 @cal(i32 %a, i32 %b) #0 {  %1 = xor
> i32 %a, -1  %2 = and i32 %b, %1   %3 = or i32 %2, %a  ret i32 %3}*
>
> ..
> ..
>
> Reassociate converted (~a&b) to (b&~a) and instruction combine failed to
> optimize.
>
> As evident, sometimes, reassociate pass actually harms the optimization
> opportunity !!
>
> Some more interesting findings on how can this affect the final machine
> code:
>
> *Test case :*
> 1.c :
>
> #include<stdio.h>
> int cal(int a, int b) {
> *return ((~a & b) | a);*
> }
>
> int main(){
> int a, b;
> scanf("%d %d", &a, &b);
> return cal(a,b);
> }
>
> *X86 .s file with clang at O2 for above program :*
>
> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c
>
> main:                                   # @main
> # BB#0:
>     subl    $28, %esp
>     leal    20(%esp), %eax
>     movl    %eax, 8(%esp)
>     leal    24(%esp), %eax
>     movl    %eax, 4(%esp)
>     movl    $.L.str, (%esp)
>     calll    __isoc99_scanf
>     movl    20(%esp), %eax
> *    orl    24(%esp), %eax*
>     addl    $28, %esp
>     retl
>
> As seen, optimization happened at IR level itself reflected in .s file.
>
> *GCC output for the same:*
>
> suyog at suyog-Inspiron-N5010:~$ gcc -S -O2 1.c
>
> main:
> .LFB23:
>     .cfi_startproc
>     pushl    %ebp
>     .cfi_def_cfa_offset 8
>     .cfi_offset 5, -8
>     movl    %esp, %ebp
>     .cfi_def_cfa_register 5
>     andl    $-16, %esp
>     subl    $32, %esp
>     leal    28(%esp), %eax
>     movl    %eax, 8(%esp)
>     leal    24(%esp), %eax
>     movl    %eax, 4(%esp)
>     movl    $.LC0, (%esp)
>     call    __isoc99_scanf
>     movl    24(%esp), %eax
> *    orl    28(%esp), %eax*
>     leave
>     .cfi_restore 5
>     .cfi_def_cfa 4, 4
>     ret
>     .cfi_endproc
>
> GCC also did the optimization.
>
> Now we just *slightly flip* the test case :
>
>
>
> *1.c Test case:*#include<stdio.h>
> int cal(int a, int b) {
> *return ((b & ~a) | a);*
> }
>
> int main(){
> int a, b;
> scanf("%d %d", &a, &b);
> return cal(a,b);
> }
>
>
>
> *X86 .s file with clang at O2 for above program :*
> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c
>
> main:                                   # @main
> # BB#0:
>     subl    $28, %esp
>     leal    20(%esp), %eax
>     movl    %eax, 8(%esp)
>     leal    24(%esp), %eax
>     movl    %eax, 4(%esp)
>     movl    $.L.str, (%esp)
>     calll    __isoc99_scanf
>     movl    24(%esp), %ecx
>     movl    %ecx, %eax
>
>
> *notl    %eax    andl    20(%esp), %eax    orl    %ecx, %eax*
>     addl    $28, %esp
>     retl
>
> We lost the Optimization opportunity here !!
>
> *GCC output for the same:*
>
> suyog at suyog-Inspiron-N5010:~$ gcc -S -O2 1.c
>
> main:
> .LFB23:
>     .cfi_startproc
>     pushl    %ebp
>     .cfi_def_cfa_offset 8
>     .cfi_offset 5, -8
>     movl    %esp, %ebp
>     .cfi_def_cfa_register 5
>     andl    $-16, %esp
>     subl    $32, %esp
>     leal    28(%esp), %eax
>     movl    %eax, 8(%esp)
>     leal    24(%esp), %eax
>     movl    %eax, 4(%esp)
>     movl    $.LC0, (%esp)
>     call    __isoc99_scanf
>     movl    24(%esp), %eax
> *    orl    28(%esp), %eax*
>     leave
>     .cfi_restore 5
>     .cfi_def_cfa 4, 4
>     ret
>     .cfi_endproc
>
> GCC still optimized it !!
>
> Clearly evident that llvm is losing because of naive approach of pattern
> matching.
>
> How can we improve this so that LLVM also recognizes
> commutative/associative
> pattern  efficiently with single 'matchers' call? I am not having any idea
> so far.
> I will think more on it. Any help would be really helpful.
>
> Any suggestions/comments/rectification are most awaited :)
>

I'm actually sort of wondering if instead of hardcoding all these bitwise
simplifications, we can just use a general boolean minimization algorithm.

Is there a reason we don't? Maybe it's just compile time? But eventually
(maybe already?) adding enough special case combines will be slower than
using the general optimization.

-- Sean

>
> --
> With regards,
> Suyog Sarda
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140811/83867903/attachment.html>