[LLVMdev] Efficient Pattern matching in Instruction Combine
Sean Silva
chisophugis at gmail.com
Tue Aug 12 14:00:58 PDT 2014
Re-adding the mailing list (remember to hit "reply all")
On Tue, Aug 12, 2014 at 9:36 AM, suyog sarda <sardask01 at gmail.com> wrote:
> Thanks Sean for the reply.
>
>
> On Mon, Aug 11, 2014 at 11:49 PM, Sean Silva <chisophugis at gmail.com>
> wrote:
>
>>
>>
>>
>> On Fri, Aug 8, 2014 at 1:34 PM, suyog sarda <sardask01 at gmail.com> wrote:
>>
>>> Hi Duncan, David, Sean.
>>>
>>> Thanks for your reply.
>>>
>>>
>>> > It'd be interesting if you could find a design that also treated these
>>> > the same:
>>> >
>>> > (B ^ A) | ((A ^ B) ^ C) -> (A ^ B) | C
>>>
>>> > (B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C
>>> > (B ^ A) | ((C ^ A) ^ B) -> (A ^ B) | C
>>> >
>>> > I.e., `^` is also associative.
>>>
>>> Agree with Duncan on including associative operation too.
>>>
>>> > Can we handle this by just having a canonical ordering? Or is that too
>>> difficult to maintain through various instcombines?
>>>
>>> Yes, its the easiest way to do that. If i am not wrong, what Sean is
>>> suggesting is that if we get
>>>
>>> something like
>>>
>>> (B ^ A) | ((B ^ C) ^ A) -> (A ^ B) | C
>>>
>>> and we have written pass for pattern
>>>
>>> (A ^ B) | ((B ^ C) ^ A) -> (A ^ B) | C
>>>
>>> we could just swap 'B' and 'A' before matching the pattern i.e. we do
>>> canonical re-ordering
>>> (Sean, Correct me if i am wrong on my understanding of what you meant by
>>> canonical ordering).
>>>
>>
>> You understood correctly.
>>
>>
>>>
>>> I have seen in the "instcombine" module that many times std::swap was
>>> used for the same.
>>>
>>> But it becomes too much repetitive and somewhat irritating to keep doing
>>> this for every pattern.
>>> Shouldn't we do this in more generic way where we do not have to worry
>>> about swapping operands?
>>> (How to do it - i don't know for now. I looked at the matchers
>>> implementation for modification/improvement,
>>> but didn't find any clue on it).
>>>
>>>
>>> > It seems to me that we could rejigger Reassociate to reduce the number
>>> of possibilities that InstCombine could expect.
>>>
>>>>
>>> Agree with David that infact it should be reassociate pass which should
>>> handle this.
>>>
>>> But following is something interesting and rather worrying things i have
>>> found :
>>> (I have omitted unimportant code and highlighted important one in
>>> following example)
>>> *e.x. *
>>>
>>> ((~A & B) | A) --> (A | B) ; Code is implemented for this already
>>>
>>> *C code :*
>>>
>>> suyog at suyog-Inspiron-N5010:~$ cat 1.c
>>> #include<stdio.h>
>>> int cal(int a, int b) {
>>> *return ((~a & b) | a);*
>>> }
>>>
>>> int main(){
>>> int a, b;
>>> scanf("%d %d", &a, &b);
>>> return cal(a,b);
>>> }
>>>
>>> LLVM IR at O0 :
>>>
>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O0 -emit-llvm
>>> 1.c
>>>
>>> ; Function Attrs: nounwind
>>>
>>>
>>>
>>>
>>> *define i32 @cal(i32 %a, i32 %b) #0 { %1 = xor i32 %a, -1 %2 = and i32
>>> %1, %b %3 = or i32 %2, %a ret i32 %3*
>>> }
>>>
>>> ; Function Attrs: nounwind
>>> define i32 @main() #0 {
>>> ..
>>> ..
>>> %4 = call i32 @cal(i32 %2, i32 %3)
>>> ret i32 %4
>>> }
>>>
>>> *LLVM IR at instcombine :*
>>>
>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/opt -S -instcombine 1.ll
>>>
>>>
>>>
>>>
>>>
>>> *; Function Attrs: nounwinddefine i32 @cal(i32 %a, i32 %b) #0 { %1 =
>>> or i32 %a, %b ret i32 %1}*
>>> ..
>>> ..
>>> ..
>>>
>>> which is OK as per expected transform.
>>>
>>>
>>>
>>> *LLVM IR at reassociate and instcombine : *suyog at suyog-Inspiron-N5010:~$
>>> Open/rbuild/bin/opt -S -reassociate -instcombine 2.ll
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *; Function Attrs: nounwinddefine i32 @cal(i32 %a, i32 %b) #0 { %1 =
>>> xor i32 %a, -1 %2 = and i32 %b, %1 %3 = or i32 %2, %a ret i32 %3}*
>>>
>>> ..
>>> ..
>>>
>>> Reassociate converted (~a&b) to (b&~a) and instruction combine failed to
>>> optimize.
>>>
>>> As evident, sometimes, reassociate pass actually harms the optimization
>>> opportunity !!
>>>
>>> Some more interesting findings on how can this affect the final machine
>>> code:
>>>
>>> *Test case :*
>>> 1.c :
>>>
>>> #include<stdio.h>
>>> int cal(int a, int b) {
>>> *return ((~a & b) | a);*
>>> }
>>>
>>> int main(){
>>> int a, b;
>>> scanf("%d %d", &a, &b);
>>> return cal(a,b);
>>> }
>>>
>>> *X86 .s file with clang at O2 for above program :*
>>>
>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c
>>>
>>> main: # @main
>>> # BB#0:
>>> subl $28, %esp
>>> leal 20(%esp), %eax
>>> movl %eax, 8(%esp)
>>> leal 24(%esp), %eax
>>> movl %eax, 4(%esp)
>>> movl $.L.str, (%esp)
>>> calll __isoc99_scanf
>>> movl 20(%esp), %eax
>>> * orl 24(%esp), %eax*
>>> addl $28, %esp
>>> retl
>>>
>>> As seen, optimization happened at IR level itself reflected in .s file.
>>>
>>> *GCC output for the same:*
>>>
>>> suyog at suyog-Inspiron-N5010:~$ gcc -S -O2 1.c
>>>
>>> main:
>>> .LFB23:
>>> .cfi_startproc
>>> pushl %ebp
>>> .cfi_def_cfa_offset 8
>>> .cfi_offset 5, -8
>>> movl %esp, %ebp
>>> .cfi_def_cfa_register 5
>>> andl $-16, %esp
>>> subl $32, %esp
>>> leal 28(%esp), %eax
>>> movl %eax, 8(%esp)
>>> leal 24(%esp), %eax
>>> movl %eax, 4(%esp)
>>> movl $.LC0, (%esp)
>>> call __isoc99_scanf
>>> movl 24(%esp), %eax
>>> * orl 28(%esp), %eax*
>>> leave
>>> .cfi_restore 5
>>> .cfi_def_cfa 4, 4
>>> ret
>>> .cfi_endproc
>>>
>>> GCC also did the optimization.
>>>
>>> Now we just *slightly flip* the test case :
>>>
>>>
>>>
>>> *1.c Test case:*#include<stdio.h>
>>> int cal(int a, int b) {
>>> *return ((b & ~a) | a);*
>>> }
>>>
>>> int main(){
>>> int a, b;
>>> scanf("%d %d", &a, &b);
>>> return cal(a,b);
>>> }
>>>
>>>
>>>
>>> *X86 .s file with clang at O2 for above program :*
>>> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c
>>>
>>> main: # @main
>>> # BB#0:
>>> subl $28, %esp
>>> leal 20(%esp), %eax
>>> movl %eax, 8(%esp)
>>> leal 24(%esp), %eax
>>> movl %eax, 4(%esp)
>>> movl $.L.str, (%esp)
>>> calll __isoc99_scanf
>>> movl 24(%esp), %ecx
>>> movl %ecx, %eax
>>>
>>>
>>> *notl %eax andl 20(%esp), %eax orl %ecx, %eax*
>>> addl $28, %esp
>>> retl
>>>
>>> We lost the Optimization opportunity here !!
>>>
>>> *GCC output for the same:*
>>>
>>> suyog at suyog-Inspiron-N5010:~$ gcc -S -O2 1.c
>>>
>>> main:
>>> .LFB23:
>>> .cfi_startproc
>>> pushl %ebp
>>> .cfi_def_cfa_offset 8
>>> .cfi_offset 5, -8
>>> movl %esp, %ebp
>>> .cfi_def_cfa_register 5
>>> andl $-16, %esp
>>> subl $32, %esp
>>> leal 28(%esp), %eax
>>> movl %eax, 8(%esp)
>>> leal 24(%esp), %eax
>>> movl %eax, 4(%esp)
>>> movl $.LC0, (%esp)
>>> call __isoc99_scanf
>>> movl 24(%esp), %eax
>>> * orl 28(%esp), %eax*
>>> leave
>>> .cfi_restore 5
>>> .cfi_def_cfa 4, 4
>>> ret
>>> .cfi_endproc
>>>
>>> GCC still optimized it !!
>>>
>>> Clearly evident that llvm is losing because of naive approach of pattern
>>> matching.
>>>
>>> How can we improve this so that LLVM also recognizes
>>> commutative/associative
>>> pattern efficiently with single 'matchers' call? I am not having any
>>> idea so far.
>>> I will think more on it. Any help would be really helpful.
>>>
>>> Any suggestions/comments/rectification are most awaited :)
>>>
>>
>>
>> I'm actually sort of wondering if instead of hardcoding all these bitwise
>> simplifications, we can just use a general boolean minimization algorithm.
>>
>> Is there a reason we don't? Maybe it's just compile time? But eventually
>> (maybe already?) adding enough special case combines will be slower than
>> using the general optimization.
>>
>
> Can you please elaborate more on "general boolean minimization algorithm"
> ? Is it a specific algorithm? I couldn't find any proper
> reference to it. If you can provide a reference for it, i will try to see
> if i can implement it.
>
> Can K-Map be an option to generalize the logic?
> (I could think of K-Map for now for simplification from my college days
> :))
>
Yeah, K-Maps are one way to do it, however K-Maps are really most useful
for when your target operations are all NAND/NOR, which isn't the case here
(and you can also do Reed-Muller for XOR/XNOR); instcombine has a very
different set of "primitive operations" and their relative costs (i.e. XOR,
AND, OR, and NOT are usually the base operations and cost the same amount
of resources to execute).
Since optimization of binary expressions is of such enormous importance to
the electronics industry, this topic is extremely well researched, so there
should be a lot to draw from. There may be stuff targeted specifically at
software compilers, but I haven't looked in depth (maybe try Muchnick
section 12.3 for references? I don't have the book handy).
For starters, you may want to check out Knuth volume 4A, section 7.1. I'm
working from memory here, but IIRC there is a whole part about
minimization/optimization of binary expressions. Knuth as usual provides
tons of references to the literature in case the text there is not enough.
You can see preprint copies of volume 4A here:
http://www.cs.utsa.edu/~wagner/knuth/ (although Knuth is the sort of book
that any programmer should have on their bookshelf; in fact, you can
probably justify it as a work expense).
>
> Also few more things i will like to understand.
> From above example i have shown,
>
> when we have test case as:
>
> #include<stdio.h>
> int cal(int a, int b) {
> *return ((~a & b) | a);*
> }
>
> int main(){
> int a, b;
> scanf("%d %d", &a, &b);
> return cal(a,b);
> }
>
> *X86 .s file with clang at O2 for above program :*
>
> main: # @main
> # BB#0:
> subl $28, %esp
> leal 20(%esp), %eax
> movl %eax, 8(%esp)
> leal 24(%esp), %eax
> movl %eax, 4(%esp)
> movl $.L.str, (%esp)
> calll __isoc99_scanf
> movl 20(%esp), %eax
> * orl 24(%esp), %eax*
> addl $28, %esp
> retl
>
> As seen, we optimize the pattern.
>
> but when we have test case as :
>
> #include<stdio.h>
> int cal(int a, int b) {
> *return ((b & ~a) | a);*
> }
>
> int main(){
> int a, b;
> scanf("%d %d", &a, &b);
> return cal(a,b);
> }
>
>
>
> *X86 .s file with clang at O2 for above program :*
> suyog at suyog-Inspiron-N5010:~$ Open/rbuild/bin/clang -S -O2 1.c
>
> main: # @main
> # BB#0:
> subl $28, %esp
> leal 20(%esp), %eax
> movl %eax, 8(%esp)
> leal 24(%esp), %eax
> movl %eax, 4(%esp)
> movl $.L.str, (%esp)
> calll __isoc99_scanf
> movl 24(%esp), %ecx
> movl %ecx, %eax
>
>
> *notl %eax andl 20(%esp), %eax orl %ecx, %eax*
> addl $28, %esp
> retl
>
> we fail to optimize.
>
> Does, 'reassociate' pass always run before 'instcombine' ?
> Theoretically, it should in my opinion. But from the output in first case,
> it seems it doesn't.
> If 'reassociate' would have run before 'instcombine', it would have
> converted (~a&b) to (b&~a)
> and the pattern would have not been optimized.
>
> Is there any logic which determines when to run reassociate and when not
> to run?
>
I don't know off the top of my head. Try running opt with
-debug-pass=Arguments to see what the order is (or I think you should be
able to pass -debug-pass=Arguments to clang's -mllvm option).
-- Sean Silva
>
> Duncan, David, Sean - any comment on this are most welcomed :)
>
> Regards,
> Suyog
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140812/41be749b/attachment.html>
More information about the llvm-dev
mailing list