<div dir="ltr">Thanks Hans.<div><br></div><div>I did look at LoopIdiomRecognize and also loop rerolling, but as you said, they only work on loops, and the control flow here makes things a bit complicated. I have test cases where this helps on X86 and AArch64. I will put up my patch for review so you can have a look. </div><div><br></div><div>Another thing I noticed is that there is no __builtin_memcmp in LLVM. Maybe that would be good to have as well.</div><div><br></div><div>Sirish</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, May 19, 2017 at 3:06 PM, Hans Wennborg <span dir="ltr"><<a href="mailto:hans@chromium.org" target="_blank">hans@chromium.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On Fri, May 19, 2017 at 12:46 PM, Sirish Pande via llvm-dev<br>

<<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> wrote:<br>

> Hi,<br>

><br>

> Look at the following C code sequence:<br>

><br>

> unsigned char mainGtU ( unsigned int i1,<br>

>                unsigned int i2,<br>

>                unsigned char* block)<br>

> {<br>

>    unsigned char c1, c2;<br>

>    c1 = block[i1]; c2 = block[i2];<br>

>    if (c1 != c2) return (c1 > c2);<br>

>    i1++; i2++;<br>

><br>

>    c1 = block[i1]; c2 = block[i2];<br>

>    if (c1 != c2) return (c1 > c2);<br>

>    i1++; i2++;<br>

><br>

> ..<br>

> ..<br>

> <repeat 12 times><br>

><br>

> In LLVM IR it looks like the following:<br>

><br>

> define i8 @mainGtU(i32 %i1, i32 %i2, i8* readonly %block, i16* nocapture<br>

> readnone %quadrant, i32 %nblock, i32* nocapture readnone %budget)<br>

> local_unnamed_addr #0 {<br>

> entry:<br>

>   %idxprom = zext i32 %i1 to i64<br>

>   %arrayidx = getelementptr inbounds i8, i8* %block, i64 %idxprom<br>

>   %0 = load i8, i8* %arrayidx, align 1<br>

>   %idxprom1 = zext i32 %i2 to i64<br>

>   %arrayidx2 = getelementptr inbounds i8, i8* %block, i64 %idxprom1<br>

>   %1 = load i8, i8* %arrayidx2, align 1<br>

>   %cmp = icmp eq i8 %0, %1<br>

>   br i1 %cmp, label %if.end, label %if.then<br>

><br>

> if.then:                                          ; preds = %entry<br>

>   %cmp7 = icmp ugt i8 %0, %1<br>

>   br label %return<br>

><br>

> if.end:                                           ; preds = %entry<br>

>   %inc = add i32 %i1, 1<br>

>   %inc10 = add i32 %i2, 1<br>

>   %idxprom11 = zext i32 %inc to i64<br>

>   %arrayidx12 = getelementptr inbounds i8, i8* %block, i64 %idxprom11<br>

>   %2 = load i8, i8* %arrayidx12, align 1<br>

>   %idxprom13 = zext i32 %inc10 to i64<br>

>   %arrayidx14 = getelementptr inbounds i8, i8* %block, i64 %idxprom13<br>

>   %3 = load i8, i8* %arrayidx14, align 1<br>

>   %cmp17 = icmp eq i8 %2, %3<br>

>   br i1 %cmp17, label %if.end25, label %if.then19<br>

><br>

> if.then19:                                        ; preds = %if.end<br>

>   %cmp22 = icmp ugt i8 %2, %3<br>

>   br label %return<br>

><br>

> ..<br>

> ..<br>

> <repeats 12 times><br>

><br>

> This code sequence can be collapsed into a call to memcmp, which lets us get rid<br>

> of the extra basic blocks. I have written a small peephole optimization that<br>

> identifies a sequence of instructions (branch terminator, compare, load, GEP,<br>

> etc.) and converts it to a call to memcmp. This small pass gave me an<br>

> improvement of 67% on SPEC2000 bzip2 on X86.<br>

><br>

> Is there a better idea than a small peephole pass on IR to optimize<br>

> this code?<br>

<br>

</div></div>There is LoopIdiomRecognize which does transformations like this, but<br>

only for loops, not unrolled code like your example.<br>

<br>

It would be very cool if we could somehow make that pass also<br>

recognize unrolled patterns, both for memcmp, and other operations.<br>

<br>

I don't have any specific ideas for how to do that, but the<br>

improvement you saw suggests it might be very worthwhile :-)<br>

</blockquote></div><br></div>
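<div>For readers of this thread, the rewrite being discussed can be sketched at the C level roughly as follows. This is only an illustration, not the proposed patch: the function names, the <code>PREFIX_LEN</code> constant, and the <code>-1</code> "all bytes equal" sentinel are hypothetical, and the loop stands in for the 12 hand-unrolled compare steps of the quoted example. It also assumes that all 12 bytes at both offsets are readable and that the indices do not wrap around, since memcmp is free to read the whole range while the original code stops at the first mismatch.</div>

```c
#include <string.h>

/* Number of unrolled compare steps in the quoted example. */
enum { PREFIX_LEN = 12 };

/* Byte-by-byte form, equivalent to the unrolled compare chain.
 * Returns (c1 > c2) at the first mismatch, or -1 (hypothetical
 * sentinel) if all PREFIX_LEN bytes are equal. */
int prefix_gt_bytewise(unsigned int i1, unsigned int i2,
                       const unsigned char *block)
{
    for (int k = 0; k < PREFIX_LEN; k++) {
        unsigned char c1 = block[i1 + k];
        unsigned char c2 = block[i2 + k];
        if (c1 != c2)
            return c1 > c2;
    }
    return -1;
}

/* memcmp form: memcmp compares bytes as unsigned char, so the sign of
 * its result matches (c1 > c2) at the first differing byte. */
int prefix_gt_memcmp(unsigned int i1, unsigned int i2,
                     const unsigned char *block)
{
    int r = memcmp(block + i1, block + i2, PREFIX_LEN);
    if (r != 0)
        return r > 0;
    return -1;
}
```

<div>The two functions should agree whenever the stated assumptions hold. A pass performing this rewrite on IR would presumably need to establish those same conditions (dereferenceable ranges, no index wraparound) before replacing the compare chain.</div>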