<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Aug 2, 2017 at 3:36 PM, Matthias Braun <span dir="ltr"><<a href="mailto:mbraun@apple.com" target="_blank">mbraun@apple.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">So to write this in a more condensed form, you have:<br>

<br>

%v0 = ...<br>

%v1 = and %v0, 255<br>

%v2 = and %v1, 31<br>

use %v1<br>

use %v2<br>

<br>

and transform this to<br>

%v0 = ...<br>

%v1 = and %v0, 255<br>

%v2 = and %v0, 31<br>

...<br>

<br>

This is a classical problem with instruction combining. When you replace a non-root node of a pattern (in this case %v2 would be the root pattern we match, but we actually replace %v1 with %v0 as part of the replacement) and that non-root node has multiple users then it becomes very hard to predict whether the transformations pays off in the end.<br>

<br>

In my experience it is best to restrict those sort of transformation to situations where you can prove there aren't multiple users on the intermediate/non-root node that we replace. (Or if you want to get fancier that all users<br>

will match the same pattern).<br></blockquote><div><br></div><div><br></div><div>I agree. The propagation happens in the hope that v1 has only one use so that '%v1 = ..' can be later eliminated, or that v1 has only two uses such that the propagation can shrink the live range of %v1, saving one move instruction for the other use of %v1.   In  reality, this is like flip a coin and likely to hurt performance on x86.</div><div><br></div><div>Avoid doing this one the fly, but in a separate pass might be the way to go.</div><div><br></div><div>David</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

- Matthias<br>

<div><div class="h5"><br>

> On Aug 2, 2017, at 3:00 PM, Wei Mi via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> wrote:<br>

><br>

> Hi,<br>

><br>

> We recently found a testcase showing that simplifications in<br>

> instcombine sometimes change the instruction without reducing the<br>

> instruction cost,  but causing problems in TwoAddressInstruction pass.<br>

> And it looks like the problem is generic and other simplification may<br>

> have the same issue. I want to get some ideas about what is the best<br>

> way to fix such kind of problem.<br>

><br>

> The testcase:<br>

> ----------------------------- a.ll ------------------------------<wbr>----<br>

> @a = global i64 0, align 8<br>

> @b = global i32 0, align 8<br>

><br>

> define i32 @_<wbr>Z25InternalUncompressAllTagsv(<wbr>i16* %arrayidx) local_unnamed_addr {<br>

> entry:<br>

>  %t1 = load i16, i16* %arrayidx, align 2<br>

>  %conv.i = zext i16 %t1 to i32<br>

>  %and.i = and i32 %conv.i, 255<br>

>  %and7.i = and i32 %conv.i, 1792<br>

>  %t3 = zext i32 %and7.i to i64<br>

>  %t4 = load i64, i64* @a, align 8<br>

>  %add.i = add i64 %t4, %t3<br>

>  %cmp1 = icmp eq i64 %add.i, 1<br>

>  %cmp2 = icmp ult i32 %and.i, 101<br>

>  %bool = and i1 %cmp1, %cmp2<br>

>  br i1 %bool, label %if.then, label %if.else, !prof !0<br>

><br>

> if.then:                                          ; preds = %entry<br>

>  %r1 = trunc i64 %add.i to i32<br>

>  br label %return<br>

><br>

> if.else:                                          ; preds = %entry<br>

>  %r2 = and i32 %and.i, 31<br>

>  store i32 %and.i, i32* @b, align 8<br>

>  br label %return<br>

><br>

> return:                                           ; preds = %if.else, %if.then<br>

>  %ret = phi i32 [ %r1, %if.then ], [ %r2, %if.else ]<br>

>  ret i32 %ret<br>

> }<br>

><br>

> !0 = !{!"branch_weights", i32 2000, i32 1}<br>

> ------------------------------<wbr>------------------------------<wbr>-------------<br>

><br>

> For the snippet:<br>

> %and.i = and i32 %conv.i, 255<br>

>  ...<br>

> %r2 = and i32 %and.i, 31<br>

><br>

> Look at %r2 in block %if.else, it is computed by two "and" operations.<br>

> Both InstCombiner::<wbr>SimplifyAssociativeOrCommutati<wbr>ve and<br>

> InstCombiner::<wbr>SimplifyDemandedUseBits can replace %r2 = and i32<br>

> %and.i, 31 with %r2 = and i32 %conv.i, 31. Because %and.i has many<br>

> other uses and it won't become dead after the simplifications, those<br>

> simplifications won't simplify instruction cost, but they will change<br>

> the live range of variables:<br>

> Before instcombine, %conv.i is last used at %and7.i = and i32 %conv.i,<br>

> 1792, so on architecture like x86, llvm won't generate extra mov in<br>

> TwoAddressInstruction pass for it. After instcombine, %conv.i's live<br>

> range is extended, both %and7.i and %conv.i are alive after the "and"<br>

> instruction, so TwoAddressInstruction pass will generate an extra mov<br>

> for %and7.i = and i32 %conv.i.<br>

><br>

> *** After instcombine: ***<br>

> ~/workarea/llvm-r309240/<wbr>rbuild1/bin/opt -instcombine -S < a.ll > b.ll; cat b.ll<br>

> @a = global i64 0, align 8<br>

> @b = global i32 0, align 8<br>

><br>

> define i32 @_<wbr>Z25InternalUncompressAllTagsv(<wbr>i16* %arrayidx) local_unnamed_addr {<br>

> entry:<br>

>  %t1 = load i16, i16* %arrayidx, align 2<br>

>  %conv.i = zext i16 %t1 to i32<br>

>  %and.i = and i32 %conv.i, 255<br>

>  %and7.i = and i32 %conv.i, 1792              ============> %conv.i<br>

> is alive since its live range is extended.<br>

>  %t3 = zext i32 %and7.i to i64<br>

>         We need a extra mov for this two address instruction on x86<br>

>  %t4 = load i64, i64* @a, align 8<br>

>  %add.i = add i64 %t4, %t3<br>

>  %cmp1 = icmp eq i64 %add.i, 1<br>

>  %cmp2 = icmp ult i32 %and.i, 101<br>

>  %bool = and i1 %cmp1, %cmp2<br>

>  br i1 %bool, label %if.then, label %if.else<br>

><br>

> if.then:                                          ; preds = %entry<br>

>  %r1 = trunc i64 %add.i to i32<br>

>  br label %return<br>

><br>

> if.else:                                          ; preds = %entry<br>

>  %r2 = and i32 %conv.i, 31<br>

> ===============> %and.i is replaced by %conv.i.<br>

>  store i32 %and.i, i32* @b, align 8<br>

>  br label %return<br>

><br>

> return:                                           ; preds = %if.else, %if.then<br>

>  %ret = phi i32 [ %r1, %if.then ], [ %r2, %if.else ]<br>

>  ret i32 %ret<br>

> }<br>

><br>

> *** asm code without instcombine: ***<br>

> ~/workarea/llvm-r309240/<wbr>rbuild1/bin/llc < a.ll<br>

> # BB#0:                                 # %entry<br>

>        movzwl  (%rdi), %ecx<br>

>        movzbl  %cl, %eax<br>

>        andl    $1792, %ecx             # imm = 0x700<br>

>        addq    a(%rip), %rcx<br>

>        cmpq    $1, %rcx<br>

>        jne     .LBB0_3<br>

><br>

> *** asm code with instcombine: ***<br>

> ~/workarea/llvm-r309240/<wbr>rbuild1/bin/llc < b.ll<br>

> # BB#0:                                 # %entry<br>

>        movzwl  (%rdi), %eax<br>

>        movzbl  %al, %ecx<br>

>        movl    %eax, %edx                =====>  One extra move<br>

> instruction in the hot path<br>

>        andl    $1792, %edx             # imm = 0x700<br>

>        addq    a(%rip), %rdx<br>

>        cmpq    $1, %rdx<br>

>        jne     .LBB0_3<br>

><br>

> I am not sure if it is possible to tackle the problem in<br>

> simplification or instcombine directly, so I explored the possibility<br>

> to do the reversal transformation in CodeGen, but I found that not<br>

> quite easy either because of two things:<br>

> * Seems to me Peephole pass or TwoAddrInstruction pass are the proper<br>

> places to do the reversal transformation. We need Live Interval<br>

> information to justify the transformation, however currently live<br>

> Interval information is not ready in both places.<br>

><br>

> * The pattern matching looks quite ad hoc on machine IR. I need to<br>

> figure out we can replace %vreg0 in "AND32ri8 %vreg0<tied0>, 31" with<br>

> %vreg1 by looking at the copy chain starting from %vreg9<def> = COPY<br>

> %vreg0 to %vreg1<def> = MOVZX32rr8 %vreg9 first, and at the same time,<br>

> after replacing vreg0 with %vreg1, vreg0 becomes dead at the other<br>

> AND32ri and we can save an instruction there. In addition, replace<br>

> %vreg0 with %vreg1 may increase an extra move before "AND32ri8<br>

> %vreg0<tied0>, 31", so we still need to check "AND32ri8 %vreg0<tied0>,<br>

> 31" is colder than "AND32ri %vreg0<tied0>, 1792".<br>

>  All these efforts are just handling a specific pattern, and if the<br>

> pattern changes a little bit, they won't work.<br>

> BB_i:<br>

>        %vreg9<def> = COPY %vreg0:sub_8bit; GR8:%vreg9 GR32:%vreg0<br>

>        %vreg1<def> = MOVZX32rr8 %vreg9<kill>; GR32:%vreg1 GR8:%vreg9<br>

>        %vreg10<def,tied1> = AND32ri %vreg0<tied0>, 1792,<br>

> %EFLAGS<imp-def,dead>; GR32:%vreg10,%vreg0<br>

> ...<br>

> BB_j:<br>

>        %vreg4<def,tied1> = AND32ri8 %vreg0<tied0>, 31,<br>

> %EFLAGS<imp-def,dead>; GR32:%vreg4,%vreg0<br>

>        MOV32mr %RIP, 1, %noreg, <ga:@b>, %noreg, %vreg1;<br>

> mem:ST4[@b](align=8) GR32:%vreg1<br>

><br>

> Any suggestions are welcomed.<br>

><br>

> Thanks,<br>

> Wei.<br>

</div></div>> ______________________________<wbr>_________________<br>

> LLVM Developers mailing list<br>

> <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>

> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/<wbr>mailman/listinfo/llvm-dev</a><br>

<br>

</blockquote></div><br></div></div>