[llvm-dev] [InstCombine] Simplification sometimes only transforms but doesn't simplify an instruction, causing side effects in another pass

Wei Mi via llvm-dev llvm-dev at lists.llvm.org
Wed Aug 2 16:59:40 PDT 2017


On Wed, Aug 2, 2017 at 4:07 PM, Matthias Braun <mbraun at apple.com> wrote:
>
> On Aug 2, 2017, at 4:00 PM, Wei Mi <wmi at google.com> wrote:
>
> On Wed, Aug 2, 2017 at 3:36 PM, Matthias Braun <mbraun at apple.com> wrote:
>
> So to write this in a more condensed form, you have:
>
> %v0 = ...
> %v1 = and %v0, 255
> %v2 = and %v1, 31
> use %v1
> use %v2
>
> and transform this to
> %v0 = ...
> %v1 = and %v0, 255
> %v2 = and %v0, 31
> ...
>
> This is a classical problem with instruction combining. When you
> replace a non-root node of a pattern (in this case %v2 is the root of
> the pattern we match, but we actually replace %v1 with %v0 as part of
> the replacement) and that non-root node has multiple users, it becomes
> very hard to predict whether the transformation pays off in the end.
>
> In my experience it is best to restrict that sort of transformation to
> situations where you can prove there are no other users of the
> intermediate/non-root node being replaced (or, if you want to get
> fancier, that all users will match the same pattern).
>
> - Matthias
>
>
> Thanks Matthias. Yes, the transformation in instcombine replaces a
> non-root node of the pattern. My worry about adding the restriction
> that %v1 must have only one use is that such a change may regress some
> benchmarks. For the same case split across two blocks (shown below),
> adding the limitation saves a move in BB1 but may add another move in
> BB2, because %v2 and %v1 are both live after %v2 = and %v1, 31. Which
> choice is better depends on the relative hotness of BB1 and BB2, so
> the result of this transformation in instcombine is like flipping a
> coin. Currently we don't have a pass that uses BFI to drive such a
> transformation in a meaningful way (see the sketch after the BB1/BB2
> example below). Maybe we need such a pass; with it, we could disable
> these transformations in instcombine without regressing certain
> benchmarks.
>
> Yes, this strategy will regress some situations, and it also goes
> against the LLVM spirit of performing as much normalization as
> possible in the middle end.
>
> On the other hand, eagerly transforming all such opportunities has
> more potential to regress things than to improve them in my
> experience, especially since we currently have no way of reverting
> this transformation. So at the very least, changing the strategy would
> be a very interesting experiment; in another compiler I worked on, we
> saw a huge number of improvements and nearly no regressions when we
> switched to this strategy of not transforming in the presence of
> multiple users on a non-root node.
>
> - Matthias

Thanks for sharing the experience! I will try the same and see how the
performance looks.
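
To make the experiment concrete, the guard I have in mind would look
roughly like the following. This is only a sketch using LLVM's
PatternMatch idioms; foldDoubleAnd is a made-up name, not the actual
instcombine code:

  // Fold (and (and X, C1), C2) into (and X, C1 & C2) only when the
  // inner 'and' has no other users, so the fold removes an instruction
  // instead of merely extending X's live range.
  #include "llvm/IR/Constants.h"
  #include "llvm/IR/InstrTypes.h"
  #include "llvm/IR/PatternMatch.h"
  using namespace llvm;
  using namespace llvm::PatternMatch;

  static Instruction *foldDoubleAnd(BinaryOperator &I) {
    Value *X;
    const APInt *C1, *C2;
    // m_OneUse rejects exactly the problematic case: an inner 'and'
    // with users besides this instruction.
    if (match(&I, m_And(m_OneUse(m_And(m_Value(X), m_APInt(C1))),
                        m_APInt(C2))))
      return BinaryOperator::CreateAnd(
          X, ConstantInt::get(I.getType(), *C1 & *C2));
    return nullptr; // leave the instruction alone
  }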

Wei.

>
>
> BB1:
>
> %v0 = ...
> %v1 = and %v0, 255
>
>
> BB2:
>
> %v2 = and %v1, 31
> use %v1
> use %v2
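>
> A BFI-driven profitability check for this case could in principle look
> like the following. This is just a sketch with a hypothetical helper
> (rewriteIsProfitable); no such pass exists today:
>
>   // Rewriting %v2 = and %v1, 31 to use %v0 keeps %v0 live into BB2
>   // and costs an extra copy in BB1 (both %v0 and %v1 are live there),
>   // but it saves the tied-operand copy in BB2, where %v1 would
>   // otherwise stay live past %v2. So it pays off only if BB2 is
>   // hotter than BB1.
>   #include "llvm/Analysis/BlockFrequencyInfo.h"
>   #include "llvm/IR/BasicBlock.h"
>   using namespace llvm;
>
>   static bool rewriteIsProfitable(BlockFrequencyInfo &BFI,
>                                   BasicBlock *BB1, BasicBlock *BB2) {
>     return BFI.getBlockFreq(BB2) > BFI.getBlockFreq(BB1);
>   }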
>
>
> Wei.
>
>
> On Aug 2, 2017, at 3:00 PM, Wei Mi via llvm-dev <llvm-dev at lists.llvm.org>
> wrote:
>
> Hi,
>
> We recently found a testcase showing that simplifications in
> instcombine sometimes change an instruction without reducing its cost,
> while causing problems in the TwoAddressInstruction pass. The problem
> looks generic, and other simplifications may have the same issue. I
> want to get some ideas about the best way to fix this kind of problem.
>
> The testcase:
> ----------------------------- a.ll ----------------------------------
> @a = global i64 0, align 8
> @b = global i32 0, align 8
>
> define i32 @_Z25InternalUncompressAllTagsv(i16* %arrayidx)
> local_unnamed_addr {
> entry:
> %t1 = load i16, i16* %arrayidx, align 2
> %conv.i = zext i16 %t1 to i32
> %and.i = and i32 %conv.i, 255
> %and7.i = and i32 %conv.i, 1792
> %t3 = zext i32 %and7.i to i64
> %t4 = load i64, i64* @a, align 8
> %add.i = add i64 %t4, %t3
> %cmp1 = icmp eq i64 %add.i, 1
> %cmp2 = icmp ult i32 %and.i, 101
> %bool = and i1 %cmp1, %cmp2
> br i1 %bool, label %if.then, label %if.else, !prof !0
>
> if.then:                                          ; preds = %entry
> %r1 = trunc i64 %add.i to i32
> br label %return
>
> if.else:                                          ; preds = %entry
> %r2 = and i32 %and.i, 31
> store i32 %and.i, i32* @b, align 8
> br label %return
>
> return:                                           ; preds = %if.else, %if.then
> %ret = phi i32 [ %r1, %if.then ], [ %r2, %if.else ]
> ret i32 %ret
> }
>
> !0 = !{!"branch_weights", i32 2000, i32 1}
> -------------------------------------------------------------------------
>
> For the snippet:
> %and.i = and i32 %conv.i, 255
> ...
> %r2 = and i32 %and.i, 31
>
> Look at %r2 in block %if.else: it is computed by two "and" operations.
> Both InstCombiner::SimplifyAssociativeOrCommutative and
> InstCombiner::SimplifyDemandedUseBits can replace %r2 = and i32
> %and.i, 31 with %r2 = and i32 %conv.i, 31. Because %and.i has other
> uses and won't become dead after the simplification, the rewrite
> doesn't reduce the instruction cost, but it does change the live
> ranges of the variables: before instcombine, %conv.i is last used at
> %and7.i = and i32 %conv.i, 1792, so on an architecture like x86, llvm
> won't generate an extra mov for it in the TwoAddressInstruction pass.
> After instcombine, %conv.i's live range is extended; both %and7.i and
> %conv.i are live after the "and" instruction, so the
> TwoAddressInstruction pass will generate an extra mov for %and7.i =
> and i32 %conv.i, 1792.
>
> *** After instcombine: ***
> ~/workarea/llvm-r309240/rbuild1/bin/opt -instcombine -S < a.ll > b.ll; cat b.ll
> @a = global i64 0, align 8
> @b = global i32 0, align 8
>
> define i32 @_Z25InternalUncompressAllTagsv(i16* %arrayidx)
> local_unnamed_addr {
> entry:
> %t1 = load i16, i16* %arrayidx, align 2
> %conv.i = zext i16 %t1 to i32
> %and.i = and i32 %conv.i, 255
> %and7.i = and i32 %conv.i, 1792    ==========> %conv.i is still live
> here because its live range has been extended; we need an extra mov
> for this two-address instruction on x86.
> %t3 = zext i32 %and7.i to i64
> %t4 = load i64, i64* @a, align 8
> %add.i = add i64 %t4, %t3
> %cmp1 = icmp eq i64 %add.i, 1
> %cmp2 = icmp ult i32 %and.i, 101
> %bool = and i1 %cmp1, %cmp2
> br i1 %bool, label %if.then, label %if.else
>
> if.then:                                          ; preds = %entry
> %r1 = trunc i64 %add.i to i32
> br label %return
>
> if.else:                                          ; preds = %entry
> %r2 = and i32 %conv.i, 31          ==========> %and.i is replaced by %conv.i.
> store i32 %and.i, i32* @b, align 8
> br label %return
>
> return:                                           ; preds = %if.else, %if.then
> %ret = phi i32 [ %r1, %if.then ], [ %r2, %if.else ]
> ret i32 %ret
> }
>
> *** asm code without instcombine: ***
> ~/workarea/llvm-r309240/rbuild1/bin/llc < a.ll
> # BB#0:                                 # %entry
>       movzwl  (%rdi), %ecx
>       movzbl  %cl, %eax
>       andl    $1792, %ecx             # imm = 0x700
>       addq    a(%rip), %rcx
>       cmpq    $1, %rcx
>       jne     .LBB0_3
>
> *** asm code with instcombine: ***
> ~/workarea/llvm-r309240/rbuild1/bin/llc < b.ll
> # BB#0:                                 # %entry
>       movzwl  (%rdi), %eax
>       movzbl  %al, %ecx
>       movl    %eax, %edx              =====> One extra move instruction in the hot path
>       andl    $1792, %edx             # imm = 0x700
>       addq    a(%rip), %rdx
>       cmpq    $1, %rdx
>       jne     .LBB0_3
>
> I am not sure whether it is possible to tackle the problem in
> simplification or instcombine directly, so I explored doing the
> reverse transformation in CodeGen, but I found that is not easy
> either, for two reasons:
> * It seems to me the Peephole pass or the TwoAddressInstruction pass
> would be the proper place for the reverse transformation. We need live
> interval information to justify the transformation, but live interval
> information is not currently available in either place (see the
> sketch after the machine IR below).
>
> * The pattern matching looks quite ad hoc on machine IR. I first need
> to figure out that %vreg0 in "AND32ri8 %vreg0<tied0>, 31" can be
> replaced with %vreg1, by following the copy chain from %vreg9<def> =
> COPY %vreg0 to %vreg1<def> = MOVZX32rr8 %vreg9, and that after the
> replacement %vreg0 becomes dead at the other AND32ri, so we save an
> instruction there. In addition, replacing %vreg0 with %vreg1 may
> introduce an extra move before "AND32ri8 %vreg0<tied0>, 31", so we
> still need to check that "AND32ri8 %vreg0<tied0>, 31" is colder than
> "AND32ri %vreg0<tied0>, 1792". All this effort handles only one
> specific pattern; if the pattern changes a little, it won't work.
> The machine IR in question:
> BB_i:
>       %vreg9<def> = COPY %vreg0:sub_8bit; GR8:%vreg9 GR32:%vreg0
>       %vreg1<def> = MOVZX32rr8 %vreg9<kill>; GR32:%vreg1 GR8:%vreg9
>       %vreg10<def,tied1> = AND32ri %vreg0<tied0>, 1792, %EFLAGS<imp-def,dead>; GR32:%vreg10,%vreg0
> ...
> BB_j:
>       %vreg4<def,tied1> = AND32ri8 %vreg0<tied0>, 31, %EFLAGS<imp-def,dead>; GR32:%vreg4,%vreg0
>       MOV32mr %RIP, 1, %noreg, <ga:@b>, %noreg, %vreg1; mem:ST4[@b](align=8) GR32:%vreg1
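>
> To make the first point concrete, the check such a pass would need
> might look like the following. This is only a sketch that assumes
> LiveIntervals is available (today it is not computed at those points),
> and srcDiesAt is a hypothetical helper:
>
>   // Replacing SrcReg in MI with another register pays off only if
>   // SrcReg's live range ends at MI: then one value dies there and the
>   // tied-operand copy disappears.
>   #include "llvm/CodeGen/LiveIntervalAnalysis.h"
>   #include "llvm/CodeGen/MachineInstr.h"
>   using namespace llvm;
>
>   static bool srcDiesAt(LiveIntervals &LIS, const MachineInstr &MI,
>                         unsigned SrcReg) {
>     const LiveInterval &LI = LIS.getInterval(SrcReg);
>     SlotIndex Idx = LIS.getInstructionIndex(MI);
>     // isKill(): SrcReg's live-in value is killed by this instruction.
>     return LI.Query(Idx).isKill();
>   }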
>
> Any suggestions are welcome.
>
> Thanks,
> Wei.