[llvm-dev] [InstCombine] Simplification sometimes only transforms but doesn't simplify instruction, causing side effect in other pass

Wed Aug 2 16:00:34 PDT 2017

On Wed, Aug 2, 2017 at 3:36 PM, Matthias Braun <mbraun at apple.com> wrote:
> So to write this in a more condensed form, you have:
>
> %v0 = ...
> %v1 = and %v0, 255
> %v2 = and %v1, 31
> use %v1
> use %v2
>
> and transform this to
> %v0 = ...
> %v1 = and %v0, 255
> %v2 = and %v0, 31
> ...
>
> This is a classical problem with instruction combining. When you replace a non-root node of a pattern (in this case %v2 would be the root pattern we match, but we actually replace %v1 with %v0 as part of the replacement) and that non-root node has multiple users then it becomes very hard to predict whether the transformations pays off in the end.
>
> In my experience it is best to restrict those sort of transformation to situations where you can prove there aren't multiple users on the intermediate/non-root node that we replace. (Or if you want to get fancier that all users
> will match the same pattern).
>
> - Matthias

Thanks Matthias, yes, the transformation in instcombine is of non-root
node pattern. What I am worried about to add the restriction that %v1
has to have only one use is that such change may regress some
benchmarks. For the same case, after we add the limitation, we will
save a move in BB1, but may add another move in BB2, because %v2 and
%v1 are both alive after %v2 = and %v1, 31.  Which choice is better
depends on the hotness of BB1 and BB2, so the result of such
transformation in instcombine is like flipping a coin.  Currently we
don't have a pass to use BFI to do such transformation in a meaningful
way. Maybe we need such a pass, and with that pass, we can disable the
transformations in instcombine without regressing certain benchmarks.

BB1:
> %v0 = ...
> %v1 = and %v0, 255

BB2:
> %v2 = and %v1, 31
> use %v1
> use %v2

Wei.

>
>> On Aug 2, 2017, at 3:00 PM, Wei Mi via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> Hi,
>>
>> We recently found a testcase showing that simplifications in
>> instcombine sometimes change the instruction without reducing the
>> instruction cost,  but causing problems in TwoAddressInstruction pass.
>> And it looks like the problem is generic and other simplification may
>> have the same issue. I want to get some ideas about what is the best
>> way to fix such kind of problem.
>>
>> The testcase:
>> ----------------------------- a.ll ----------------------------------
>> @a = global i64 0, align 8
>> @b = global i32 0, align 8
>>
>> define i32 @_Z25InternalUncompressAllTagsv(i16* %arrayidx) local_unnamed_addr {
>> entry:
>>  %t1 = load i16, i16* %arrayidx, align 2
>>  %conv.i = zext i16 %t1 to i32
>>  %and.i = and i32 %conv.i, 255
>>  %and7.i = and i32 %conv.i, 1792
>>  %t3 = zext i32 %and7.i to i64
>>  %t4 = load i64, i64* @a, align 8
>>  %add.i = add i64 %t4, %t3
>>  %cmp1 = icmp eq i64 %add.i, 1
>>  %cmp2 = icmp ult i32 %and.i, 101
>>  %bool = and i1 %cmp1, %cmp2
>>  br i1 %bool, label %if.then, label %if.else, !prof !0
>>
>> if.then:                                          ; preds = %entry
>>  %r1 = trunc i64 %add.i to i32
>>  br label %return
>>
>> if.else:                                          ; preds = %entry
>>  %r2 = and i32 %and.i, 31
>>  store i32 %and.i, i32* @b, align 8
>>  br label %return
>>
>> return:                                           ; preds = %if.else, %if.then
>>  %ret = phi i32 [ %r1, %if.then ], [ %r2, %if.else ]
>>  ret i32 %ret
>> }
>>
>> !0 = !{!"branch_weights", i32 2000, i32 1}
>> -------------------------------------------------------------------------
>>
>> For the snippet:
>> %and.i = and i32 %conv.i, 255
>>  ...
>> %r2 = and i32 %and.i, 31
>>
>> Look at %r2 in block %if.else, it is computed by two "and" operations.
>> Both InstCombiner::SimplifyAssociativeOrCommutative and
>> InstCombiner::SimplifyDemandedUseBits can replace %r2 = and i32
>> %and.i, 31 with %r2 = and i32 %conv.i, 31. Because %and.i has many
>> other uses and it won't become dead after the simplifications, those
>> simplifications won't simplify instruction cost, but they will change
>> the live range of variables:
>> Before instcombine, %conv.i is last used at %and7.i = and i32 %conv.i,
>> 1792, so on architecture like x86, llvm won't generate extra mov in
>> TwoAddressInstruction pass for it. After instcombine, %conv.i's live
>> range is extended, both %and7.i and %conv.i are alive after the "and"
>> instruction, so TwoAddressInstruction pass will generate an extra mov
>> for %and7.i = and i32 %conv.i.
>>
>> *** After instcombine: ***
>> ~/workarea/llvm-r309240/rbuild1/bin/opt -instcombine -S < a.ll > b.ll; cat b.ll
>> @a = global i64 0, align 8
>> @b = global i32 0, align 8
>>
>> define i32 @_Z25InternalUncompressAllTagsv(i16* %arrayidx) local_unnamed_addr {
>> entry:
>>  %t1 = load i16, i16* %arrayidx, align 2
>>  %conv.i = zext i16 %t1 to i32
>>  %and.i = and i32 %conv.i, 255
>>  %and7.i = and i32 %conv.i, 1792              ============> %conv.i
>> is alive since its live range is extended.
>>  %t3 = zext i32 %and7.i to i64
>>         We need a extra mov for this two address instruction on x86
>>  %t4 = load i64, i64* @a, align 8
>>  %add.i = add i64 %t4, %t3
>>  %cmp1 = icmp eq i64 %add.i, 1
>>  %cmp2 = icmp ult i32 %and.i, 101
>>  %bool = and i1 %cmp1, %cmp2
>>  br i1 %bool, label %if.then, label %if.else
>>
>> if.then:                                          ; preds = %entry
>>  %r1 = trunc i64 %add.i to i32
>>  br label %return
>>
>> if.else:                                          ; preds = %entry
>>  %r2 = and i32 %conv.i, 31
>> ===============> %and.i is replaced by %conv.i.
>>  store i32 %and.i, i32* @b, align 8
>>  br label %return
>>
>> return:                                           ; preds = %if.else, %if.then
>>  %ret = phi i32 [ %r1, %if.then ], [ %r2, %if.else ]
>>  ret i32 %ret
>> }
>>
>> *** asm code without instcombine: ***
>> ~/workarea/llvm-r309240/rbuild1/bin/llc < a.ll
>> # BB#0:                                 # %entry
>>        movzwl  (%rdi), %ecx
>>        movzbl  %cl, %eax
>>        andl    $1792, %ecx             # imm = 0x700
>>        addq    a(%rip), %rcx
>>        cmpq    $1, %rcx
>>        jne     .LBB0_3
>>
>> *** asm code with instcombine: ***
>> ~/workarea/llvm-r309240/rbuild1/bin/llc < b.ll
>> # BB#0:                                 # %entry
>>        movzwl  (%rdi), %eax
>>        movzbl  %al, %ecx
>>        movl    %eax, %edx                =====>  One extra move
>> instruction in the hot path
>>        andl    $1792, %edx             # imm = 0x700
>>        addq    a(%rip), %rdx
>>        cmpq    $1, %rdx
>>        jne     .LBB0_3
>>
>> I am not sure if it is possible to tackle the problem in
>> simplification or instcombine directly, so I explored the possibility
>> to do the reversal transformation in CodeGen, but I found that not
>> quite easy either because of two things:
>> * Seems to me Peephole pass or TwoAddrInstruction pass are the proper
>> places to do the reversal transformation. We need Live Interval
>> information to justify the transformation, however currently live
>> Interval information is not ready in both places.
>>
>> * The pattern matching looks quite ad hoc on machine IR. I need to
>> figure out we can replace %vreg0 in "AND32ri8 %vreg0<tied0>, 31" with
>> %vreg1 by looking at the copy chain starting from %vreg9<def> = COPY
>> %vreg0 to %vreg1<def> = MOVZX32rr8 %vreg9 first, and at the same time,
>> after replacing vreg0 with %vreg1, vreg0 becomes dead at the other
>> AND32ri and we can save an instruction there. In addition, replace
>> %vreg0 with %vreg1 may increase an extra move before "AND32ri8
>> %vreg0<tied0>, 31", so we still need to check "AND32ri8 %vreg0<tied0>,
>> 31" is colder than "AND32ri %vreg0<tied0>, 1792".
>>  All these efforts are just handling a specific pattern, and if the
>> pattern changes a little bit, they won't work.
>> BB_i:
>>        %vreg9<def> = COPY %vreg0:sub_8bit; GR8:%vreg9 GR32:%vreg0
>>        %vreg1<def> = MOVZX32rr8 %vreg9<kill>; GR32:%vreg1 GR8:%vreg9
>>        %vreg10<def,tied1> = AND32ri %vreg0<tied0>, 1792,
>> %EFLAGS<imp-def,dead>; GR32:%vreg10,%vreg0
>> ...
>> BB_j:
>>        %vreg4<def,tied1> = AND32ri8 %vreg0<tied0>, 31,
>> %EFLAGS<imp-def,dead>; GR32:%vreg4,%vreg0
>>        MOV32mr %RIP, 1, %noreg, <ga:@b>, %noreg, %vreg1;
>> mem:ST4[@b](align=8) GR32:%vreg1
>>
>> Any suggestions are welcomed.
>>
>> Thanks,
>> Wei.
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>