[llvm-dev] [InstCombine] Simplification sometimes only transforms but doesn't simplify instruction, causing side effect in other pass
Xinliang David Li via llvm-dev
llvm-dev at lists.llvm.org
Wed Aug 2 16:36:23 PDT 2017
On Wed, Aug 2, 2017 at 3:36 PM, Matthias Braun <mbraun at apple.com> wrote:
> So to write this in a more condensed form, you have:
>
> %v0 = ...
> %v1 = and %v0, 255
> %v2 = and %v1, 31
> use %v1
> use %v2
>
> and transform this to
> %v0 = ...
> %v1 = and %v0, 255
> %v2 = and %v0, 31
> ...
>
> This is a classical problem with instruction combining. When you replace a
> non-root node of a pattern (in this case %v2 would be the root of the pattern
> we match, but we actually replace %v1 with %v0 as part of the replacement) and
> that non-root node has multiple users, it becomes very hard to predict
> whether the transformation pays off in the end.
>
> In my experience it is best to restrict this sort of transformation to
> situations where you can prove there aren't multiple users of the
> intermediate/non-root node that we replace. (Or, if you want to get fancier,
> that all users will match the same pattern.)
>
I agree. The propagation happens in the hope that %v1 has only one use, so
that '%v1 = ...' can later be eliminated, or that %v1 has only two uses, so
that the propagation can shrink the live range of %v1 and save one move
instruction for the other use of %v1. In reality this is like flipping a
coin and is likely to hurt performance on x86.
Avoiding this on the fly and instead doing it in a separate pass might be the
way to go.
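For illustration, a rough sketch of the one-use restriction Matthias
describes, using the m_OneUse matcher from llvm/IR/PatternMatch.h (the fold
shown and the helper name are made up for this example; it is not the
in-tree InstCombine code):

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/PatternMatch.h"

using namespace llvm;
using namespace llvm::PatternMatch;

// Fold (X & C1) & C2 --> X & (C1 & C2), but only when the inner 'and' has
// a single use, so the intermediate value really dies and we do not merely
// extend the live range of X.  Returns the replacement value or nullptr.
static Value *foldNestedAndOneUse(Instruction &I, IRBuilder<> &Builder) {
  Value *X;
  ConstantInt *C1, *C2;
  if (!match(&I, m_And(m_OneUse(m_And(m_Value(X), m_ConstantInt(C1))),
                       m_ConstantInt(C2))))
    return nullptr;  // The inner 'and' has other users: bail out.
  return Builder.CreateAnd(X, C1->getValue() & C2->getValue());
}

With this guard, the case in the testcase below (where the inner 'and' has
several other users) would simply be left alone.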
David
>
> - Matthias
>
> > On Aug 2, 2017, at 3:00 PM, Wei Mi via llvm-dev <llvm-dev at lists.llvm.org>
> wrote:
> >
> > Hi,
> >
> > We recently found a testcase showing that simplifications in
> > instcombine sometimes change an instruction without reducing the
> > instruction cost, causing problems in the TwoAddressInstruction pass.
> > The problem looks generic, and other simplifications may have the same
> > issue. I want to get some ideas about the best way to fix this kind of
> > problem.
> >
> > The testcase:
> > ----------------------------- a.ll ----------------------------------
> > @a = global i64 0, align 8
> > @b = global i32 0, align 8
> >
> > define i32 @_Z25InternalUncompressAllTagsv(i16* %arrayidx) local_unnamed_addr {
> > entry:
> > %t1 = load i16, i16* %arrayidx, align 2
> > %conv.i = zext i16 %t1 to i32
> > %and.i = and i32 %conv.i, 255
> > %and7.i = and i32 %conv.i, 1792
> > %t3 = zext i32 %and7.i to i64
> > %t4 = load i64, i64* @a, align 8
> > %add.i = add i64 %t4, %t3
> > %cmp1 = icmp eq i64 %add.i, 1
> > %cmp2 = icmp ult i32 %and.i, 101
> > %bool = and i1 %cmp1, %cmp2
> > br i1 %bool, label %if.then, label %if.else, !prof !0
> >
> > if.then: ; preds = %entry
> > %r1 = trunc i64 %add.i to i32
> > br label %return
> >
> > if.else: ; preds = %entry
> > %r2 = and i32 %and.i, 31
> > store i32 %and.i, i32* @b, align 8
> > br label %return
> >
> > return:                                          ; preds = %if.else, %if.then
> > %ret = phi i32 [ %r1, %if.then ], [ %r2, %if.else ]
> > ret i32 %ret
> > }
> >
> > !0 = !{!"branch_weights", i32 2000, i32 1}
> > ---------------------------------------------------------------------------
> >
> > For the snippet:
> > %and.i = and i32 %conv.i, 255
> > ...
> > %r2 = and i32 %and.i, 31
> >
> > Look at %r2 in block %if.else: it is computed by two "and" operations.
> > Both InstCombiner::SimplifyAssociativeOrCommutative and
> > InstCombiner::SimplifyDemandedUseBits can replace %r2 = and i32
> > %and.i, 31 with %r2 = and i32 %conv.i, 31. Because %and.i has many
> > other uses and won't become dead after the simplifications, those
> > simplifications don't reduce the instruction cost, but they do change
> > the live ranges of variables:
> > Before instcombine, %conv.i is last used at %and7.i = and i32 %conv.i,
> > 1792, so on architectures like x86, llvm won't generate an extra mov
> > for it in the TwoAddressInstruction pass. After instcombine, %conv.i's
> > live range is extended: both %and7.i and %conv.i are live after that
> > "and" instruction, so the TwoAddressInstruction pass will generate an
> > extra mov for %and7.i = and i32 %conv.i, 1792.
> >
> > *** After instcombine: ***
> > ~/workarea/llvm-r309240/rbuild1/bin/opt -instcombine -S < a.ll > b.ll; cat b.ll
> > @a = global i64 0, align 8
> > @b = global i32 0, align 8
> >
> > define i32 @_Z25InternalUncompressAllTagsv(i16* %arrayidx) local_unnamed_addr {
> > entry:
> > %t1 = load i16, i16* %arrayidx, align 2
> > %conv.i = zext i16 %t1 to i32
> > %and.i = and i32 %conv.i, 255
> > %and7.i = and i32 %conv.i, 1792   ============> %conv.i is alive since its live range is extended;
> >                                                 we need an extra mov for this two-address instruction on x86.
> > %t3 = zext i32 %and7.i to i64
> > %t4 = load i64, i64* @a, align 8
> > %add.i = add i64 %t4, %t3
> > %cmp1 = icmp eq i64 %add.i, 1
> > %cmp2 = icmp ult i32 %and.i, 101
> > %bool = and i1 %cmp1, %cmp2
> > br i1 %bool, label %if.then, label %if.else
> >
> > if.then: ; preds = %entry
> > %r1 = trunc i64 %add.i to i32
> > br label %return
> >
> > if.else: ; preds = %entry
> > %r2 = and i32 %conv.i, 31          ===============> %and.i is replaced by %conv.i.
> > store i32 %and.i, i32* @b, align 8
> > br label %return
> >
> > return:                                          ; preds = %if.else, %if.then
> > %ret = phi i32 [ %r1, %if.then ], [ %r2, %if.else ]
> > ret i32 %ret
> > }
> >
> > *** asm code without instcombine: ***
> > ~/workarea/llvm-r309240/rbuild1/bin/llc < a.ll
> > # BB#0: # %entry
> > movzwl (%rdi), %ecx
> > movzbl %cl, %eax
> > andl $1792, %ecx # imm = 0x700
> > addq a(%rip), %rcx
> > cmpq $1, %rcx
> > jne .LBB0_3
> >
> > *** asm code with instcombine: ***
> > ~/workarea/llvm-r309240/rbuild1/bin/llc < b.ll
> > # BB#0: # %entry
> > movzwl (%rdi), %eax
> > movzbl %al, %ecx
> > movl %eax, %edx                =====> one extra move instruction in the hot path
> > andl $1792, %edx # imm = 0x700
> > addq a(%rip), %rdx
> > cmpq $1, %rdx
> > jne .LBB0_3
> >
> > I am not sure whether it is possible to tackle the problem in the
> > simplifications or in instcombine directly, so I explored the possibility
> > of doing the reverse transformation in CodeGen, but I found that is not
> > quite easy either, for two reasons:
> > * It seems to me the Peephole pass or the TwoAddressInstruction pass are
> > the proper places to do the reverse transformation. We need LiveInterval
> > information to justify the transformation; however, LiveInterval
> > information is currently not available in either place (a rough sketch of
> > the kind of query needed is shown after the machine IR below).
> >
> > * The pattern matching looks quite ad hoc on machine IR. I first need to
> > figure out that we can replace %vreg0 in "AND32ri8 %vreg0<tied0>, 31" with
> > %vreg1, by looking at the copy chain from %vreg9<def> = COPY %vreg0 to
> > %vreg1<def> = MOVZX32rr8 %vreg9, and also that, after replacing %vreg0
> > with %vreg1, %vreg0 becomes dead at the other AND32ri, so we can save an
> > instruction there. In addition, replacing %vreg0 with %vreg1 may introduce
> > an extra move before "AND32ri8 %vreg0<tied0>, 31", so we still need to
> > check that "AND32ri8 %vreg0<tied0>, 31" is colder than "AND32ri
> > %vreg0<tied0>, 1792".
> > All these efforts handle just one specific pattern, and if the pattern
> > changes a little bit, they won't work.
> > BB_i:
> > %vreg9<def> = COPY %vreg0:sub_8bit; GR8:%vreg9 GR32:%vreg0
> > %vreg1<def> = MOVZX32rr8 %vreg9<kill>; GR32:%vreg1 GR8:%vreg9
> > %vreg10<def,tied1> = AND32ri %vreg0<tied0>, 1792,
> > %EFLAGS<imp-def,dead>; GR32:%vreg10,%vreg0
> > ...
> > BB_j:
> > %vreg4<def,tied1> = AND32ri8 %vreg0<tied0>, 31,
> > %EFLAGS<imp-def,dead>; GR32:%vreg4,%vreg0
> > MOV32mr %RIP, 1, %noreg, <ga:@b>, %noreg, %vreg1;
> > mem:ST4[@b](align=8) GR32:%vreg1
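> >
> > As a rough illustration, the kind of LiveIntervals query such a check
> > would need (just a sketch; the helper name is made up, and only the
> > LiveIntervals/SlotIndex calls are existing API) could look like:
> >
> > #include "llvm/CodeGen/LiveIntervalAnalysis.h"
> > #include "llvm/CodeGen/SlotIndexes.h"
> > using namespace llvm;
> >
> > // Returns true if OldReg is not live past MI, i.e. MI contains the last
> > // use of OldReg, so rewriting that use shrinks OldReg's live range.
> > static bool isLastUse(const MachineInstr &MI, unsigned OldReg,
> >                       const LiveIntervals &LIS) {
> >   SlotIndex Idx = LIS.getInstructionIndex(MI);
> >   return !LIS.getInterval(OldReg).liveAt(Idx.getDeadSlot());
> > }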
> >
> > Any suggestions are welcome.
> >
> > Thanks,
> > Wei.