[PATCH] [LSR] Generate and use zero extends

Sanjoy Das sanjoy at playingwithpointers.com
Mon Apr 27 14:05:15 PDT 2015


> Does your test fail on trunk? This is what the test outputs for me:
>
> ok_161:                                           ; preds = %ok_158
>
>   %lsr.iv.next = add nuw nsw i64 %lsr.iv, 1
>   %4 = add i64 %0, %lsr.iv.next
>   %tmp1 = trunc i64 %4 to i32
>   %tmp188 = icmp slt i32 %tmp1, %tmp160
>   br i1 %tmp188, label %ok_146, label %block_81

That's behavior I'm trying to avoid -- I don't want two additions on
every iteration of the loop.
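
For contrast, here is roughly the shape I'd like LSR to produce
(a hand-written sketch, not actual compiler output): the IV itself
starts at the zero extended value, so the loop needs just the one
increment:

'''
ok_161:                                           ; preds = %ok_158
  ; %lsr.iv would start at (zext i32 %tmp156 to i64), so no second add
  %lsr.iv.next = add nuw nsw i64 %lsr.iv, 1
  %tmp1 = trunc i64 %lsr.iv.next to i32
  %tmp188 = icmp slt i32 %tmp1, %tmp160
  br i1 %tmp188, label %ok_146, label %block_81
'''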

To elaborate a little bit, here is the -debug-only=loop-reduce output
on the test case before the change:

'''
After filtering out undesirable candidates:
LSR is examining the following uses:
  LSR Use: Kind=Basic, Offsets={0}, widest fixup type: i32
    reg({%tmp156,+,1}<nuw><nsw><%ok_146>)
    reg((zext i32 %tmp156 to i64)) + 1*reg({2,+,1}<nw><%ok_146>) + imm(-2)
    reg((zext i32 %tmp156 to i64)) + 1*reg({0,+,1}<nuw><%ok_146>)
    reg((zext i32 %tmp156 to i64)) + 1*reg({-1,+,1}<nw><%ok_146>) + imm(1)
    reg((zext i32 %tmp156 to i64)) + 1*reg({3,+,1}<nw><%ok_146>) + imm(-3)
  LSR Use: Kind=Basic, Offsets={0}, all-fixups-outside-loop, widest fixup type: i64
    reg({(-1 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
    reg({(zext i32 %tmp156 to i64),+,1}<nuw><nsw><%ok_146>) + imm(-1)
    reg((zext i32 %tmp156 to i64)) + 1*reg({0,+,1}<nuw><%ok_146>) + imm(-1)
    reg((zext i32 %tmp156 to i64)) + 1*reg({-1,+,1}<nw><%ok_146>)
    reg((-1 + (zext i32 %tmp156 to i64))) + 1*reg({0,+,1}<nuw><%ok_146>)
    reg({(3 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>) + imm(-4)
    reg({(2 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>) + imm(-3)
  LSR Use: Kind=Address of double, Offsets={16}, widest fixup type: double*
    -16 + reg({(16 + (8 * (zext i32 %tmp156 to i64)) + %d),+,8}<nw><%ok_146>)
    -24 + reg(((8 * (zext i32 %tmp156 to i64)) + %d)) + 4*reg({6,+,2}<%ok_146>)
    -16 + reg(((8 * (zext i32 %tmp156 to i64)) + %d)) + 2*reg({8,+,4}<%ok_146>)
    -24 + reg(((8 * (zext i32 %tmp156 to i64)) + %d)) + 2*reg({12,+,4}<%ok_146>)
    -16 + reg(%d) + 8*reg({(2 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
    -24 + reg(%d) + 8*reg({(3 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
    -8 + reg((16 + %d)<nsw>) + 8*reg({(-1 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
    -16 + reg(((8 * (zext i32 %tmp156 to i64)) + %d)) + 4*reg({4,+,2}<%ok_146>)
    -16 + reg(((8 * (zext i32 %tmp156 to i64)) + %d)) + 8*reg({2,+,1}<nw><%ok_146>)
    -16 + reg((16 + (8 * (zext i32 %tmp156 to i64)) + %d)) + 2*reg({0,+,4}<%ok_146>)
    -16 + reg((16 + (8 * (zext i32 %tmp156 to i64)) + %d)) + 4*reg({0,+,2}<%ok_146>)
    -16 + reg((16 + (8 * (zext i32 %tmp156 to i64)) + %d)) + 8*reg({0,+,1}<nuw><%ok_146>)
    -16 + reg((16 + %d)<nsw>) + 2*reg({(4 * (zext i32 %tmp156 to i64)),+,4}<nuw><nsw><%ok_146>)
    -16 + reg((16 + %d)<nsw>) + 4*reg({(2 * (zext i32 %tmp156 to i64)),+,2}<%ok_146>)
    -16 + reg((16 + %d)<nsw>) + 8*reg({(zext i32 %tmp156 to i64),+,1}<nuw><nsw><%ok_146>)
  LSR Use: Kind=Address of i32, Offsets={12}, widest fixup type: i32*
    -12 + reg({(12 + (4 * (zext i32 %tmp156 to i64)) + %b),+,4}<nw><%ok_146>)
    -20 + reg((12 + (4 * (zext i32 %tmp156 to i64)) + %b)) + 2*reg({4,+,2}<%ok_146>)
    -12 + reg((12 + (4 * (zext i32 %tmp156 to i64)) + %b)) + 1*reg({0,+,4}<%ok_146>)
    -12 + reg((12 + %b)<nsw>) + 1*reg({(4 * (zext i32 %tmp156 to i64)),+,4}<nuw><nsw><%ok_146>)
    -12 + reg(((4 * (zext i32 %tmp156 to i64)) + %b)) + 2*reg({6,+,2}<%ok_146>)
    -20 + reg((12 + (4 * (zext i32 %tmp156 to i64)) + %b)) + 1*reg({8,+,4}<%ok_146>)
    -12 + reg(((4 * (zext i32 %tmp156 to i64)) + %b)) + 1*reg({12,+,4}<%ok_146>)
    -8 + reg(%b) + 4*reg({(2 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
    -12 + reg(%b) + 4*reg({(3 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
    -8 + reg((12 + %b)<nsw>) + 4*reg({(-1 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
    -12 + reg(((4 * (zext i32 %tmp156 to i64)) + %b)) + 4*reg({3,+,1}<nw><%ok_146>)
    -12 + reg((12 + (4 * (zext i32 %tmp156 to i64)) + %b)) + 2*reg({0,+,2}<%ok_146>)
    -12 + reg((12 + (4 * (zext i32 %tmp156 to i64)) + %b)) + 4*reg({0,+,1}<nuw><%ok_146>)
    -12 + reg((12 + %b)<nsw>) + 2*reg({(2 * (zext i32 %tmp156 to i64)),+,2}<%ok_146>)
    -12 + reg((12 + %b)<nsw>) + 4*reg({(zext i32 %tmp156 to i64),+,1}<nuw><nsw><%ok_146>)
New best at 4 regs, with addrec cost 2, plus 2 scale cost, plus 9 imm cost, plus 3 setup cost.
 Regs: {%tmp156,+,1}<nuw><nsw><%ok_146> {(-1 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146> (16 + %d)<nsw> (12 + %b)<nsw>
New best at 4 regs, with addrec cost 1, plus 3 base adds, plus 2 scale cost, plus 3 setup cost.
 Regs: {0,+,1}<nuw><%ok_146> (zext i32 %tmp156 to i64) (16 + (8 * (zext i32 %tmp156 to i64)) + %d) (12 + (4 * (zext i32 %tmp156 to i64)) + %b)

The chosen solution requires 4 regs, with addrec cost 1, plus 3 base adds, plus 2 scale cost, plus 3 setup cost:
  LSR Use: Kind=Basic, Offsets={0}, widest fixup type: i32
    reg((zext i32 %tmp156 to i64)) + 1*reg({0,+,1}<nuw><%ok_146>)
  LSR Use: Kind=Basic, Offsets={0}, all-fixups-outside-loop, widest fixup type: i64
    reg((zext i32 %tmp156 to i64)) + 1*reg({0,+,1}<nuw><%ok_146>) + imm(-1)
  LSR Use: Kind=Address of double, Offsets={16}, widest fixup type: double*
    -16 + reg((16 + (8 * (zext i32 %tmp156 to i64)) + %d)) + 8*reg({0,+,1}<nuw><%ok_146>)
  LSR Use: Kind=Address of i32, Offsets={12}, widest fixup type: i32*
    -12 + reg((12 + (4 * (zext i32 %tmp156 to i64)) + %b)) + 4*reg({0,+,1}<nuw><%ok_146>)
'''


And here is the debug output after the change.  LSR now "sees" a few
more candidate formulae (marked with '<- new solution +++++'):

'''
After filtering out undesirable candidates:
LSR is examining the following uses:
  LSR Use: Kind=Basic, Offsets={0}, widest fixup type: i32
    reg({%tmp156,+,1}<nuw><nsw><%ok_146>)
    reg(%tmp156) + 1*reg({0,+,1}<nuw><%ok_146>)
    reg((zext i32 %tmp156 to i64)) + 1*reg({0,+,1}<nuw><%ok_146>)
    reg((zext i32 %tmp156 to i64)) + 1*reg({-1,+,1}<nw><%ok_146>) + imm(1)
    reg((zext i32 %tmp156 to i64)) + 1*reg({3,+,1}<nw><%ok_146>) + imm(-3)
    reg((zext i32 %tmp156 to i64)) + 1*reg({2,+,1}<nw><%ok_146>) + imm(-2)
  LSR Use: Kind=Basic, Offsets={0}, all-fixups-outside-loop, widest fixup type: i64
    reg({(-1 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
    reg({(zext i32 %tmp156 to i64),+,1}<nuw><nsw><%ok_146>) + imm(-1)
    reg((zext i32 %tmp156 to i64)) + 1*reg({0,+,1}<nuw><%ok_146>) + imm(-1)
    reg((zext i32 %tmp156 to i64)) + 1*reg({-1,+,1}<nw><%ok_146>)
    reg((-1 + (zext i32 %tmp156 to i64))) + 1*reg({0,+,1}<nuw><%ok_146>)
    reg({%tmp156,+,1}<nuw><nsw><%ok_146>) + imm(-1)   <- new solution +++++
    reg({(3 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>) + imm(-4)
    reg({(2 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>) + imm(-3)
  LSR Use: Kind=Address of double, Offsets={16}, widest fixup type: double*
    -16 + reg({(16 + (8 * (zext i32 %tmp156 to i64)) + %d),+,8}<nw><%ok_146>)
    -24 + reg(((8 * (zext i32 %tmp156 to i64)) + %d)) + 4*reg({6,+,2}<%ok_146>)
    -16 + reg(((8 * (zext i32 %tmp156 to i64)) + %d)) + 2*reg({8,+,4}<%ok_146>)
    -24 + reg(((8 * (zext i32 %tmp156 to i64)) + %d)) + 2*reg({12,+,4}<%ok_146>)
    -16 + reg(%d) + 8*reg({(2 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
    -24 + reg(%d) + 8*reg({(3 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
    -8 + reg((16 + %d)<nsw>) + 8*reg({(-1 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
    -16 + reg((16 + %d)<nsw>) + 8*reg({%tmp156,+,1}<nuw><nsw><%ok_146>)   <- new solution +++++
    -16 + reg(((8 * (zext i32 %tmp156 to i64)) + %d)) + 4*reg({4,+,2}<%ok_146>)
    -16 + reg(((8 * (zext i32 %tmp156 to i64)) + %d)) + 8*reg({2,+,1}<nw><%ok_146>)
    -16 + reg((16 + (8 * (zext i32 %tmp156 to i64)) + %d)) + 2*reg({0,+,4}<%ok_146>)
    -16 + reg((16 + (8 * (zext i32 %tmp156 to i64)) + %d)) + 4*reg({0,+,2}<%ok_146>)
    -16 + reg((16 + (8 * (zext i32 %tmp156 to i64)) + %d)) + 8*reg({0,+,1}<nuw><%ok_146>)
    -16 + reg((16 + %d)<nsw>) + 2*reg({(4 * (zext i32 %tmp156 to i64)),+,4}<nuw><nsw><%ok_146>)
    -16 + reg((16 + %d)<nsw>) + 4*reg({(2 * (zext i32 %tmp156 to i64)),+,2}<%ok_146>)
    -16 + reg((16 + %d)<nsw>) + 8*reg({(zext i32 %tmp156 to i64),+,1}<nuw><nsw><%ok_146>)
  LSR Use: Kind=Address of i32, Offsets={12}, widest fixup type: i32*
    -12 + reg({(12 + (4 * (zext i32 %tmp156 to i64)) + %b),+,4}<nw><%ok_146>)
    -20 + reg((12 + (4 * (zext i32 %tmp156 to i64)) + %b)) + 2*reg({4,+,2}<%ok_146>)
    -12 + reg((12 + (4 * (zext i32 %tmp156 to i64)) + %b)) + 1*reg({0,+,4}<%ok_146>)
    -12 + reg((12 + %b)<nsw>) + 1*reg({(4 * (zext i32 %tmp156 to i64)),+,4}<nuw><nsw><%ok_146>)
    -12 + reg(((4 * (zext i32 %tmp156 to i64)) + %b)) + 2*reg({6,+,2}<%ok_146>)
    -20 + reg((12 + (4 * (zext i32 %tmp156 to i64)) + %b)) + 1*reg({8,+,4}<%ok_146>)
    -12 + reg(((4 * (zext i32 %tmp156 to i64)) + %b)) + 1*reg({12,+,4}<%ok_146>)
    -8 + reg(%b) + 4*reg({(2 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
    -12 + reg(%b) + 4*reg({(3 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
    -8 + reg((12 + %b)<nsw>) + 4*reg({(-1 + (zext i32 %tmp156 to i64)),+,1}<nw><%ok_146>)
    -12 + reg((12 + %b)<nsw>) + 4*reg({%tmp156,+,1}<nuw><nsw><%ok_146>)  <- new solution +++++
    -12 + reg(((4 * (zext i32 %tmp156 to i64)) + %b)) + 4*reg({3,+,1}<nw><%ok_146>)
    -12 + reg((12 + (4 * (zext i32 %tmp156 to i64)) + %b)) + 2*reg({0,+,2}<%ok_146>)
    -12 + reg((12 + (4 * (zext i32 %tmp156 to i64)) + %b)) + 4*reg({0,+,1}<nuw><%ok_146>)
    -12 + reg((12 + %b)<nsw>) + 2*reg({(2 * (zext i32 %tmp156 to i64)),+,2}<%ok_146>)
    -12 + reg((12 + %b)<nsw>) + 4*reg({(zext i32 %tmp156 to i64),+,1}<nuw><nsw><%ok_146>)
New best at 3 regs, with addrec cost 1, plus 1 base add, plus 2 scale cost, plus 2 setup cost.
 Regs: {%tmp156,+,1}<nuw><nsw><%ok_146> (16 + %d)<nsw> (12 + %b)<nsw>

The chosen solution requires 3 regs, with addrec cost 1, plus 1 base add, plus 2 scale cost, plus 2 setup cost:
  LSR Use: Kind=Basic, Offsets={0}, widest fixup type: i32
    reg({%tmp156,+,1}<nuw><nsw><%ok_146>)
  LSR Use: Kind=Basic, Offsets={0}, all-fixups-outside-loop, widest fixup type: i64
    reg({%tmp156,+,1}<nuw><nsw><%ok_146>) + imm(-1)
  LSR Use: Kind=Address of double, Offsets={16}, widest fixup type: double*
    -16 + reg((16 + %d)<nsw>) + 8*reg({%tmp156,+,1}<nuw><nsw><%ok_146>)
  LSR Use: Kind=Address of i32, Offsets={12}, widest fixup type: i32*
    -12 + reg((12 + %b)<nsw>) + 4*reg({%tmp156,+,1}<nuw><nsw><%ok_146>)
'''

The new formulae let LSR choose a better solution: all four uses are
now expressed in terms of the single induction variable
{%tmp156,+,1}<nuw><nsw><%ok_146>, so there is no second addition in
the loop.


Looking at the output tells me that I should definitely add something
to the textual representation of a formula that has an implicit zero
extend -- in the dump above, a formula like
reg({%tmp156,+,1}<nuw><nsw><%ok_146>) + imm(-1) under the i64 use
gives no hint that the i32 register is implicitly zero extended.
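
For instance (just a strawman, not something the patch prints today),
the dump could render that formula as:

'''
    reg(zext({%tmp156,+,1}<nuw><nsw><%ok_146>) to i64) + imm(-1)
'''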

> Would you be able to test both post-inc and pre-inc variants? We've a
> lot of bugs because of TransformForPostIncUse.

I think that is a good idea.  I'm not familiar with LSR, so I suspect
it will take me some time to get there, but that by itself is a good
reason to write these tests. :)
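
As a note to self, the two shapes to cover look roughly like this
(hand-written IR, not taken from the test case) -- the only difference
is whether the exit compare consumes the incremented or the current IV
value:

'''
; post-inc variant: the latch compare uses %iv.next
loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  ; ... loop body ...
  %iv.next = add nuw nsw i64 %iv, 1
  %cmp = icmp slt i64 %iv.next, %n
  br i1 %cmp, label %loop, label %exit

; pre-inc variant: the latch compare uses %iv itself
loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
  ; ... loop body ...
  %iv.next = add nuw nsw i64 %iv, 1
  %cmp = icmp slt i64 %iv, %n
  br i1 %cmp, label %loop, label %exit
'''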

It also looks like (though I'm not 100% sure) there is an
inconsistency between how LSR treats overflow in post-inc expressions
and how SCEV treats it.  For instance, in ScalarEvolution.cpp line
~1500 we say that {S,+,X} is <nuw> if "S + MaxBECount * X" does not
unsigned-overflow.  But that says nothing about the overflow behavior
of the post-inc SCEV expression (e.g. if BECount == 0 then every add
rec is <nuw>).
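
To make the BECount == 0 case concrete, here is a worked example in
i8 (my own numbers, not from the patch):

'''
{255,+,1}<i8> with BECount == 0:
  pre-inc values taken inside the loop: just 255   -> no unsigned wrap, so <nuw>
  post-inc value: 255 + 1 == 0 (mod 2^8)           -> wraps
'''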

The problem is that TransformForPostIncUse(zext({S,+,X})) =
zext({S+X,+,X}) -- see ScalarEvolutionNormalization.cpp, line ~100.

Now if {S,+,1} is <nuw> then {zext S,+,1} = zext({S,+,1}); but
{(zext S) + 1,+,1} is not, in general, equal to zext({S+1,+,1}),
because {S+1,+,1} need not be <nuw> under SCEV's definition.  Yet
TransformForPostIncUse({zext S,+,1}) =
TransformForPostIncUse(zext({S,+,1})) = zext({S+1,+,1}), when it
should really be {(zext S) + 1,+,1}.  I don't have a concrete example
where this is a problem, but it definitely looks fishy to me.
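
Carrying the i8 example through the transform (again my own numbers,
just to show where the two sides diverge when widening i8 to i16):

'''
S = 255 (i8), BECount = 0, so {255,+,1}<i8> is <nuw>:

  zext({255,+,1}<i8>)          = {255,+,1}<i16>
  correct post-inc form:         {(zext 255) + 1,+,1}<i16> = {256,+,1}<i16>
  what the transform produces:   zext({255 + 1,+,1}<i8>)   = zext({0,+,1}<i8>)

On the first iteration the correct post-inc value is 256, while
zext({0,+,1}<i8>) evaluates to 0.
'''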

-- Sanjoy

On Mon, Apr 27, 2015 at 11:14 AM, Andrew Trick <atrick at apple.com> wrote:
> This looks reasonable. Thanks!
>
> Does your test fail on trunk? This is what the test outputs for me:
>
> ok_161:                                           ; preds = %ok_158
>
>   %lsr.iv.next = add nuw nsw i64 %lsr.iv, 1
>   %4 = add i64 %0, %lsr.iv.next
>   %tmp1 = trunc i64 %4 to i32
>   %tmp188 = icmp slt i32 %tmp1, %tmp160
>   br i1 %tmp188, label %ok_146, label %block_81
>
> Would you be able to test both post-inc and pre-inc variants? We've a lot of bugs because of TransformForPostIncUse.
>
>
> ================
> Comment at: lib/Transforms/Scalar/LoopStrengthReduce.cpp:3685
> @@ -3669,1 +3684,3 @@
>
> +/// GenerateZExts - Generate reuse formulae from different IV types.
> +void LSRInstance::GenerateZExts(LSRUse &LU, unsigned LUIdx, Formula Base) {
> ----------------
> Please comment this at least as well as your commit comment if not better.
>
> http://reviews.llvm.org/D9181
>


