[PATCH] D21620: [LV] Don't widen trivial induction variables

Thu Jun 23 04:30:04 PDT 2016

mssimpso added a comment.

In http://reviews.llvm.org/D21620#465284, @anemet wrote:

> I am also wondering if this is general goodness that should be in instcombine rather than adding more special-casing to the already complex vectorizer.

Adam,

I'm not sure I agree with this. InstCombine/InstructionSimplify are also very complex, and if you look at my reply to Michael, simplifying these cases is not straightforward. I don't think we want, or can, have InstCombine be an "un-vectorizer". I just doesn't make sense to vectorize these trivial induction variables in the first place.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:2189
@@ +2188,3 @@
+
+  // We can't create a vector with less than two elements.
+  assert(VF > 1 && "VF should be greater than one");
----------------
mkuper wrote:
> mssimpso wrote:
> > mkuper wrote:
> > > mssimpso wrote:
> > > > mkuper wrote:
> > > > > This looks a bit weird to me.Do you have a test-case where this works, but instcombine fails to simplify the regular step vector?
> > > > > Perhaps it would be better to fix instcombine, I can look into it.
> > > > > 
> > > > > In any case, that shouldn't block the rest of the patch - (a) not widening the phi when the uses are trivial, and (b) using the scalarized step vector instead of the regular one should be independently good, right?
> > > > > This looks a bit weird to me.Do you have a test-case where this works, but instcombine fails to simplify the regular step vector?
> > > > 
> > > > Sure. Here's a simple example with interleaved access vectorization enabled.
> > > > 
> > > > In the test case below, if we use the regular step vector, we end up creating unneeded splat inserts and extracts inside the loop. A scalar induction variable is definitely preferable here since we don't do anything exciting with it.
> > > > 
> > > > ```
> > > > 
> > > > ; opt -S < %s -loop-vectorize -force-vector-width=2 -force-vector-interleave=2 -enable-interleaved-mem-accesses -instcombine
> > > > target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
> > > > target triple = "aarch64--linux-gnu"
> > > > 
> > > > %pair = type { i64, i64 }
> > > > define void @interleaved(%pair *%p, i64 %y, i64 %n) {
> > > > entry:
> > > >   br label %for.body
> > > > 
> > > > for.body:
> > > >   %i  = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
> > > >   %f1 = getelementptr inbounds %pair, %pair* %p, i64 %i, i32 1
> > > >   %r0 = load i64, i64* %f1, align 8
> > > >   %r3 = xor i64 %r0, %y
> > > >   store i64 %r3, i64* %f1, align 8
> > > >   %i.next = add nuw nsw i64 %i, 1
> > > >   %cond = icmp slt i64 %i.next, %n
> > > >   br i1 %cond, label %for.body, label %for.end
> > > > 
> > > > for.end:
> > > >   ret void
> > > > }
> > > > ```
> > > > 
> > > > I've seen other cases with what I've been working on that lead me to believe fixing this kind of thing in InstCombine is not going to be that fruitful. Things get very ugly when interleaved access vectorization is enabled with conditional stores vectorization, for example. I think it's better to do the right thing at the outset.
> > > Sorry, I wasn't very clear.
> > > 
> > > I didn't mean an example of when a scalar IV is preferable. You're completely right about that.
> > > I meant a case where, in addition to the scalar IV, we want the pre-scalarized step vector, because the vectorized one doesn't simplify. Or is this the same example?
> > No, you were clear. This is the same example -- it doesn't simplify. Without the pre-scalarized step vector, we're left with the splats, extracts, etc. I will add it as a test case to the patch.
> Ah, ok, I was confused because you wrote that "a scalar induction variable is definitely preferable".
> Anyway, this does look like an InstCombine issue.
> We get:
>   %broadcast.splatinsert = insertelement <2 x i64> undef, i64 %index, i32 0
>   %broadcast.splat = shufflevector <2 x i64> %broadcast.splatinsert, <2 x i64> undef, <2 x i32> zeroinitializer
>   %induction = add <2 x i64> %broadcast.splat, <i64 0, i64 1>
>   %induction1 = add <2 x i64> %broadcast.splat, <i64 2, i64 3>
>   %3 = extractelement <2 x i64> %induction, i32 0
>   %4 = getelementptr inbounds %pair, %pair* %p, i64 %3, i32 1
>   %5 = insertelement <2 x i64*> undef, i64* %4, i32 0
>   %6 = extractelement <2 x i64> %induction, i32 1
>   %7 = getelementptr inbounds %pair, %pair* %p, i64 %6, i32 1
>   %8 = insertelement <2 x i64*> %5, i64* %7, i32 1
>   %9 = extractelement <2 x i64> %induction1, i32 0
>   %10 = getelementptr inbounds %pair, %pair* %p, i64 %9, i32 1
>   %11 = insertelement <2 x i64*> undef, i64* %10, i32 0
>   %12 = extractelement <2 x i64> %induction1, i32 1
>   %13 = getelementptr inbounds %pair, %pair* %p, i64 %12, i32 1
>   %14 = insertelement <2 x i64*> %11, i64* %13, i32 1
> 
> InstCombine cleans this up to:
>   %broadcast.splatinsert = insertelement <2 x i64> undef, i64 %index, i32 0
>   %broadcast.splat = shufflevector <2 x i64> %broadcast.splatinsert, <2 x i64> undef, <2 x i32> zeroinitializer
>   %induction1 = add <2 x i64> %broadcast.splat, <i64 2, i64 3>
>   %3 = getelementptr inbounds %pair, %pair* %p, i64 %index, i32 1
>   %4 = or i64 %index, 1
>   %5 = getelementptr inbounds %pair, %pair* %p, i64 %4, i32 1
>   %6 = extractelement <2 x i64> %induction1, i32 0
>   %7 = getelementptr inbounds %pair, %pair* %p, i64 %6, i32 1
>   %8 = extractelement <2 x i64> %induction1, i32 1
>   %9 = getelementptr inbounds %pair, %pair* %p, i64 %8, i32 1
>   %10 = bitcast i64* %3 to <4 x i64>*
>   %11 = bitcast i64* %7 to <4 x i64>*
> 
> I don't immediately see why it should be able to handle %induction, but not %induction1.
> Am I missing something obvious?
InstCombine/InstructionSimplify is only able to simplify %induction because it knows element zero of the add is a noop:

```
%induction = add <2 x i64> %broadcast.splat, <i64 0, i64 1>
```

It can then replace the extract from element 0 with %index, and then the rest falls out. We can't do the same thing for %induction1, though. The add is not a noop. InstCombine would have to decide it would be beneficial to unvectorize actual computation, replacing a single vector add with 2 scalar adds. It doesn't seem like InstCombine is equipped to do this. And I think you would need a cost model for this, anyway.

http://reviews.llvm.org/D21620