[LLVMdev] Vectorizing alloca instructions

Thu Oct 24 16:08:05 PDT 2013

On Thu, Oct 24, 2013 at 02:15:00PM -0700, Nadav Rotem wrote:
> Hi Tom, 
> 
> Thanks for working on this.  The SLP-vectorizer thinks that %X %Y %Z and %W alias, so it tries to perform 4 scalar store operations (which is a bad idea).  We need to figure out why AA thinks that X and Y may alias.  Maybe there is a problem with the code that uses AA. 
> 

Thanks for the tip. opt was using  NoAliasAnalysis by default, so passing
-basicaa to opt gives me this now:

SLP: Analyzing blocks in vector.
SLP: Found 5 stores to vectorize.
SLP: Analyzing a store chain of length 4.
SLP: Analyzing a store chain of length 4
SLP: Analyzing 4 stores at offset 0
SLP: Checking users of    store i32 0, i32* %x. 
SLP: Checking users of    store i32 1, i32* %y. 
SLP: Checking users of    store i32 2, i32* %z. 
SLP: Checking users of    store i32 3, i32* %w. 
SLP: We are able to schedule this bundle.
SLP: added a vector of stores.
SLP: Gathering due to C,S,B,O. 
SLP: Calculating cost for tree of size 2.
SLP: Check whether the tree with height 2 is fully vectorizable .
SLP: Adding cost -3 for bundle that starts with   store i32 0, i32* %x .
SLP: Adding cost 0 for bundle that starts with i32 0 .
SLP: Total Cost -3.
SLP: Found cost=-3 for VF=4
SLP: Decided to vectorize cost=-3
SLP: Extracting 0 values .
SLP:    Erasing scalar:  store i32 0, i32* %x.
SLP:    Erasing scalar:  store i32 1, i32* %y.
SLP:    Erasing scalar:  store i32 2, i32* %z.
SLP:    Erasing scalar:  store i32 3, i32* %w.
SLP: Optimizing 0 gather sequences instructions.
SLP: vectorized "vector"
; ModuleID = 'vector-alloca.ll'
target datalayout =
"e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-v2048:2048:2048-n32:64"
target triple = "r600--"

define void @vector(i32 addrspace(1)* %out, i32 %index) {
entry:
  %0 = alloca [4 x i32]
  %x = getelementptr [4 x i32]* %0, i32 0, i32 0
  %y = getelementptr [4 x i32]* %0, i32 0, i32 1
  %z = getelementptr [4 x i32]* %0, i32 0, i32 2
  %w = getelementptr [4 x i32]* %0, i32 0, i32 3
  %1 = bitcast i32* %x to <4 x i32>*
  store <4 x i32> <i32 0, i32 1, i32 2, i32 3>, <4 x i32>* %1
  %2 = getelementptr [4 x i32]* %0, i32 0, i32 %index
  %3 = load i32* %2
  store i32 %3, i32 addrspace(1)* %out
  ret void
}

The next step for me is to figure out how to replace the GEP + load with an
extractelement.

-Tom

> Thanks,
> Nadav
> 
> 
> On Oct 24, 2013, at 2:04 PM, Tom Stellard <tom at stellard.net> wrote:
> 
> > Hi,
> > 
> > I've been playing around with the SLPVectorizer trying to get it to
> > vectorize this simple program:
> > 
> > define void @vector(i32 addrspace(1)* %out, i32 %index) {
> > entry:
> >  %0 = alloca [4 x i32]
> >  %x = getelementptr [4 x i32]* %0, i32 0, i32 0
> >  %y = getelementptr [4 x i32]* %0, i32 0, i32 1
> >  %z = getelementptr [4 x i32]* %0, i32 0, i32 2
> >  %w = getelementptr [4 x i32]* %0, i32 0, i32 3
> >  store i32 0, i32* %x
> >  store i32 1, i32* %y
> >  store i32 2, i32* %z
> >  store i32 3, i32* %w
> >  %1 = getelementptr [4 x i32]* %0, i32 0, i32 %index
> >  %2 = load i32* %1
> >  store i32 %2, i32 addrspace(1)* %out
> >  ret void
> > }
> > 
> > My goal is to have this program transformed to the following:
> > 
> > define void @vector(i32 addrspace(1)* %out, i32 %index) {
> > entry:
> >  %0 = extractelement <4 x i32> <i32 0, i32 1, i32 2, i32 3>, i32 %index
> >  store i32 %0, i32 addrspace(1)* %out
> > }
> > 
> > I've slightly modified the SLPVectorizer (see the attached patch) so
> > that it will vectorize small trees, and I've also fixed a crash in the
> > BoUpSLP::Gather() function when it is passed a list of store
> > instructions.  With this patch, the command:
> > 
> > opt -slp-vectorizer -debug -march=r600 -mcpu=redwood -o - vector-alloca.ll -S -slp-threshold=-20
> > 
> > Produces the following output and the program remains unchanged:
> > 
> > ====
> > 
> > SLP: Analyzing blocks in vector.
> > SLP: Found 5 stores to vectorize.
> > SLP: Analyzing a store chain of length 4.
> > SLP: Analyzing a store chain of length 4
> > SLP: Analyzing 4 stores at offset 0
> > SLP: Checking users of    store i32 0, i32* %x. 
> > SLP: Checking users of    store i32 1, i32* %y. 
> > SLP: Checking users of    store i32 2, i32* %z. 
> > SLP: Checking users of    store i32 3, i32* %w. 
> > SLP: We are able to schedule this bundle.
> > SLP: Can't sink   store i32 0, i32* %x
> > down to   store i32 3, i32* %w
> > because of   store i32 1, i32* %y.  Gathering.
> > SLP: Calculating cost for tree of size 1.
> > SLP: Check whether the tree with height 1 is fully vectorizable .
> > SLP: Adding cost 4 for bundle that starts with   store i32 0, i32* %x .
> > SLP: Total Cost 4.
> > SLP: Found cost=4 for VF=4
> > SLP: Decided to vectorize cost=4
> > SLP: Extracting 0 values .
> > SLP: Optimizing 0 gather sequences instructions.
> > SLP: vectorized "vector"
> > 
> > ====
> > 
> > I'm having a little trouble figuring out why the stores do not end up
> > being vectorized.  Does anyone have any insight into this?  Should this
> > pass be able to perform the desired transformation?
> > 
> > Thanks,
> > Tom
> > 
> > <slp-vectorize-alloc.patch>_______________________________________________
> > LLVM Developers mailing list
> > LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>