[PATCH] D19501: Add LoadStoreVectorizer pass

Alina Sbirlea via llvm-commits llvm-commits at lists.llvm.org
Mon Jun 20 13:42:39 PDT 2016


asbirlea added inline comments.

================
Comment at: lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:304
@@ +303,3 @@
+  }
+}
+
----------------
asbirlea wrote:
> jlebar wrote:
> > Talking to asbirlea, we are not convinced this is safe.  In particular, we cannot safely reorder across instructions that may read or may write arbitrary memory, nor can we safely reorder loads and stores that are not in the chain but which nonetheless may alias each other.
> > 
> > I think we're safe wrt instructions which may read or write, because these will never appear between a chain's first and last instructions.  (I would feel much better if we had an assert to this effect, though.)
> > 
> > But I am not convinced we're safe wrt regular loads and stores whose addresses depend on the result of a vectorized load.  For example, imagine we have
> > 
> >   b = load a
> >   load [b]  // b does not alias a
> >   store [b]
> >   c = load a+1
> > 
> > Suppose that we choose to vectorize the first and last load and WLOG assume we insert the vectorized load at the position of c:
> > 
> >   load [b]
> >   store [b]
> >   {b, c} = load {a, a+1}
> > 
> > Now we call reorder() to fix this up.  The first two instructions are going to be moved below the vectorized load in the reverse order of I->uses().  AFAICT there is no guarantee on the order of uses(), but even if there were, outputting in the opposite order seems probably not right?
> > 
> > Even if that were somehow right, afaict we can make reorder() do arbitrary, unsafe reorderings of loads/stores with the following idiom.
> > 
> >   b = load a
> >   b1 = b + 0 // copy b to b1
> >   b2 = b + 0 // copy b to b2
> >   load [b1]  // b1 does not alias a
> >   store [b2]  // b2 does not alias a
> >   c = load a+1
> > 
> > Now if the order of b1 and b2 in the source determines their order in users(), then that controls the order of "load [b1]" and "store [b2]" in the result, even though there's only one correct ordering of that load and store.
> > 
> > So to summarize, it may be my ignorance, but I don't see why we can rely on there being any particular order to users().  But even if users() has some guaranteed order, I don't see how any ordering could guarantee that we never reorder an aliasing load/store pair.
> Tried testing this out using jlebar's first example.
> 
> Input: 
> ```
> %struct.buffer_t = type { i64, i8*, [4 x i32], [4 x i32], [4 x i32], i32, i8, i8, [2 x i8] }       
>   
> define i32 @test(%struct.buffer_t* noalias %buff) #0 {                                             
> entry:
>   %tmp1 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i64 0, i32 1           
>   %buff.host = load i8*, i8** %tmp1, align 8                                                       
>   %buff.val = load i8, i8* %buff.host, align 8                                                     
>   store i8 0, i8* %buff.host, align 8                                                              
>   %tmp0 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i64 0, i32 0           
>   %buff.dev = load i64, i64* %tmp0, align 8                                                        
>   ret i32 0
> } 
>   
> attributes #0 = { nounwind }
> ```
> 
> Output:
> 
> ```
> %struct.buffer_t = type { i64, i8*, [4 x i32], [4 x i32], [4 x i32], i32, i8, i8, [2 x i8] }       
>   
> ; Function Attrs: nounwind
> define i32 @test(%struct.buffer_t* noalias %buff) #0 {                                             
> entry:
>   %tmp0 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i64 0, i32 0           
>   %0 = bitcast i64* %tmp0 to <2 x i64>*                                                            
>   %1 = load <2 x i64>, <2 x i64>* %0, align 8                                                      
>   %2 = extractelement <2 x i64> %1, i32 0
>   %3 = extractelement <2 x i64> %1, i32 1                                                          
>   %4 = inttoptr i64 %3 to i8*                                                                      
>   store i8 0, i8* %4, align 8
>   %buff.val = load i8, i8* %4, align 8                                                             
>   ret i32 0
> } 
>   
> attributes #0 = { nounwind }
> ```
> 
> The first and last load got vectorized and moved to the beginning, but the uses that now follow are in reversed order.
> 
> The input had load [b], store [b]. The output has store [b], load [b].
> 
FWIW, the example above can be fixed by keeping track of the last instruction inserted.  For example:

```
Instruction *InsertAfter = I;
for (User *U : I->users())
  if (Instruction *User = dyn_cast<Instruction>(U))
    if (!DT.dominates(I, User) && User->getOpcode() != Instruction::PHI) {
      User->removeFromParent();
      User->insertAfter(InsertAfter);
      reorder(User);
      InsertAfter = User;
    }
```

However, this does not account for the recursive inserts, and it assumes that users() is in BB order.

A correct but __very__ iffy way: 1. build a set of all (transitive) users (folding the recursive call in here); 2. make a pass over __all__ instructions in the BB and, whenever an instruction is in the collected set, insert it after the latest inserted instruction (as in the sample above); or just walk the BB in reverse order.

I would favor avoiding the reorder entirely rather than doing a pass over the BB at every reorder.


http://reviews.llvm.org/D19501
