[PATCH] [LoopVectorize]Teach Loop Vectorizer about interleaved memory access

Wed May 6 04:09:52 PDT 2015

Hi,

Thanks very much for the code review. 
I've updated the code according to most of the comments.
I just can not complete the following two comments by the end of today:

  (1) To add a new cost model function in BasicTTIImpl.h
  (2) To handle the redundant load/store case.

I'll do in the next patch.

Thanks,
-Hao


================
Comment at: include/llvm/Analysis/TargetTransformInfoImpl.h:313
@@ +312,3 @@
+    // lanes.
+    Cost += VecType->getNumElements();
+    return Cost;
----------------
hfinkel wrote:
> This is not really the right place for this default, and not the right way to generate it. Just return 1 here.
> 
> The default should be in include/llvm/CodeGen/BasicTTIImpl.h, and you should call getScalarizationOverhead, likely, to generate the default cost.
I've removed such code.
As I'm not familiar with the cost model, I'll add the cost later.

================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:533
@@ -532,2 +532,3 @@
           *Ptr << " SCEV: " << *PtrScev << "\n");
+    return 0;
   }
----------------
hfinkel wrote:
> Is this just a drive-by bug fix?
Yes. I just came across this minor bug.
As it's quite obvious and simple, should I just commit this fix?

================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:782
@@ +781,3 @@
+  unsigned NumIterations = ForcedFactor * ForcedUnroll;
+  // The number of vectorized & interleaved iterations should not be less than 2
+  if (NumIterations < 2)
----------------
rengolin wrote:
> why not?
Sorry that my comment is confusing.

I mean the "ForcedFactor * ForcedUnroll" will never be 1. 
    (1) If the VectorizationFactor and VectorizationInterleave are both set to be 1 by user, this is the same to disable the loop vectorizer. It will never execute to this point.
    (2) If either of the Factors is forced, the number of iterations will also never be 1. See if user set "-force-vector-interleaved=1", the vector factor will be at least 2. Or if user set "-force-vector-width=1", the unroll factor will be at least 2.

So I set the number of iterations to be 2 if either factors is forced. This can simplify the following distance checks.

I've changed the comments to make it more clear.



================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:790
@@ +789,3 @@
+  //   (2) Distance >= TypeByteSize * NumIterations * Stride
+  // is true and
+  //   (3) Distance <= MaxSafeDepDistBytes
----------------
rengolin wrote:
> This comment is confusing. You mean to say that 3 always has to be true, while 1 or 2 should be true. I'd rephrase this as:
> 
>     // Positive distance is safe when:
>     //    Distance <= MaxSafeDepDistBytes
>     // And either one below is true:
>     //   * Distance < Stride * TypeByteSize
>     //   * Distance >= TypeByteSize * NumIterations * Stride
Done.

================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:804
@@ -795,2 +803,3 @@
+    MaxSafeDepDistBytes = Distance;
 
   bool IsTrueDataDependence = (!AIsWrite && BIsWrite);
----------------
aschwaighofer wrote:
> MaxSafeDepDistBytes records the maximum distance that is still safe.
> 
> Say we saw a maximum safe distance of 4 bytes before. Now we process a new pair of access and the distance would be 8 bytes and that is still bigger-equal than "TypeByteSize * NumIterations * Stride". It is not correct to say our MaxSafeDepDistBytes is 8 bytes now.
Hi Arnold,

Your example is not true.
We've already checked "Distance <= MaxSafeDepDistBytes", so the Distance is always equal or less than 4 bytes in your case.

I've modified the comments and code to make it more clear:
  // If Distance < Stride * TypeByteSize, it is always safe, just keep current
  // MaxSafeDepDistBytes.
  // Otherwise need to update the MaxSafeDepDistBytes to be Distance. (We've
  // already checked that "Distance <= MaxSafeDepDistBytes").


================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:164
@@ +163,3 @@
+static cl::opt<bool>
+    VectorizeInterleavedAccess("vectorize-interleaved-access", cl::init(true),
+                               cl::Hidden,
----------------
rengolin wrote:
> hfinkel wrote:
> > rengolin wrote:
> > > If we need a flag for the moment, it's to make it false by default, so we can turn it on when we want, not the other way around. Later, if the pass has proven correct, we should make it on by default. This transformation won't change unit stride loops anyway, so I don't see why we would need to disable it even temporarily.
> > > 
> > > Another note, I don't think we need the extensive comment here, just one line line the others will be fine. Nor we should document *how* we vectorize, since if that changes, the comment will be automatically obsolete, and probably never changed.
> > I disagree, the comment is good and we definitely should have it. However, it should not be here, but with the implementation (where it is also much less likely to get out of sync).
> > 
> > Luckily, it seems that we already have such comments below, so this might be just removed. However, I don't want us to give the impression that extensive explanatory comments are undesirable in general.
> We should have that kind of documentation, yes, but in the right place. Here, just one line would do. On other methods, just the part that relate to them. In the entry point, a more top view would be good, possibly with the C->IR part.
Done.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:704
@@ +703,3 @@
+          return i;
+      return Members.size();
+    }
----------------
rengolin wrote:
> You're already making sure the item is there. I think you should have an llvm_unreachable() here.
Done.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:1833
@@ +1832,3 @@
+// If the 1st vector has more elements, extend the 2nd vector with UNDEFs.
+static Value *ConcateTwoVectors(IRBuilder<> &Builder, Value *V1, Value *V2) {
+  VectorType *VecTy1 = dyn_cast<VectorType>(V1->getType());
----------------
rengolin wrote:
> s/Concate/Concatenate/
> 
> I thought we had this kind of functionality... How is Elena doing the same for her indexed loads?
Done.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:3751
@@ -3416,1 +3750,3 @@
 
+static bool isInBoundsGep(Value *Ptr) {
+  if (GetElementPtrInst *GEP = dyn_cast<GetElementPtrInst>(Ptr))
----------------
rengolin wrote:
> Verbatim copy is not right. This is too generic to not be exposed by some GEP or Builder class.
Done.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:3758
@@ +3757,3 @@
+/// \brief Check whether the access through \p Ptr has a constant stride.
+static int isStridedPtr(ScalarEvolution *SE, Value *Ptr, const Loop *Lp,
+                        const SCEV *&PtrScev) {
----------------
rengolin wrote:
> Non-Verbatim copy is even worse. Please make sure that you can use lib/Analysis' isStridedPtr.
Done.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:3934
@@ +3933,3 @@
+
+  for (SetVector<Instruction *>::iterator it = Heads.begin(), e = Heads.end();
+       it != e; ++it) {
----------------
hfinkel wrote:
> You can use auto here for the iterator type, a range-based for would be better.
Done.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:3941
@@ +3940,3 @@
+
+    // Collect a list with consecutive accesses.
+    while (Tails.count(I) || Heads.count(I)) {
----------------
aschwaighofer wrote:
> What does this code do if you have two interleaved accesses to the same location? (I know we should not see this in optimized IR, but we can't rely on this fact)
> 
> 
> a[2i] = ; // (1)
> a[2i] = ; // (2)
> a[2i+1] = ; // (3)
> a[2i+1] = ; // (4)
> 
> 
> Are we guaranteed to get the pairs (1) (3) and (2)(4)?
> 
> It seems to me that we will get the last consecutive location since we process 'Queue' front to back in the loop above that populates ConsecutiveChain; in my example we would get (1), (4)
> 
> 
Arnold,

You are right, we can not get pairs (1)(3) and (2)(4). Actually it get two pairs: (1)(4) and (2)(4), as the later node overrides the former node.

Previously, I never considered about the redundant loads/stores. Now it seems error prone, as the newly generated instruction may break the dependence. I'm still thinking about how to fix this.



================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:4039
@@ +4038,3 @@
+      if (TryMerge) {
+        collectInterleavedAccess(AccessQueue, CurStride, IsLoad);
+        CurBase = nullptr;
----------------
rengolin wrote:
> Potential uninitialized use of CurStride.
I've initialized CurStride to be 1.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:4081
@@ +4080,3 @@
+    // Collect the tail accesses.
+    if (CurBase != nullptr) {
+      collectInterleavedAccess(AccessQueue, CurStride, IsLoad);
----------------
rengolin wrote:
> You just need
> 
>     if (CurBase) {
Done.

http://reviews.llvm.org/D9368

EMAIL PREFERENCES
  http://reviews.llvm.org/settings/panel/emailpreferences/