[PATCH] [LoopVectorize]Teach Loop Vectorizer about interleaved memory access

Wed May 13 21:26:34 PDT 2015

Hi Hao,

Thanks for the updated patch, it looks better!

Below are some comments from me: most of them are nitpicks and typo-corrections, but a couple of them are important ones.

Michael


================
Comment at: include/llvm/Analysis/LoopAccessAnalysis.h:384
@@ +383,3 @@
+    bool isReverse() const { return Reverse; }
+    unsigned getStride() const { return Stride; };
+    unsigned getAlign() const { return Align; }
----------------
s/};/}/

================
Comment at: include/llvm/Analysis/LoopAccessAnalysis.h:395-399
@@ +394,7 @@
+
+    // Return true if successfully insert a new member. \p Index is relative
+    // to the leader (The member with index 0), so it could be negative.
+    // This could fail if it breaks two rules below:
+    //     1. The max index is always less than the stride.
+    //     2. There is already a member with same index.
+    bool insertMember(Instruction *Inst, int Index, unsigned Alignment) {
----------------
Could you please convert the comments to doxygen format?
(See http://llvm.org/docs/CodingStandards.html#doxygen-use-in-documentation-comments )

================
Comment at: include/llvm/Analysis/LoopAccessAnalysis.h:401-414
@@ +400,16 @@
+    bool insertMember(Instruction *Inst, int Index, unsigned Alignment) {
+      unsigned AbsIdx = std::abs(Index);
+      if (Index < 0) {
+        unsigned LIndex = Members.back().second;
+        // Break rule 1. Doesn't belong to this group.
+        if (LIndex + AbsIdx >= Stride)
+          return false;
+
+        for (auto &I : Members)
+          I.second += AbsIdx;
+
+        Members.insert(Members.begin(), std::make_pair(Inst, 0));
+        Align = std::min(Align, Alignment);
+        return true;
+      }
+
----------------
A couple of comments regarding this code:
1) Let's consistently use either `Idx` or `Index`.
2) Do we need `AbsIdx` at all? Under the if we can use `-Index` instead of it, after the if - just `Index`.
3) In the paper you mentioned it was stated that the grouping algorithm is linear in the number of accesses. In your current implementation it's quadratic, and the complexity seems to come from two factors: 1) we recalculate indexes after adding a new member, 2) we explicitly check if the group already contains a member with the same index. Could we do anything about that? E.g. I think that recalculating isn't necessary at all, and the check for duplicates can be done faster if we use a hash-table for it.

================
Comment at: include/llvm/Analysis/LoopAccessAnalysis.h:431
@@ +430,3 @@
+      Members.insert(I, std::make_pair(Inst, AbsIdx));
+      Align = std::min(Align, Alignment);
+      return true;
----------------
Not a good naming - from the names it's absolutely unclear what's the difference between `Align` and `Alignment`, and what they stand for.

================
Comment at: include/llvm/Analysis/LoopAccessAnalysis.h:436
@@ +435,3 @@
+    void removeMember(Instruction *Inst) {
+      for (auto I = Members.begin(), E = Members.end(); I != E; ++I) {
+        if (I->first != Inst)
----------------
Use range loop?

================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:1037
@@ -1029,3 +1036,3 @@
 
-  // Now we have two lists that hold the loads and the stores.
-  // Next, we find the pointers that they use.
+  analyzeDependences(Loads, Stores, IsAnnotatedParallel, Strides);
+
----------------
Factoring out this function could go in a separate patch.

================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:1074-1076
@@ -1051,5 +1073,5 @@
 
-  ValueVector::iterator I, IE;
-  for (I = Stores.begin(), IE = Stores.end(); I != IE; ++I) {
-    StoreInst *ST = cast<StoreInst>(*I);
+  // Find the pointers that stores use.
+  for (auto I : Stores) {
+    StoreInst *ST = cast<StoreInst>(I);
     Value* Ptr = ST->getPointerOperand();
----------------
This change can go in a separate patch.

================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:1104-1106
@@ -1081,4 +1103,5 @@
 
-  for (I = Loads.begin(), IE = Loads.end(); I != IE; ++I) {
-    LoadInst *LD = cast<LoadInst>(*I);
+  // Find the pointers that loads use.
+  for (auto I : Loads) {
+    LoadInst *LD = cast<LoadInst>(I);
     Value* Ptr = LD->getPointerOperand();
----------------
And this change can go in a separate patch.

================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:1253
@@ +1252,3 @@
+
+    // An alignment of 0 means target abi alignment.
+    unsigned Align = LI ? LI->getAlignment() : SI->getAlignment();
----------------
s/abi/ABI/

================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:1265
@@ +1264,3 @@
+  //   1. A and B have the same stride.
+  //   2. A and B has the same memory object size.
+  //   3. The distance of B to the leader of the interleaved access is less
----------------
s/has/have/

================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:1272
@@ +1271,3 @@
+  // already checked there is no overlap between pointers of different base.
+  // For the pointers sharing the same base. There may be three dependeces:
+  //
----------------
s/base. There/base there/
s/dependeces/dependences/



================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:1283
@@ +1282,3 @@
+  //            A[i] = b; (2)
+  // The bottom-up search order gurantees (2) is always inserted into the
+  // interleaved store below the interleaved store that (1) belongs to.
----------------
s/gurantees/guarantees/

================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:1289
@@ +1288,3 @@
+  //            A[i] = b;
+  // The insert positon of interleaved store and interleaved load can gurantee
+  // the write is always after the read.
----------------
s/gurantee/guarantee/

================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:1307
@@ +1306,3 @@
+
+    for (auto II = std::next(I); II != E; ++II) {
+      Instruction *B = *II;
----------------
I think I haven't understood completely how the paper suggests to do the grouping in a linear time, but the given implementation looks quadratic. If we are ok to keep it such, we need to at least add some limits so that we won't blow up compile time in case of huge loop bodies with mass of interleaved accesses.

================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:1327
@@ +1326,3 @@
+
+      int DistanceToA = C->getValue()->getValue().getZExtValue();
+      // Read/write the same location.
----------------
Shouldn't it be `getSExtValue` since the distance can be negative?

================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:1342
@@ +1341,3 @@
+        InterleavedAccessMap[B] = IA;
+      continue;
+    }
----------------
This is redundant.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:4842
@@ +4841,3 @@
+    if (Legal->isInterleaved(I)) {
+      auto IA = Legal->getInterleavedAccess(I);
+      Type *WideVecTy =
----------------
I'd add an assert that `IA` != `nullptr` here.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:4848-4850
@@ +4847,5 @@
+      // Calculate the cost of the interleaved access for this member.
+      unsigned Cost = TTI.getInterleavedMemoryOpCost(
+          I->getOpcode(), WideVecTy, VectorTy, IA->getIndex(I),
+          IA->getNumGaps(), IA->getAlign(), AS);
+
----------------
Do we need to add the cost once per interleaving group, or per interleaving group member?

http://reviews.llvm.org/D9368

EMAIL PREFERENCES
  http://reviews.llvm.org/settings/panel/emailpreferences/