[PATCH] [LoopVectorize]Teach Loop Vectorizer about interleaved memory access

Tue May 19 23:52:52 PDT 2015

> Also, I think it's a good idea to separate the part for LoopAccessAnalysis part into a separate patch. I think it's fine to keep it in one patch during review (to keep the history), but I'd suggest committing it separately when they are ready. It'll help tracking down any potential issues that could be exposed later.

Michael, I agree with you and I'll separate the patch when committing.

I've refactored the patch according to comments from Michael.

Review please.

Thanks,
-Hao

================
Comment at: include/llvm/Analysis/LoopAccessAnalysis.h:325
@@ +324,3 @@
+      : Align(Align), SmallestKey(0), LargestKey(0), InsertPos(nullptr) {
+    assert(Align && "The alignment should be non-zero");
+
----------------
mzolotukhin wrote:
> We require `Align!=0` here, but in the constructor below we explicitly set it to 0. Is one of these unused and could be removed?
Removed the unused constructor.

================
Comment at: include/llvm/Analysis/LoopAccessAnalysis.h:495-496
@@ +494,4 @@
+  /// \returns the newly created interleave group.
+  InterleaveGroup *createInterleaveGroup(Instruction *Instr, int Stride,
+                                                unsigned Align) {
+    assert(!InterleaveGroupMap.count(Instr) &&
----------------
mzolotukhin wrote:
> Please fix the indentation here.
Fixed.

================
Comment at: include/llvm/Analysis/TargetTransformInfo.h:448
@@ +447,3 @@
+  /// \return The cost of one member in the interleaved access.
+  /// \p Vecty is the wide vecotor type of the whole interleaved access.
+  /// \p SubTy is the vector type of the member.
----------------
mzolotukhin wrote:
> s/vecotor/vector/
Done.

================
Comment at: include/llvm/Analysis/TargetTransformInfo.h:449-451
@@ +448,5 @@
+  /// \p Vecty is the wide vecotor type of the whole interleaved access.
+  /// \p SubTy is the vector type of the member.
+  /// \p Index is the index of current member in the interleaved access.
+  /// \p Gap is the number of missing members
+  unsigned getInterleavedMemoryOpCost(unsigned Opcode, Type *VecTy,
----------------
mzolotukhin wrote:
> This comment seems to be outdated - there are no arguments `SubTy`, `Index`, or `Gap` in the declaration below.
Fixed.

================
Comment at: include/llvm/Analysis/TargetTransformInfoImpl.h:306
@@ +305,3 @@
+                                      unsigned AddressSpace) {
+    return Delta;
+  }
----------------
mzolotukhin wrote:
> Why the cost is `Delta`?
Previously I though there are 'Delta' instructions each costs 1.
Regarding all the default cost in other getXXXCost is 1, I've changed it to be "1".

================
Comment at: include/llvm/CodeGen/BasicTTIImpl.h:546-547
@@ +545,4 @@
+
+      for (unsigned i = 0; i < NumElts; i++)
+        Cost += getVectorInstrCost(Instruction::ExtractElement, VT, i);
+
----------------
mzolotukhin wrote:
> Am I getting it correctly, that `VecTy` is the type of 'combined' vector?
> I.e. if we have a stride=2, VF=4, and ScalarTy=i32, then `VecTy` would be `<8 x i32>`, and `SubVT` would be `<4 x i32>`? I'm not sure I've completely understood this part, so please correct me if I'm wrong.
> 
> If that's true, then the cost computation doesn't seem correct. Extract 1st or 2nd element from the wide vector isn't the same as extracting odds or even elements.
Yes, you are right.

The cost is not only "Extract" cost, it also plus the "Insert" cost.

Say we interleaved load two vectors:
     %vec = load <8 x i32>, <8 x i32>* %ptr
     %v0 = shuffle %vec, undef, <0, 2, 4, 6>
     %v1 = shuffle %vec, undef, <1, 3, 5, 7>

The cost consist of 2 parts:
      (1) load of <8 x i32>
      (2) extract %v0 and %v1 from %vec.

The cost of (2) is estimated as 2 parts:
      (a) extract elements at: 0, 1, 2, 3, 4, 5, 6, 7
      (b) insert elements 0, 2, 4, 6 into %v0 and insert elements 1, 3, 5, 7 into %v1

The interleaved store is similar.

================
Comment at: include/llvm/CodeGen/BasicTTIImpl.h:551
@@ +550,3 @@
+      for (unsigned i = 0; i < NumSubElts; i++)
+        InsSubCost = getVectorInstrCost(Instruction::InsertElement, SubVT, i);
+
----------------
mzolotukhin wrote:
> Was it supposed to be `+=`?
Fixed.

================
Comment at: include/llvm/CodeGen/BasicTTIImpl.h:558-564
@@ +557,9 @@
+
+      unsigned ExtSubCost = 0;
+      for (unsigned i = 0; i < NumSubElts; i++)
+        ExtSubCost = getVectorInstrCost(Instruction::ExtractElement, SubVT, i);
+      Cost += Delta * ExtSubCost;
+
+      for (unsigned i = 0; i < NumElts; i++)
+        Cost += getVectorInstrCost(Instruction::InsertElement, VT, i);
+    }
----------------
mzolotukhin wrote:
> What's the difference between computing costs for stores and loads?
Say we interleaved store two vectors:
     %v0_v1 = shuffle %v0, %v1, <0, 4, 1, 5, 2, 6, 3, 7>
     store <8 x i32> %v0_v1, <8 x i32>*
It has two main differences:
(1) The direction is different. It firstly extract elements from two sub vectors and then insert into a wide vector (interleaved Load firstly extract elements from a wide vector and then insert into sub vectors)
(2) The interleaved store doesn't allow gaps, which means we must have member of index 0 (with even elements) and member of index 1 (with odd elements). But interleaved load allows gaps, it could only load even elements such as:
     %vec = load <8 x i32>, <8 x i32>* %ptr
     %v0 = shuffle <8 x i32> %vec, undef, <0, 2, 4, 6>
This is still beneficial and cheaper than scalar loads.

I've refactored this function to add a new parameter of indices for interleaved load.

http://reviews.llvm.org/D9368

EMAIL PREFERENCES
  http://reviews.llvm.org/settings/panel/emailpreferences/