[PATCH] [LoopVectorize] Teach Loop Vectorizer about interleaved memory access

Hao Liu Hao.Liu at arm.com
Thu May 21 23:45:33 PDT 2015


I've updated the patch, refactoring it according to Michael's comments.

Review please.

Thanks,
-Hao

In http://reviews.llvm.org/D9368#176830, @mzolotukhin wrote:

> Hi Hao,
>
> Please find more remarks from me inline, and I think this is the last bunch of comments from me. With them properly addressed I'm fine with committing the patch. However, I'd also like to hear from Adam regarding the LoopAccessAnalysis part, since he's been working actively on it.
>
> And again, thanks for working on this, it's a much-needed feature!
>
> Michael





================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:988
@@ +987,3 @@
+
+    for (auto II = std::next(I); II != E; ++II) {
+      Instruction *B = II->first;
----------------
mzolotukhin wrote:
> I think you missed my question earlier on this part.
> 
> This algorithm is clearly quadratic, which is dangerous for compile time. We need to either make it linear, or add some limits (e.g. if number of accesses > 100, give up and don't even try to group them).
> 
> Also, if we separate it into two phases: 1) grouping accesses, 2) finding write-after-write pairs, I think we can make it more efficient. Since it's sufficient to compare against any access from a group, we'll be quadratic on the number of groups, not the number of accesses. And when we look for WaW pairs, we can look only inside a group, not across all accesses in the loop. Does it sound reasonable, or am I missing something?
> 
For the first question: previously, I thought you meant improving the insert-member method.
In any case, the algorithm is indeed quadratic; the paper says:
    This relies on the pivoting order in which the pairs of accesses are processed and
    reduces the complexity of the grouping algorithm: we iterate over O(n^2) pairs of
    loads and stores (where n is the number of loads and stores)

But I don't think we need to worry about compile time:
    (1) The "n" is only the number of strided loads/stores (|Stride| > 1).
    (2) A loop with a lot of loads and stores will most likely fail the earlier dependence checks, because it would need many runtime memory checks or would have memory conflicts, so we return early.

On the other hand, if a loop does pass the dependence checks, say with 100 loads, we should still analyze the interleaved accesses, because vectorizing the resulting interleave groups can give us a large benefit.
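Roughly, the loop under discussion has the following shape (a simplified sketch only; 'StrideAccesses' and the loop body are illustrative, not the exact patch code):

    // Pairwise scan over the strided accesses collected earlier: for each
    // access A we visit every later access B, so with n strided loads/stores
    // we examine O(n^2) pairs.
    for (auto I = StrideAccesses.begin(), E = StrideAccesses.end(); I != E; ++I) {
      Instruction *A = I->first;
      for (auto II = std::next(I); II != E; ++II) {
        Instruction *B = II->first;
        // ... try to insert B into A's interleave group, or record a
        // write-after-write conflict, based on the distance and stride ...
      }
    }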

================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:988
@@ +987,3 @@
+
+    for (auto II = std::next(I); II != E; ++II) {
+      Instruction *B = II->first;
----------------
HaoLiu wrote:
> mzolotukhin wrote:
> > I think you missed my question earlier on this part.
> > 
> > This algorithm is clearly quadratic, which is dangerous for compile time. We need to either make it linear, or add some limits (e.g. if number of accesses > 100, give up and don't even try to group them).
> > 
> > Also, if we separate it into two phases: 1) grouping accesses, 2) finding write-after-write pairs, I think we can make it more efficient. Since it's sufficient to compare against any access from a group, we'll be quadratic on the number of groups, not the number of accesses. And when we look for WaW pairs, we can look only inside a group, not across all accesses in the loop. Does it sound reasonable, or am I missing something?
> > 
> For the first question: previously, I thought you meant improving the insert-member method.
> In any case, the algorithm is indeed quadratic; the paper says:
>     This relies on the pivoting order in which the pairs of accesses are processed and
>     reduces the complexity of the grouping algorithm: we iterate over O(n^2) pairs of
>     loads and stores (where n is the number of loads and stores)
> 
> But I don't think we need to worry about compile time:
>     (1) The "n" is only the number of strided loads/stores (|Stride| > 1).
>     (2) A loop with a lot of loads and stores will most likely fail the earlier dependence checks, because it would need many runtime memory checks or would have memory conflicts, so we return early.
> 
> On the other hand, if a loop does pass the dependence checks, say with 100 loads, we should still analyze the interleaved accesses, because vectorizing the resulting interleave groups can give us a large benefit.
For the second question: I think it is slightly different. We can't only consider group-to-group pairs; a group and a single store may also break a dependence.
Say,
    A[i] = a;       // (1)
    B[i] = b;       // (2)
    A[i+1] = c;     // (3)
Combining (1) and (3) may break the dependence between the "A[i] - B[i]" pair even though B[i] is not in any group.
So I think we should search all the write-write pairs. If we split this into two phases, we would have to calculate the distances again, which I think may increase compile time.
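To make that concrete, here is a hypothetical loop (illustrative only, not one of the patch's test cases) where this situation arises:

    // Hypothetical example: if B may alias A, combining (1) and (3) into a
    // single wide store (executed at one program point) can reorder one of
    // them relative to (2), breaking a dependence between (2) and a member
    // of the group, even though (2) itself is not in any group.
    void foo(int *A, int *B, int a, int b, int c) {
      for (unsigned i = 0; i < 1024; i += 2) {
        A[i] = a;       // (1) member 0 of a candidate write group
        B[i] = b;       // (2) a single store, not in any group
        A[i + 1] = c;   // (3) member 1 of the same group
      }
    }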

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:1823
@@ +1822,3 @@
+  unsigned Index = Group->getIndex(Instr);
+  for (unsigned Part = 0; Part < UF; Part++) {
+    // Notice current instruction could be any index. Need to adjust the address
----------------
mzolotukhin wrote:
> Should we iterate to `VF` instead of `UF` here? `PtrParts` contains `VF` elements, right?
It's slightly different.
'PtrParts' contains UF vectors, and each vector has VF elements.
Here, we first want the pointer vector for the current unroll part, i.e. 'PtrParts[Part]'.
Then, for each pointer vector 'PtrParts[Part]', we want the pointer in lane 0.

E.g.
     for (unsigned i = 0; i < 1024; i+=2) {
          A[i] = i;
     }
If VF is 4 and UF is 2, we'll have two unroll parts:
part 1:
      A[0] = 0;
      A[2] = 2;
      A[4] = 4;
      A[6] = 6;

part 2:
      A[8] = 8;
      ...
For this case, 'PtrParts' contains two vectors:
     Vector 1 consists of pointers to: A[0], A[2], A[4], A[6]
     Vector 2 consists of pointers to: A[8], A[10], A[12], A[14]

Then we extract lane 0 from Vector 1 to get a pointer to A[0], use it to load A[0-7], and after that extract A[0, 2, 4, 6] from that wide load.

If the loads/stores are reversed, we need the pointer in the last lane. E.g. the access sequence would be A[1022], A[1020], A[1018], A[1016], so the last lane is the pointer to A[1016]. We can then load A[1016-1023] and extract A[1016, 1018, 1020, 1022].
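Roughly, for part 1 in the example above, the generated sequence looks like this (an illustrative IRBuilder-style sketch only; apart from 'PtrParts', 'Part' and 'Index', the names are made up and not the patch's actual code):

    // Take the lane-0 pointer of this unroll part and rebase it by -Index so
    // it points at the first member of the interleaved chunk (A[0] here).
    Value *Ptr = Builder.CreateExtractElement(PtrParts[Part], Builder.getInt32(0));
    Ptr = Builder.CreateGEP(Ptr, Builder.getInt32(-Index));
    // Do one wide load of VF * Factor = 8 elements, i.e. A[0..7].
    Type *ScalarTy = Builder.getInt32Ty();            // element type of A
    Type *WideVecTy = VectorType::get(ScalarTy, 8);
    Value *WidePtr = Builder.CreateBitCast(Ptr, WideVecTy->getPointerTo());
    Value *WideVec = Builder.CreateLoad(WidePtr);
    // Pick out this member's lanes with a strided shuffle mask <0, 2, 4, 6>,
    // giving the vector { A[0], A[2], A[4], A[6] }.
    SmallVector<Constant *, 4> Mask;
    for (unsigned Lane = 0; Lane < 4; ++Lane)
      Mask.push_back(Builder.getInt32(Lane * 2));
    Value *Member = Builder.CreateShuffleVector(
        WideVec, UndefValue::get(WideVecTy), ConstantVector::get(Mask));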

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:1835-1836
@@ +1834,4 @@
+    // Current pointer is pointed to A[i+2], adjust it to A[i].
+    Value *NewPtr =
+        Builder.CreateExtractElement(PtrParts[Part], Builder.getInt32(0));
+    NewPtr = Builder.CreateGEP(NewPtr, Builder.getInt32(-Index));
----------------
mzolotukhin wrote:
> I'm not sure it's correct if we vectorize *and* unroll the loop. Don't we need to adjust the pointer according to which unrolling-part we are at (instead of using `0`)?
Yes, we do.
As the comment above explains, we get one vector of pointers per unroll part, so 'PtrParts[Part]' already selects the vector for the current part.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:1904-1906
@@ +1903,5 @@
+
+      // If this member has different type, cast it to a unified type.
+      if (StoredVec->getType() != SubVT)
+        StoredVec = Builder.CreateBitOrPointerCast(StoredVec, SubVT);
+
----------------
mzolotukhin wrote:
> Could you please provide an example when this triggers? I'm a bit concerned about legality of this cast.
A test case called @int_float_struct tests the interleaved load group:
    struct IntFloat {
      int a;
      float b;
    };

    int SA;
    float SB;

    void int_float_struct(struct IntFloat *A) {
      int SumA;
      float SumB;
      for (unsigned i = 0; i < 1024; i++) {
        SumA += A[i].a;
        SumB += A[i].b;
      }
      SA = SumA;
      SB = SumB;
    }
You can find this case in "interleaved-accesses.ll". The int/float pair could also be long/double, or any pointer types, as long as the structure's elements all have the same size.
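For that case the wide vector is loaded with a single unified element type, so the float member has to be reinterpreted after it is extracted. Something like the following (an illustrative sketch only, continuing the earlier one; not taken from the patch or the test output):

    // The float member's lanes <1, 3, 5, 7> are extracted from the wide
    // <8 x i32> load as <4 x i32> and then reinterpreted to <4 x float>.
    // The cast is legal precisely because the elements have the same size,
    // which is why same-sized struct members are required.
    SmallVector<Constant *, 4> FMask;
    for (unsigned Lane = 0; Lane < 4; ++Lane)
      FMask.push_back(Builder.getInt32(Lane * 2 + 1));   // lanes <1, 3, 5, 7>
    Value *FloatMemberAsInt = Builder.CreateShuffleVector(
        WideVec, UndefValue::get(WideVecTy), ConstantVector::get(FMask));
    Value *FloatMember = Builder.CreateBitOrPointerCast(
        FloatMemberAsInt, VectorType::get(Builder.getFloatTy(), 4));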

Actually, I have a better case:
    struct IntFloat {
      int a;
      float b;
    };

    void int_float_struct(struct IntFloat *A, struct IntFloat *B) {
      for (unsigned i = 0; i < 1024; i++) {
        B[i].a = A[i].a + 1;
        B[i].b = A[i].b + 1.0;
      }
    }
Unfortunately, this case cannot be vectorized, as it cannot pass the dependence check, because of the following code in isDependent() (LoopAccessAnalysis.cpp):
    if (ATy != BTy) {
      DEBUG(dbgs() <<
            "LAA: ReadWrite-Write positive dependency with different types\n");
      return Dependence::Unknown;
    }
I think this is a bug. We don't need to check the types here, as we already check dependences based on the memory object size; after removing this code I don't see any failures in the benchmarks.
Anyway, because of this code, I cannot provide a test for an interleaved write group.

http://reviews.llvm.org/D9368
