[PATCH] [LoopVectorize] Teach Loop Vectorizer about interleaved memory access

Hao Liu Hao.Liu at arm.com
Sun May 24 07:43:32 PDT 2015


================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:988
@@ +987,3 @@
+
+    for (auto II = std::next(I); II != E; ++II) {
+      Instruction *B = II->first;
----------------
mzolotukhin wrote:
> HaoLiu wrote:
> > HaoLiu wrote:
> > > mzolotukhin wrote:
> > > > I think you missed my question earlier on this part.
> > > > 
> > > > This algorithm is clearly quadratic, which is dangerous for compile time. We need to either make it linear or add some limits (e.g., if the number of accesses is > 100, give up and don't even try to group them).
> > > > 
> > > > Also, if we separate it into two phases: 1) grouping accesses, 2) finding write-after-write pairs, I think we can make it more efficient. Since it's sufficient to compare against any access from a group, we'll be quadratic in the number of groups, not the number of accesses. And when we look for WaW pairs, we need only look inside a group, not across all accesses in the loop. Does that sound reasonable, or am I missing something?
> > > > 
> > > For the first question: previously, I thought you meant improving the insert-member method.
> > > Anyway, I think the algorithm is quadratic; the paper says:
> > >     This relies on the pivoting order in which the pairs of accesses are processed and
> > >     reduces the complexity of the grouping algorithm: we iterate over O(n^2) pairs of
> > >     loads and stores (where n is the number of loads and stores)
> > > 
> > > But I think we don't need to worry about compile time:
> > >     (1) The "n" is only the number of strided loads/stores (|Stride| > 1).
> > >     (2) A case with many loads and stores possibly cannot pass the earlier dependence check anyway, because it could require many runtime memory checks or have memory conflicts, so we return early.
> > > 
> > > On the other hand, if a loop can indeed pass the dependence check, say with 100 loads, we should analyze the interleaved accesses, as vectorizing the possible interleave groups can give us a lot of benefit.
> > For the second question: I think it is slightly different. We should not only consider group-against-group pairs; a group and a single store may also break a dependence.
> > Say,  A[i] = a;    // (1)
> >       B[i] = b;    // (2)
> >       A[i+1] = c;  // (3)
> > Combining (1) and (3) may break the dependence on the "A[i] - B[i]" pair even though B[i] is not in a group.
> > So I think we should search all write-write pairs. If we separate this into two phases, we'll have to calculate the distances again, which I think may increase compile time.
> Hm.. I'm reading the paper from here:
> http://domino.research.ibm.com/library/cyberdig.nsf/papers/EFD6206D4D9A9186852570D2005D9373/$File/H-0235.pdf
> And it says:
> ```
> This relies on the pivoting order in which pairs of accesses are processed, and implies that the complexity of the grouping algorithm is linear in the number of loads and stores.
> ```
> 
> I'm not sure it's safe to assume that the dependence check won't pass. E.g., we might have the following case:
> ```
> struct A {
>   int x, y;
> };
> struct A points[100];
> for (int iter = 0; iter < 10000; iter++) { // not to be unrolled
>   for (int i = 0; i < 100; i++) { // this loop will be completely unrolled
>     points[i].x++;
>   }
>   // some other computations on the points
> }
> ```
> When the innermost loop is unrolled, we get 100 interleaving accesses, which will require ~10^4 pairwise comparisons. I agree that we could theoretically optimize this better if we were not limited in time. However, if a programmer decides to increase the number of points by a factor of 10, compilation might slow down 100x, which won't be acceptable.
> I think we have a lot of optimizations that could theoretically find the best solution if they were allowed to consume an unbounded amount of time; however, we usually have thresholds for them (such thresholds often don't prevent us from catching the cases the optimization is aimed at, but they guard against rare but very nasty cases where the compiler might work for hours).
> 
> For the second question, my understanding is that we should have two groups: {A[i], A[i+1]} and {B[i]}. Then we check whether an access from the first group can alias with an access from the second group - I don't think we have to check every access in a group for this kind of reasoning. I.e., if the leaders of the groups might alias, then any member of the first group can alias with any member of the second group (and if the leaders don't alias, then we can probably assume the groups are totally independent of each other). Am I missing something here?
I agree with you. I can also find many similar thresholds in the existing code. I'll add such a limit to the patch.
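
Roughly, I'm thinking of a guard like the following minimal sketch (the name MaxInterleaveCandidates, the value 100, and the surrounding types are placeholders, not the actual patch code):
```
#include <vector>

// Placeholder descriptor for a strided load/store; the real patch keeps
// its own per-access bookkeeping.
struct StrideAccess;

// Hypothetical cap; both the name and the value are illustrative only.
static const unsigned MaxInterleaveCandidates = 100;

static void analyzeInterleaving(const std::vector<StrideAccess *> &Accesses) {
  // Give up on pathologically large inputs so that the O(n^2) pairwise
  // grouping below cannot dominate compile time.
  if (Accesses.size() > MaxInterleaveCandidates)
    return;

  // ... pairwise grouping of the strided accesses, as in the patch ...
}
```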

I think you are right about the second question. I'll try to implement this in the next update of the patch.
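
To double-check that I understand the direction, below is a minimal sketch of the leader-based check (all names here - Access, InterleaveGroup, mayAlias - are stand-ins, not the patch's actual interfaces):
```
#include <vector>

struct Access; // stands in for a load/store descriptor

// A group built around its first inserted access (the "leader").
struct InterleaveGroup {
  Access *Leader;
  std::vector<Access *> Members;
};

// Stand-in for a real aliasing/dependence query; a real implementation
// would consult the dependence analysis. Conservatively answers true.
static bool mayAlias(const Access *, const Access *) { return true; }

// If the leaders cannot alias, assume the groups are independent and
// skip the member-by-member comparisons entirely. This makes the
// conflict search quadratic in the number of groups rather than in the
// number of accesses.
static bool groupsMayConflict(const InterleaveGroup &G1,
                              const InterleaveGroup &G2) {
  return mayAlias(G1.Leader, G2.Leader);
}
```
A singleton access like B[i] above would just be a group of one, so the same check also covers the group-versus-single-store case I mentioned earlier.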

http://reviews.llvm.org/D9368
