[PATCH] [LoopVectorize]Teach Loop Vectorizer about interleaved memory access
Michael Zolotukhin
mzolotukhin at apple.com
Fri May 22 11:40:40 PDT 2015
================
Comment at: lib/Analysis/LoopAccessAnalysis.cpp:988
@@ +987,3 @@
+
+ for (auto II = std::next(I); II != E; ++II) {
+ Instruction *B = II->first;
----------------
HaoLiu wrote:
> HaoLiu wrote:
> > mzolotukhin wrote:
> > > I think you missed my question earlier on this part.
> > >
> > > This algorithm is clearly quadratic, which is dangerous for compile time. We need to either make it linear, or add some limits (e.g. if number of accesses > 100, give up and don't even try to group them).
> > >
> > > Also, if we separate it into two phases: 1) grouping accesses, 2) finding write-after-write pairs, I think we can make it more efficient. Since it's sufficient to compare against any access from a group, we'll be quadratic on the number of groups, not the number of accesses. And when we look for WaW pairs, we can only look only inside a group, not across all accesses in the loop. Does it sound reasonable, or am I missing something?
> > >
> > For the first question. Previously, I thought you meant to improve the insert member method.
> > Anyway, I think the algorithm is quadratic, the paper says:
> > This relies on the pivoting order in which the pairs of accesses are processed and
> > reduces the complexity of the grouping algorithm: we iterate over O(n^2 ) pairs of
> > loads and stores (where n is the number of loads and stores)
> >
> > But I think we don't need to worry about compile time:
> > (1) The "n" is only the number of stride loads/stores (|Stride| > 1)
> > (2) For a case with a lot of loads and stores, it possiblely can not pass previous dependence check, because it could have a lot of runtime memory check or have memory conflict. So it returns early.
> >
> > On the other hand, if a loop indeed can pass dependence check. say if it has 100 loads, we should analyze the interleave accesses as vectorization on the possible interleave groups can really give us a lot of benefit.
> For the second question. I think it is slightly different. We should not only consider about Group and Group, a group and a single store may break the dependence.
> Say, A[i] = a; // (1)
> B[i] = b; // (2)
> A[i+1] = c; // (3)
> The combine of (1) and (3) may break the dependence on "A[i] - B[i]" pair even though B[i] is not in a group.
> So I think we should search all the write-write pairs. If we separate into two phases, we'll have to calculate the distance again. I think it may increase the compile time.
Hm.. I'm reading the paper from here:
http://domino.research.ibm.com/library/cyberdig.nsf/papers/EFD6206D4D9A9186852570D2005D9373/$File/H-0235.pdf
And it says:
```
This relies on the pivoting order in which pairs of accesses are processed, and implies that the complexity of the grouping algorithm is linear in the number of loads and stores.
```
I'm not sure if it's safe to assume that the dependence check won't be passed. E.g. we might have the following case:
```
struct A {
int x,y;
};
struct A points[100];
for (int iter = 0; iter < 10000; iter++) { // not to be unrolled
for (int i = 0; i < 100; i++) { // this loop will be completely unrolled
points[i].x++;
}
// some other computations on the points
}
```
When the innermost loop is unrolled, we get 100 interleaving accesses, which will require 10^4 pairwise comparisons. I agree that we theoretically can optimize it better if we are not limited on time. However, if a programmer decides to increase number of points by a factor of 10, the compile time might slow down 100x, which won't be acceptable.
I think we have a lot of optimizations that could theoretically find the best solution if they are allowed to consume infinite number of time, however, we usually have some thresholds for them (such thresholds often don't prevent one from catching the cases the optimization is aimed for, but they guard against rare but very nasty cases when the compiler might work for hours).
For the second question, my understanding is that we should have two groups: {A[i], A[i+1]} and {B[i]}. Then we check if an access from the first group can alias with an access from the second group - I think we don't have to check every access in a group for this kind of reasoning. I.e. if leaders of the groups might alias, then any member of the first group can alias with any member of the second group (and if the leaders don't alias, then we can probably assume that the groups are totally independent of each other). Am I missing something here?
================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:1823
@@ +1822,3 @@
+ unsigned Index = Group->getIndex(Instr);
+ for (unsigned Part = 0; Part < UF; Part++) {
+ // Notice current instruction could be any index. Need to adjust the address
----------------
HaoLiu wrote:
> mzolotukhin wrote:
> > Should we iterate to `VF` instead of `UF` here? `PtrParts` contains `VF` elements, right?
> Slightly different.
> 'PtrParts' contains UF vectors, and each vector has VF elements.
> Here, firstly we want the pointer vector in current unroll part, I.e. 'PtrParts[Part]'.
> Then for each pointer vector ''PtrParts[Part]', we want the pointer in lane 0.
>
> E.g.
> for (unsigned i = 0; i < 1024; i+=2) {
> A[i] = i;
> }
> If VF is 4 and UF is 2. Then we'll have two unroll parts:
> part 1:
> A[0] = 0;
> A[2] = 2;
> A[4] = 4;
> A[6] = 6;
>
> part 2:
> A[8] = 8;
> ...
> For this case, 'PtrParts' contains two vectors:
> Vector 1 consists pointers to: A[0], A[2], A[4], A[6]
> Vector 2 consists pointers to: A[8], A[10], A[12], A[14]
>
> Then we extract lane 0 from Vector 1 to get a pointer pointing to A[0], so that we can use it to load A[0-7] and after that, we can extract A[0, 2, 4, 6] from A[0-7].
>
> If the loads/stores are reverse, we need the pointer in last lane. E.g. the access sequences will be "A[1022], A[1020], A[1018], A[1016]", the last lane is the pointer to A[1016]". We can load A[1016-1023] to extract A[1016, 1018, 1020, 1022].
Thanks for the explanation, sounds good.
================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:1904-1906
@@ +1903,5 @@
+
+ // If this member has different type, cast it to a unified type.
+ if (StoredVec->getType() != SubVT)
+ StoredVec = Builder.CreateBitOrPointerCast(StoredVec, SubVT);
+
----------------
HaoLiu wrote:
> mzolotukhin wrote:
> > Could you please provide an example when this triggers? I'm a bit concerned about legality of this cast.
> A test case called @int_float_struct tests the interleaved load group:
> struct IntFloat {
> int a;
> float b;
> };
>
> void int_float_struct(struct IntFloat *A) {
> int SumA;
> float SumB;
> for (unsigned i = 0; i < 1024; i++) {
> SumA += A[i].a;
> SumB += A[i].b;
> }
> SA = SumA;
> SB = SumB;
> }
> You can find this case in "interleaved-accesses.ll". The int-float can be long-double, or any pointer types, as long as the structure has elements of the same size.
>
> Actually I have better case:
> struct IntFloat {
> int a;
> float b;
> };
>
> void int_float_struct(struct IntFloat *A, struct IntFloat *B) {
> for (unsigned i = 0; i < 1024; i++) {
> B[i].a = A[i].a + 1;
> B[i].b = A[i].b + 1.0;
> }
> }
> Unfortunately, this case can not be vectorized as it can not pass dependence check. As we have following code in isDependent() (LoopAccessAnalysis.c):
> if (ATy != BTy) {
> DEBUG(dbgs() <<
> "LAA: ReadWrite-Write positive dependency with different types\n");
> return Dependence::Unknown;
> }
> I think this is a bug. We don't need to check types as we will check dependences based on the memory object size. by removing such code I don't see any failures in benchmarks.
> Anyway, as such code, I can not give a test for interleaved write group.
It makes sense, thanks for the explanation!
http://reviews.llvm.org/D9368
EMAIL PREFERENCES
http://reviews.llvm.org/settings/panel/emailpreferences/
More information about the llvm-commits
mailing list