[PATCH] D19984: [LV] Handle RAW dependences in interleaved access analysis

Thu May 12 15:19:59 PDT 2016

mssimpso added a comment.

Hi Guys,

I've been thinking through these dependence issues more carefully. I haven't finished working through all the permutations yet, but I've come across another, somewhat unrelated, issue. I thought I'd post an update about that and respond to some of Adam's questions in the meantime. First, the questions:

> No, I was asking about *non-zero* distance deps specifically. The current patch only handles zero-distance deps. So my question was whether for non-zero distance we still rely on the store-to-load forwarding detection code to make the dep unsafe for vectorization.

For positive dependences, LAA prevents vectorization if the distance is smaller than the minimum needed to vectorize at all; otherwise, the distance determines the maximum safe VF. For positive dependences between strided accesses, LAA first tries to prove independence (the accesses are independent if the distance is not a multiple of the stride). Non-positive dependences are allowed, and LAA checks the RAWs among them for store-to-load forwarding conflicts. The original analysis assumed that this store-to-load forwarding check ruled out all RAWs, which was incorrect. So we need to consider the non-positive RAWs (as Silviu said, we should be able to exclude the WAR and WAW cases).
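As a rough illustration of the classification above, here is a simplified C model. It is a sketch, not LAA's actual code: `classify_dep`, `dep_kind`, and `dep_result` are hypothetical names, distances are in bytes with the sign convention used above, strides are in elements, and the minimum distance for VF = 2 (one strided iteration plus one element) is an assumption.

```c
/* Hypothetical, simplified model of the dependence classification described
 * above. Not LAA's real implementation. Distances are in bytes (sign as in
 * the text above), type sizes are in bytes, strides are in elements. */
typedef enum {
  DEP_NONPOSITIVE,  /* allowed; RAWs still need extra checks */
  DEP_INDEPENDENT,  /* strided accesses that can never overlap */
  DEP_UNSAFE,       /* distance too short for even VF = 2 */
  DEP_SAFE_BOUNDED  /* safe, but the distance bounds the VF */
} dep_kind;

typedef struct {
  dep_kind kind;
  long max_safe_dist_bytes; /* only meaningful for DEP_SAFE_BOUNDED */
} dep_result;

static dep_result classify_dep(long dist_bytes, long type_bytes, long stride) {
  dep_result r = {DEP_NONPOSITIVE, 0};

  /* Non-positive distances are allowed. RAWs among them still need the
   * store-to-load forwarding check and, per this review, the
   * interleaved-group analysis. */
  if (dist_bytes <= 0)
    return r;

  /* Strided accesses are independent if the distance, measured in
   * elements, is not a multiple of the stride. */
  if (stride > 1 && dist_bytes % type_bytes == 0 &&
      (dist_bytes / type_bytes) % stride != 0) {
    r.kind = DEP_INDEPENDENT;
    return r;
  }

  /* Assumed minimum distance for VF = 2: one strided iteration plus one
   * element. */
  long min_needed = type_bytes * stride + type_bytes;
  if (dist_bytes < min_needed) {
    r.kind = DEP_UNSAFE;
    return r;
  }

  /* Safe, but the distance caps how wide the vectors may be. */
  r.kind = DEP_SAFE_BOUNDED;
  r.max_safe_dist_bytes = dist_bytes;
  return r;
}
```

For the `struct pair` example further down (4-byte fields, stride 2, distance 16 bytes), this model reports a dependence that is safe but bounded at 16 bytes, which is the number the maximum safe VF is derived from.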

> What about moving elements of an interleaved group over other dependent accesses not in the same group?

This would be the RAW case. For example, here S1-L2 is a RAW dependence and L1-L2 form a group:

  L1: load
  S1: store
  L2: load  // L2 would be hoisted above S1.

Alternatively, here S1-L1 is a RAW dependence and S1-S2 form a group:

  S1: store // S1 would be sunk below L1.
  L1: load
  S2: store
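A small C model makes the two reorderings concrete. The names are hypothetical: `m[0]` and `m[1]` stand for two adjacent group members (say `p[i].x` and `p[i].y`), and each pair of functions runs the same accesses once in program order and once in the order the wide group access would impose.

```c
/* Hypothetical model: m[0] and m[1] are two adjacent members of an
 * interleaved group, e.g. p[i].x and p[i].y. */

/* Case 1: L1-L2 form a load group; S1-L2 is a RAW dependence. */
static void load_case_program_order(int *m, int v, int *l1, int *l2) {
  *l1 = m[0]; /* L1 */
  m[1] = v;   /* S1 */
  *l2 = m[1]; /* L2 sees the value S1 just stored */
}

static void load_case_grouped(int *m, int v, int *l1, int *l2) {
  *l1 = m[0]; /* L1 and L2 become one wide load, */
  *l2 = m[1]; /* so L2 is hoisted above S1 and sees the old value */
  m[1] = v;   /* S1 */
}

/* Case 2: S1-S2 form a store group; S1-L1 is a RAW dependence. */
static int store_case_program_order(int *m, int v, int w) {
  m[0] = v;      /* S1 */
  int l1 = m[0]; /* L1 sees the value S1 just stored */
  m[1] = w;      /* S2 */
  return l1;
}

static int store_case_grouped(int *m, int v, int w) {
  int l1 = m[0]; /* L1 now runs first and sees the old value... */
  m[0] = v;      /* ...because S1 is sunk down to S2 to form one wide store */
  m[1] = w;      /* S2 */
  return l1;
}
```

In both cases the final memory contents match, but the load observes a stale value once the group is formed, which is why these RAWs have to be taken into account when forming groups.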

The other issue I've uncovered is related to the maximum safe VF. We set this based on the positive dependence distance LAA computes. But for interleaved accesses, the wide loads and stores actually contain VF * IF elements, where IF is the interleave factor (2 for the `struct pair` below). The idea is that each member of the group gets its VF elements after they are shuffled out of the wide vector. However, VF * IF can be greater than the maximum safe VF. Here's an example.

  ; for (int i = 1; i < 1000; ++i) {
  ;   p[i + 2].x = p[i].x;
  ;   p[i + 2].y = p[i].y;
  ; }
  %struct.pair = type { i32, i32 }

  ; The function name is illustrative.
  define void @foo(%struct.pair* %p) {
  entry:
    br label %for.body

  for.body:
    %indvars.iv = phi i64 [ 1, %entry ], [ %indvars.iv.next, %for.body ]
    %x1 = getelementptr inbounds %struct.pair, %struct.pair* %p, i64 %indvars.iv, i32 0
    %0 = load i32, i32* %x1, align 4
    %1 = add nuw nsw i64 %indvars.iv, 2
    %x4 = getelementptr inbounds %struct.pair, %struct.pair* %p, i64 %1, i32 0
    store i32 %0, i32* %x4, align 4
    %y = getelementptr inbounds %struct.pair, %struct.pair* %p, i64 %indvars.iv, i32 1
    %2 = load i32, i32* %y, align 4
    %y10 = getelementptr inbounds %struct.pair, %struct.pair* %p, i64 %1, i32 1
    store i32 %2, i32* %y10, align 4
    %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
    %exitcond = icmp eq i64 %indvars.iv.next, 1000
    br i1 %exitcond, label %for.cond.cleanup, label %for.body

  for.cond.cleanup:
    ret void
  }

We currently generate `<8 x i32>` loads and stores (VF = 4 with an interleave factor of 2), which cover 32 bytes each, but the maximum safe dependence distance is only 16 bytes (two iterations of the 8-byte `struct pair`). So I think we are generating incorrect code here, and we should probably take interleaved accesses into account when selecting the VF.
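The problem can be reproduced with a small C model of the loop above. This is a sketch, not LV's code generation: `vectorized_model` (a hypothetical name) assumes that, for each block of VF iterations, the wide load of all `x`/`y` pairs executes before the wide store, which is what a single `<2*VF x i32>` load/store pair implies. `max_vf_for_group` is likewise a hypothetical helper showing the arithmetic: 16 safe bytes allow 4 `i32` elements, so a group with factor 2 leaves room for VF = 2 only.

```c
struct pair { int x, y; };

/* Scalar reference for the loop above; p must hold n + 2 elements. */
static void scalar_loop(struct pair *p, int n) {
  for (int i = 1; i < n; ++i) {
    p[i + 2].x = p[i].x;
    p[i + 2].y = p[i].y;
  }
}

/* Hypothetical model of the interleaved vectorization: for each block of
 * vf iterations, the wide load (all x/y pairs) happens before the wide
 * store. Leftover iterations run in a scalar epilogue. Assumes vf <= 16. */
static void vectorized_model(struct pair *p, int n, int vf) {
  int i = 1;
  for (; i + vf <= n; i += vf) {
    struct pair tmp[16];
    for (int k = 0; k < vf; ++k)
      tmp[k] = p[i + k];     /* wide <2*vf x i32> load */
    for (int k = 0; k < vf; ++k)
      p[i + 2 + k] = tmp[k]; /* wide <2*vf x i32> store */
  }
  for (; i < n; ++i)
    p[i + 2] = p[i];         /* scalar epilogue */
}

/* Hypothetical helper: the largest VF whose wide access (vf * factor
 * elements) still fits in the maximum safe dependence distance. */
static int max_vf_for_group(int max_safe_bytes, int elem_bytes, int factor) {
  return max_safe_bytes / (elem_bytes * factor);
}
```

With VF = 4, each block loads `p[i+2]` and `p[i+3]` before iteration `i` has stored them, which is exactly the two-iteration dependence the 16-byte distance describes; with VF = 2 the model matches the scalar loop.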

================
Comment at: test/Transforms/LoopVectorize/interleaved-accesses.ll:574-579
@@ +573,8 @@
+;
+; void PR27626(struct pair *p, int n) {
+;   for (int i = 0; i < n; i++) {
+;     p->x = p->y;
+;     p->y = p->x;
+;   }
+; }
+
----------------
anemet wrote:
> You mean p[i].x etc in the loop.
Yes, thanks for catching that!

http://reviews.llvm.org/D19984