[PATCH] D54468: [LoadStoreVectorizer] Fix infinite loop in reorder.

Tue Feb 12 19:21:47 PST 2019

rtereshin added inline comments.

================
Comment at: lib/Transforms/Vectorize/LoadStoreVectorizer.cpp:497
+          i--;
+          continue;
+        }
----------------
Hi Bevin,

Thanks for looking into this, apologies it took me so long to notice there is a review on me.

I don't think this is a reliable fix. It appears that it will only work if `SimplifyInstruction` is capable of simplifying-away all the "false" data dependencies between memory operations `Vectorizer::isConsecutiveAccess` is able to prove consecutive. I don't think there is anything (or should be) that guarantees such parity.

To illustrate the problem, let me start with a minimized version of the test you're adding in this patch, for which the `SimplifyInstruction`-based fix does work:
```lang=llvm
target triple = "aarch64"

; Function Attrs: noinline nounwind
define i32 @test([2 x i32]* %array) #0 {
entry:
  %arrayidx = getelementptr inbounds [2 x i32], [2 x i32]* %array, i32 0, i32 1
  %t = load i32, i32* %arrayidx, align 4
  %rem = urem i32 %t, 1
  %arrayidx2 = getelementptr inbounds [2 x i32], [2 x i32]* %array, i32 %rem, i32 0
  %v = load i32, i32* %arrayidx2, align 4
  %r = add i32 %v, %t
  ret i32 %r
}

attributes #0 = { noinline nounwind }
```
The false dependency is created by `%rem = urem i32 %t, 1` and it's trivial enough for `SimplifyInstruction` to constant fold, eliminating the dependency.

Here is the same test with the expression made more complex every so slightly:
```lang=llvm
target triple = "aarch64"

; Function Attrs: noinline nounwind
define i32 @test([2 x i32]* %array, i32 %idx) #0 {
entry:
  %ptr1 = getelementptr inbounds [2 x i32], [2 x i32]* %array, i32 %idx, i32 1
  %t = load i32, i32* %ptr1, align 4
  %t2 = mul i32 %t, 2
  %idx.p.2t = add i32 %idx, %t2
  %idx.p.t = sub i32 %idx.p.2t, %t
  %idx.another = sub i32 %idx.p.t, %t
  %ptr0 = getelementptr inbounds [2 x i32], [2 x i32]* %array, i32 %idx.another, i32 0
  %v = load i32, i32* %ptr0, align 4
  %r = add i32 %v, %t
  ret i32 %r
}

attributes #0 = { noinline nounwind }
``` 
The expression for `i32 %idx.another` is roughly `(((2 * %t) + %idx) - %t) - %t` and it's trivial enough for ScalarEvolution alone to see through:
```
  %ptr1 = getelementptr inbounds [2 x i32], [2 x i32]* %array, i32 %idx, i32 1
  -->  (4 + (8 * (sext i32 %idx to i64))<nsw> + %array) U: full-set S: full-set
  %t = load i32, i32* %ptr1, align 4
  -->  %t U: full-set S: full-set
  %t2 = mul i32 %t, 2
  -->  (2 * %t) U: [0,-1) S: [-2147483648,2147483647)
  %idx.p.2t = add i32 %idx, %t2
  -->  ((2 * %t) + %idx) U: full-set S: full-set
  %idx.p.t = sub i32 %idx.p.2t, %t
  -->  (%idx + %t) U: full-set S: full-set
  %idx.another = sub i32 %idx.p.t, %t
  -->  %idx U: full-set S: full-set
  %ptr0 = getelementptr inbounds [2 x i32], [2 x i32]* %array, i32 %idx.another, i32 0
  -->  ((8 * (sext i32 %idx to i64))<nsw> + %array)<nsw> U: full-set S: full-set
```
but it isn't trivial enough for `SimplifyInstruction` to eliminate the false dependency: LSV hangs on this example with and w/o this patch both.

If we pursue roughly the same approach, we could use ScalarEvolution Expander instead: if we re-materialize the SCEV of the pointer being used when we issue the vectorized version of the load or store (and keep `reorder` method as it currently is TOT), we could probably avoid the problem more reliably just because `Vectorizer::isConsecutiveAccess` is largely based on ScalarEvolution and there is more parity between what they can and can not do (note above the SCEV for `%ptr0`: it has the false dependency on `%t` fully eliminated. If re-materialized from SCEV, it will be just fine).

However, it won't do entirely either. `Vectorizer::isConsecutiveAccess` does more than just applying SE. Most notably, it can see through `select`s, even nested ones. Here's a test-case, which is mostly derived from the previous one:
```lang=llvm
target triple = "aarch64"

; Function Attrs: noinline nounwind
define i32 @test(i32* %array, i32 %idx, i1 %cond, i1 %cond2) #0 {
entry:
  %idx.p.7 = add nsw i32 %idx, 7
  %idx.p.7.sext = sext i32 %idx.p.7 to i64
  %ptr1.dummy = getelementptr inbounds i32, i32* %array, i64 21
  %ptr1.true = getelementptr inbounds i32, i32* %array, i64 1
  %ptr1.false = getelementptr inbounds i32, i32* %array, i64 %idx.p.7.sext
  %ptr1.tmp = select i1 %cond, i32* %ptr1.true, i32* %ptr1.false
  %ptr1 = select i1 %cond2, i32* %ptr1.tmp, i32* %ptr1.dummy
  %t = load i32, i32* %ptr1, align 4
  %t2 = mul i32 %t, 2
  %idx.p.2t = add i32 %idx, %t2
  %idx.p.t = sub i32 %idx.p.2t, %t
  %idx.another = sub i32 %idx.p.t, %t
  %idx.p.6 = add i32 %idx.another, 6
  %idx.p.6.sext = sext i32 %idx.p.6 to i64
  %ptr0.dummy = getelementptr inbounds i32, i32* %array, i64 20
  %ptr0.true = getelementptr inbounds i32, i32* %array, i64 0
  %ptr0.false = getelementptr inbounds i32, i32* %array, i64 %idx.p.6.sext
  %ptr0.tmp = select i1 %cond, i32* %ptr0.true, i32* %ptr0.false
  %ptr0 = select i1 %cond2, i32* %ptr0.tmp, i32* %ptr0.dummy
  %v = load i32, i32* %ptr0, align 4
  %r = add i32 %v, %t
  ret i32 %r
}

attributes #0 = { noinline nounwind }
```
As before, it hangs LSV both w/ and w/o this patch applied. What's more, however, in this case a top-level SCEV of the `%ptr0` is nowhere near being free of the false dependency on `%t`:
```
  %idx.p.7 = add nsw i32 %idx, 7
  -->  (7 + %idx) U: full-set S: full-set
  %idx.p.7.sext = sext i32 %idx.p.7 to i64
  -->  (sext i32 (7 + %idx) to i64) U: [-2147483648,2147483648) S: [-2147483648,2147483648)
  %ptr1.dummy = getelementptr inbounds i32, i32* %array, i64 21
  -->  (84 + %array)<nsw> U: full-set S: full-set
  %ptr1.true = getelementptr inbounds i32, i32* %array, i64 1
  -->  (4 + %array)<nsw> U: full-set S: full-set
  %ptr1.false = getelementptr inbounds i32, i32* %array, i64 %idx.p.7.sext
  -->  ((4 * (sext i32 (7 + %idx) to i64))<nsw> + %array)<nsw> U: full-set S: full-set
  %ptr1.tmp = select i1 %cond, i32* %ptr1.true, i32* %ptr1.false
  -->  %ptr1.tmp U: full-set S: full-set
  %ptr1 = select i1 %cond2, i32* %ptr1.tmp, i32* %ptr1.dummy
  -->  %ptr1 U: full-set S: full-set
  %t = load i32, i32* %ptr1, align 4
  -->  %t U: full-set S: full-set
  %t2 = mul i32 %t, 2
  -->  (2 * %t) U: [0,-1) S: [-2147483648,2147483647)
  %idx.p.2t = add i32 %idx, %t2
  -->  ((2 * %t) + %idx) U: full-set S: full-set
  %idx.p.t = sub i32 %idx.p.2t, %t
  -->  (%idx + %t) U: full-set S: full-set
  %idx.another = sub i32 %idx.p.t, %t
  -->  %idx U: full-set S: full-set
  %idx.p.6 = add i32 %idx.another, 6
  -->  (6 + %idx) U: full-set S: full-set
  %idx.p.6.sext = sext i32 %idx.p.6 to i64
  -->  (sext i32 (6 + %idx) to i64) U: [-2147483648,2147483648) S: [-2147483648,2147483648)
  %ptr0.dummy = getelementptr inbounds i32, i32* %array, i64 20
  -->  (80 + %array)<nsw> U: full-set S: full-set
  %ptr0.true = getelementptr inbounds i32, i32* %array, i64 0
  -->  %array U: full-set S: full-set
  %ptr0.false = getelementptr inbounds i32, i32* %array, i64 %idx.p.6.sext
  -->  ((4 * (sext i32 (6 + %idx) to i64))<nsw> + %array)<nsw> U: full-set S: full-set
  %ptr0.tmp = select i1 %cond, i32* %ptr0.true, i32* %ptr0.false
  -->  %ptr0.tmp U: full-set S: full-set
  %ptr0 = select i1 %cond2, i32* %ptr0.tmp, i32* %ptr0.dummy
```
So rematerializing just the top-level SCEV with the Expander won't help here either.

Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D54468/new/

https://reviews.llvm.org/D54468