[PATCH] D32658: Supports lowerInterleavedStore() in X86InterleavedAccess.

David Kreitzer via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed May 10 10:26:08 PDT 2017


DavidKreitzer added inline comments.


================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:115
 
   return true;
 }
----------------
This routine makes some assumptions about the nature of the re-interleave ShuffleVectorInst. The optimized code sequence you are emitting is only correct for a shuffle mask that is exactly {0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15}. That is the common case, but InterleavedAccessPass matches more general masks than that.

Consider the following example:

```
define void @store_factorf64_4(<16 x double>* %ptr, <4 x double> %v0, <4 x double> %v1, <4 x double> %v2, <4 x double> %v3) {
  %s0 = shufflevector <4 x double> %v0, <4 x double> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %s1 = shufflevector <4 x double> %v2, <4 x double> %v3, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %interleaved.vec = shufflevector <8 x double> %s0, <8 x double> %s1, <16 x i32> <i32 12, i32 8, i32 4, i32 0, i32 13, i32 9, i32 5, i32 1, i32 14, i32 10, i32 6, i32 2, i32 15, i32 11, i32 7, i32 3>
  store <16 x double> %interleaved.vec, <16 x double>* %ptr, align 16
  ret void
}
```
This is identical to your store_factorf64_4 test case except that I've changed the mask for %interleaved.vec. With the current implementation, this example still gets optimized and compiles to the exact same code as your original test case, which is obviously incorrect.
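
For reference, here is a minimal sketch (plain C++, not tied to this patch's structure) of the strict check the routine would need before emitting the optimized sequence; isTransposeReInterleaveMask is a hypothetical helper name, not something in the patch:

```
#include "llvm/ADT/ArrayRef.h"
using namespace llvm;

// Hypothetical helper: returns true only if Mask is exactly the
// {0, 4, 8, 12, 1, 5, 9, 13, ...} transpose pattern the lowering assumes.
static bool isTransposeReInterleaveMask(ArrayRef<int> Mask, unsigned Factor) {
  unsigned NumElts = Mask.size();
  unsigned LaneLen = NumElts / Factor;
  for (unsigned Lane = 0; Lane < Factor; ++Lane)
    for (unsigned Elt = 0; Elt < LaneLen; ++Elt)
      // Output lane 'Lane' must pick element 'Lane' of each input group.
      if (Mask[Lane * LaneLen + Elt] != int(Elt * Factor + Lane))
        return false;
  return true;
}
```

The mask in the example above fails this check at index 0 (it starts with 12 rather than 0), so the routine would correctly bail out on it.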



================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:137
+    // Generate N shuffles of T type
+    for (unsigned i = 0; i < NumSubVectors; i++)
+      DecomposedVectors.push_back(
----------------
++i is preferred.


================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:141
+              Op0, Op1,
+              createSequentialMask(Builder, i * Factor, NumSubVectors, 0))));
+    return;
----------------
This call to createSequentialMask looks incorrect. I think you want something like this:

```
unsigned NumSubVectorElements = VecTy->getVectorNumElements() / NumSubVectors;
createSequentialMask(Builder, i * NumSubVectorElements, NumSubVectorElements, 0);
```

What you have written only works because Factor, NumSubVectors, and NumSubVectorElements all happen to be the same value (i.e., 4) in the cases you currently support.
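
To make the divergence concrete, here is a small standalone illustration (plain C++ modeling only the {Start, Start+1, ...} part of createSequentialMask) for a hypothetical <32 x i8> wide vector split into 4 sub-vectors:

```
#include <cstdio>
#include <vector>

// Standalone model of the sequential part of createSequentialMask: it
// produces {Start, Start+1, ..., Start+NumInts-1} (trailing undefs omitted).
static std::vector<int> sequentialMask(unsigned Start, unsigned NumInts) {
  std::vector<int> Mask;
  for (unsigned j = 0; j < NumInts; ++j)
    Mask.push_back(int(Start + j));
  return Mask;
}

int main() {
  // Hypothetical wide vector <32 x i8> split by Factor = 4: NumSubVectors = 4
  // and each sub-vector holds 32 / 4 = 8 elements.
  const unsigned Factor = 4, NumSubVectors = 4, NumSubVectorElements = 8;
  for (unsigned i = 0; i < NumSubVectors; ++i) {
    // As written in the patch: start at i * Factor, take NumSubVectors elements.
    std::vector<int> AsWritten = sequentialMask(i * Factor, NumSubVectors);
    // As suggested above: start at i * NumSubVectorElements, take that many.
    std::vector<int> Suggested =
        sequentialMask(i * NumSubVectorElements, NumSubVectorElements);
    printf("i=%u: as written {%d..%d}, suggested {%d..%d}\n", i,
           AsWritten.front(), AsWritten.back(), Suggested.front(),
           Suggested.back());
  }
  return 0;
}
```

For <16 x double> with Factor = 4 the two computations produce identical masks, which is why your current tests still pass.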


================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:198
+  Type *ShuffleEltTy = ShuffleTy->getVectorElementType();
+  VectorType *SubVecTy = VectorType::get(ShuffleEltTy, Factor);
 
----------------
I think it is incorrect to use Factor here. SubVecTy is not <Factor x ShuffleEltTy> but rather <(Wide Vector # Elems / Factor) x ShuffleEltTy>. This only works because both computations evaluate to 4 in the cases you currently support.

To illustrate what I mean, suppose instead we were optimizing 4 interleaved i8 vectors with a wide vector type of <128 x i8>. In this case, SubVecTy should be <32 x i8>, but your current code will compute it as <4 x i8>.
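
Something along these lines (a sketch, assuming ShuffleTy is the wide vector type) computes the sub-vector type correctly in both cases:

```
// Number of elements in the wide vector: 16 for <16 x double>,
// 128 for the hypothetical <128 x i8> case above.
unsigned WideNumElts = ShuffleTy->getVectorNumElements();
Type *ShuffleEltTy = ShuffleTy->getVectorElementType();
// Each of the Factor sub-vectors gets WideNumElts / Factor elements:
// <4 x double> in the current tests, <32 x i8> in the i8 example.
VectorType *SubVecTy = VectorType::get(ShuffleEltTy, WideNumElts / Factor);
```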


================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:212
+    // transposed-interleaved vectors.
+    for (unsigned i = 0, e = Shuffles.size(); i < e; i++)
+      Shuffles[i]->replaceAllUsesWith(TransposedVectors[Indices[i]]);
----------------
++i is preferred.


================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:218
+
+  // Lower a group of interleaved stores
+  // Decompose interleaved wide shuffle
----------------
Please add '.' at the end of each sentence.


================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:270
+  for (unsigned i = 0; i < Factor; i++)
+    Indices.push_back(i);
+
----------------
This array of indices is meaningless for the store optimization, right? Why not simply pass an empty ArrayRef to the X86InterleavedAccessGroup constructor?
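
That is, something like this (a sketch; the exact constructor signature is whatever this patch declares):

```
// Indices are unused on the store path, so pass an empty ArrayRef
// instead of materializing {0, 1, ..., Factor - 1}.
X86InterleavedAccessGroup Grouped(SI, Shuffles, ArrayRef<unsigned>(), Factor,
                                  Subtarget, Builder);
return Grouped.isSupported() && Grouped.lowerIntoOptimizedSequence();
```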


https://reviews.llvm.org/D32658

