[PATCH] D32658: Supports lowerInterleavedStore() in X86InterleavedAccess.

Wed May 24 15:48:50 PDT 2017

Farhana added inline comments.

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:115

   return true;
 }
----------------
DavidKreitzer wrote:
> This routine is making some assumptions about the nature of the re-interleave ShuffleVectorInst. The optimized code sequence that you are emitting is only correct for a shuffle mask that is exactly {0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15}. That is the common case, but InterleavedAccessPass is more general than that.
> 
> Consider the following example:
> 
> ```
> define void @store_factorf64_4(<16 x double>* %ptr, <4 x double> %v0, <4 x double> %v1, <4 x double> %v2, <4 x double> %v3) {
>   %s0 = shufflevector <4 x double> %v0, <4 x double> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
>   %s1 = shufflevector <4 x double> %v2, <4 x double> %v3, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
>   %interleaved.vec = shufflevector <8 x double> %s0, <8 x double> %s1, <16 x i32> <i32 12, i32 8, i32 4, i32 0, i32 13, i32 9, i32 5, i32 1, i32 14, i32 10, i32 6, i32 2, i32 15, i32 11, i32 7, i32 3>
>   store <16 x double> %interleaved.vec, <16 x double>* %ptr, align 16
>   ret void
> }
> ```
> This is identical to your store_factorf64_4 test case except that I've changed the mask for %interleaved.vec. With the current implementation, this example gets optimized and is compiled to the exact same code as your original test case, which is obviously incorrect behavior.
> 
Thanks Dave for pointing this out.

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:198
+  Type *ShuffleEltTy = ShuffleTy->getVectorElementType();
+  VectorType *SubVecTy = VectorType::get(ShuffleEltTy, Factor);

----------------
DavidKreitzer wrote:
> I think it is incorrect to be using Factor here.  SubVecTy is not <Factor x ShuffleEltTy> but rather <(Wide Vector # Elems / Factor) x ShuffleEltTy>. This only works because both computations evaluate to 4 in the cases you currently support.
> 
> To illustrate what I mean, suppose instead we were optimizing 4 interleaved i8 vectors with a wide vector type of <128 x i8>. In this case, the SubVecTy should be <32 x i8>, but your current code will compute it as <4 x i8>.
These functions were making some assumptions. I agree that they should not have done so.

https://reviews.llvm.org/D32658