[PATCH] D32658: Supports lowerInterleavedStore() in X86InterleavedAccess.

Fri May 26 08:49:02 PDT 2017

DavidKreitzer added inline comments.

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:202
+    // Try to generate target-sized register(/instruction).
+    decompose(Inst, Factor, ShuffleTy, DecomposedVectors);
+
----------------
good catch!

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:218
+  Type *ShuffleEltTy = ShuffleTy->getVectorElementType();
+  unsigned NumSubVecElems = Shuffles[0]->getType()->getVectorNumElements() / Factor;
+
----------------
Use ShuffleTy instead of Shuffles[0]->getType()

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:115

   return true;
 }
----------------
Farhana wrote:
> DavidKreitzer wrote:
> > This routine is making some assumptions about the nature of the re-interleave ShuffleVectorInst. The optimized code sequence that you are emitting is only correct for a shuffle mask that is exactly {0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15}. That is the common case, but InterleavedAccessPass is more general than that.
> > 
> > Consider the following example:
> > 
> > ```
> > define void @store_factorf64_4(<16 x double>* %ptr, <4 x double> %v0, <4 x double> %v1, <4 x double> %v2, <4 x double> %v3) {
> >   %s0 = shufflevector <4 x double> %v0, <4 x double> %v1, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
> >   %s1 = shufflevector <4 x double> %v2, <4 x double> %v3, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
> >   %interleaved.vec = shufflevector <8 x double> %s0, <8 x double> %s1, <16 x i32> <i32 12, i32 8, i32 4, i32 0, i32 13, i32 9, i32 5, i32 1, i32 14, i32 10, i32 6, i32 2, i32 15, i32 11, i32 7, i32 3>
> >   store <16 x double> %interleaved.vec, <16 x double>* %ptr, align 16
> >   ret void
> > }
> > ```
> > This is identical to your store_factorf64_4 test case except that I've changed the mask for %interleaved.vec. With the current implementation, this example gets optimized and is compiled to the exact same code as your original test case, which is obviously incorrect behavior.
> > 
> Thanks Dave for pointing this out.
I like the way you fixed this, thanks!

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:270
+  for (unsigned i = 0; i < Factor; i++)
+    Indices.push_back(i);
+
----------------
DavidKreitzer wrote:
> This array of indices is meaningless for the store optimization, right? Why not simply pass an empty ArrayRef to the X86InterleavedAccessGroup constructor?
You didn't address this comment. Maybe the best change here would be to use "Indices.push_back(Mask[i]);" That would be consistent with the change you made at line 141:
```
      DecomposedVectors.push_back(
          cast<ShuffleVectorInst>(Builder.CreateShuffleVector(
              Op0, Op1, createSequentialMask(Builder, Mask[i],
                                             SubVecTy->getVectorNumElements(), 0))));
```

================
Comment at: test/CodeGen/X86/x86-interleaved-access.ll:143
+
+define void @store_factorf64_4(<16 x double>* %ptr, <4 x double> %v0, <4 x double> %v1, <4 x double> %v2, <4 x double> %v3) {
+; AVX-LABEL: store_factorf64_4:
----------------
RKSimon wrote:
> Please add these new tests to trunk now with the current codegen so your patch shows the diff.
This a good suggestion, Farhana. Can you please do this in a separate initial patch?

https://reviews.llvm.org/D32658