[PATCH] D34601: [X86][LLVM]Expanding Supports lowerInterleavedStore() in X86InterleavedAccess.

Wed Jul 12 06:41:59 PDT 2017

DavidKreitzer added a comment.

Thanks for the changes, Michael, and thanks for following up on the perf issue in https://bugs.llvm.org/show_bug.cgi?id=33740.

I have a few additional comments, but this generally looks good. The simplification of the code in X86InterleavedAccessGroup::lowerIntoOptimizedSequence is great!

================
Comment at: lib/Target/X86/X86ISelLowering.h:1440
+  void createUnpackShuffleMask(MVT VT, SmallVectorImpl<T> &Mask, bool Lo,
+    bool Unary) {
+    assert(Mask.empty() && "Expected an empty shuffle mask vector");
----------------
Fix the indenting here.

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:73
                      SmallVectorImpl<Value *> &TrasposedVectors);
-
+  void transposeChar_32x4(ArrayRef<Instruction *> InputVectors,
+                          SmallVectorImpl<Value *> &TrasposedVectors);
----------------
DavidKreitzer wrote:
> "transpose" is a poor name here.  "interleave" would be better. Also, I would prefer "8bit" or "1byte" to "Char", e.g. interleave8bit_32x4.
> 
> "transpose" works for the 4x4 case (and other NxN cases), because the shuffle sequence does a matrix transpose on the input vectors, and the same code can be used for interleaving and de-interleaving. To handle the 32x8 load case, we would need a different code sequence than what you are currently generating in transposeChar_32x4. Presumably, we would use deinterleave8bit_32x4 for that.
> 
I see that you changed this to "deinterleave8bit_32x4" rather than "interleave8bit_32x4". Can you please explain why? This routine is taking 4 input vectors and merging their elements like this:

v0[0], v1[0], v2[0], v3[0], v0[1], v1[1], v2[1], v3[1], ...

Wouldn't you call that interleaving?

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:113

-  if (!Subtarget.hasAVX() || ShuffleVecSize != ExpectedShuffleVecSize ||
-      DL.getTypeSizeInBits(ShuffleEltTy) != 64 || Factor != 4)
+  // Currently 8 bit of 32 elements support store oration only.
+  if (DL.getTypeSizeInBits(ShuffleEltTy) == 8 && !isa<StoreInst>(Inst))
----------------
The old wording here was better, and you have a typo in "oration".

================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:198
+void createUnpackShuffleMaskDType(MVT VT, SmallVectorImpl<T> &Mask, bool Lo,
+                                  bool Unary) {
+  assert(Mask.empty() && "Expected an empty shuffle mask vector");
----------------
Would it be possible to simplify the implementation of this routine by just calling createUnpackShuffleMask with an i16 type plus calling scaleShuffleMask with a scale of 2? (That would require you to move scaleShuffleMask into X86ISelLowering.h like you did with createUnpackShuffleMask.)

https://reviews.llvm.org/D34601