[PATCH] D34601: [X86][LLVM]Expanding Supports lowerInterleavedStore() in X86InterleavedAccess.
David Kreitzer via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Jul 6 13:50:13 PDT 2017
DavidKreitzer added a comment.
Hi Michael, sorry it took a while for me to get to this. I was on vacation last week.
================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:73
SmallVectorImpl<Value *> &TrasposedVectors);
-
+ void transposeChar_32x4(ArrayRef<Instruction *> InputVectors,
+ SmallVectorImpl<Value *> &TrasposedVectors);
----------------
"transpose" is a poor name here. "interleave" would be better. Also, I would prefer "8bit" or "1byte" to "Char", e.g. interleave8bit_32x4.
"transpose" works for the 4x4 case (and other NxN cases), because the shuffle sequence does a matrix transpose on the input vectors, and the same code can be used for interleaving and de-interleaving. To handle the 32x8 load case, we would need a different code sequence than what you are currently generating in transposeChar_32x4. Presumably, we would use deinterleave8bit_32x4 for that.
================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:74
+ void transposeChar_32x4(ArrayRef<Instruction *> InputVectors,
+ SmallVectorImpl<Value *> &TrasposedVectors);
public:
----------------
Trasposed --> Transposed. Can you fix this at line 72 as well?
================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:195
+ SmallVectorImpl<uint32_t> &Mask, bool Low) {
+ int BeginIndex = Low ? 0 : NumElement / 2;
+ int EndIndex = BeginIndex + NumElement / 2;
----------------
Perhaps change "BeginIndex" to "CurrentIndex" since you are updating it as you go.
================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:260
+
+ IntrVec1Low = Builder.CreateBitCast(
+ IntrVec1Low,
----------------
I think it is unnecessary and undesirable to do bitcasts here. It complicates both the IR and the code in X86InterleavedAccessGroup::lowerIntoOptimizedSequence, which now has to account for the interleave function returning a different type in "TransposedVectors" than the original "DecomposedVectors".
Instead, you just need to "scale" your <16 x i32> masks to <32 x i32> like this:
<a, b, ..., p> --> <a*2, a*2+1, b*2, b*2+1, ..., p*2, p*2+1>
There is an existing routine to do this scaling in X86ISelLowering.cpp, scaleShuffleMask. Also see canWidenShuffleElements, which is how the 32xi8 shuffle gets automatically converted to a 16xi16 shuffle by CodeGen, ultimately causing the desired punpcklwd/punpckhwd instructions to be generated.
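For reference, the scaling itself is trivial; here is a minimal standalone sketch of the idea (using std::vector rather than the actual scaleShuffleMask interface in X86ISelLowering.cpp):

  #include <vector>

  // Expand each index of the original mask into Scale consecutive indices,
  // e.g. <a, b, ..., p> --> <a*2, a*2+1, ..., p*2, p*2+1> for Scale == 2.
  std::vector<int> scaleMask(const std::vector<int> &Mask, int Scale) {
    std::vector<int> Scaled;
    Scaled.reserve(Mask.size() * Scale);
    for (int Idx : Mask)
      for (int I = 0; I != Scale; ++I)
        Scaled.push_back(Idx * Scale + I);
    return Scaled;
  }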
================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:384
+ // VecInst contains the Ptr argument.
+ Value *VecInst = Inst->getOperand(1);
+ Type *IntOf16 = Type::getInt16Ty(Shuffles[0]->getContext());
----------------
Farhana wrote:
> Farhana wrote:
> > Value *VecInst = SI->getPointerOperand()
> What is the reason for generating four stores instead of 1 wide store? I would think codegen is smart enough to generate the expected 4 optimized stores here. Can you please keep it consistent with the other case, i.e. concatenate the transposed vectors and generate 1 wide store.
>
> StoreInst *SI = cast<StoreInst>(Inst);
> Value *BasePtr = SI->getPointerOperand();
>
> case 32: {
> transposeChar_32x4(DecomposedVectors, TransposedVectors);
> // VecInst contains the Ptr argument.
> Value *VecInst = SI->getPointerOperand();
> Type *IntOf16 = Type::getInt16Ty(Shuffles[0]->getContext());
> // From <128xi8>* to <64xi16>*
> Type *VecTran = VectorType::get(IntOf16, 64)->getPointerTo();
> BasePtr = Builder.CreateBitCast(VecInst, VecTran);
>
> And move the following two statements out of the switch.
>
> // 3. Concatenate the contiguous-vectors back into a wide vector.
> Value *WideVec = concatenateVectors(Builder, TransposedVectors);
>
> // 4. Generate a store instruction for wide-vec.
> // replace this with:
> Builder.CreateAlignedStore(WideVec, BasePtr, SI->getAlignment());
>
+1
================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:426
+// 1. stride4 x 64bit elements with vector factor 4 on AVX
+// 2. stride4 x 8bit elements with vector factor 32 on AVX2
bool X86TargetLowering::lowerInterleavedStore(StoreInst *SI,
----------------
This comment is no longer accurate since you enabled this for AVX. It is probably okay to simply delete this comment since it is redundant with lines 105-107.
================
Comment at: test/CodeGen/X86/x86-interleaved-access.ll:199
; AVX1: # BB#0:
-; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm4 = xmm2[0],xmm3[0],xmm2[1],xmm3[1],xmm2[2],xmm3[2],xmm2[3],xmm3[3],xmm2[4],xmm3[4],xmm2[5],xmm3[5],xmm2[6],xmm3[6],xmm2[7],xmm3[7]
-; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm5 = xmm0[4],xmm4[4],xmm0[5],xmm4[5],xmm0[6],xmm4[6],xmm0[7],xmm4[7]
-; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm4 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3]
-; AVX1-NEXT: vinsertf128 $1, %xmm5, %ymm4, %ymm5
-; AVX1-NEXT: vmovaps {{.*#+}} ymm4 = [65535,0,65535,0,65535,0,65535,0,65535,0,65535,0,65535,0,65535,0]
-; AVX1-NEXT: vandnps %ymm5, %ymm4, %ymm5
-; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm6 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
-; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm7 = xmm6[4],xmm0[4],xmm6[5],xmm0[5],xmm6[6],xmm0[6],xmm6[7],xmm0[7]
-; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm6 = xmm6[0],zero,xmm6[1],zero,xmm6[2],zero,xmm6[3],zero
-; AVX1-NEXT: vinsertf128 $1, %xmm7, %ymm6, %ymm6
-; AVX1-NEXT: vandps %ymm4, %ymm6, %ymm6
-; AVX1-NEXT: vorps %ymm5, %ymm6, %ymm8
-; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm6 = xmm2[8],xmm3[8],xmm2[9],xmm3[9],xmm2[10],xmm3[10],xmm2[11],xmm3[11],xmm2[12],xmm3[12],xmm2[13],xmm3[13],xmm2[14],xmm3[14],xmm2[15],xmm3[15]
-; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm7 = xmm0[4],xmm6[4],xmm0[5],xmm6[5],xmm0[6],xmm6[6],xmm0[7],xmm6[7]
-; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm6 = xmm0[0],xmm6[0],xmm0[1],xmm6[1],xmm0[2],xmm6[2],xmm0[3],xmm6[3]
-; AVX1-NEXT: vinsertf128 $1, %xmm7, %ymm6, %ymm6
-; AVX1-NEXT: vandnps %ymm6, %ymm4, %ymm6
-; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm7 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15]
-; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm5 = xmm7[4],xmm0[4],xmm7[5],xmm0[5],xmm7[6],xmm0[6],xmm7[7],xmm0[7]
-; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm7 = xmm7[0],zero,xmm7[1],zero,xmm7[2],zero,xmm7[3],zero
-; AVX1-NEXT: vinsertf128 $1, %xmm5, %ymm7, %ymm5
-; AVX1-NEXT: vandps %ymm4, %ymm5, %ymm5
-; AVX1-NEXT: vorps %ymm6, %ymm5, %ymm9
-; AVX1-NEXT: vextractf128 $1, %ymm3, %xmm3
-; AVX1-NEXT: vextractf128 $1, %ymm2, %xmm2
+; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm9 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
+; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm5
----------------
The codegen improvements here look great!
https://reviews.llvm.org/D34601