[PATCH] D34601: [X86][LLVM]Expanding Supports lowerInterleavedStore() in X86InterleavedAccess.
David Kreitzer via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Jul 6 13:50:13 PDT 2017
DavidKreitzer added a comment.
Hi Michael, sorry it took a while for me to get to this. I was on vacation last week.
================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:73
SmallVectorImpl<Value *> &TrasposedVectors);
-
+ void transposeChar_32x4(ArrayRef<Instruction *> InputVectors,
+ SmallVectorImpl<Value *> &TrasposedVectors);
----------------
"transpose" is a poor name here. "interleave" would be better. Also, I would prefer "8bit" or "1byte" to "Char", e.g. interleave8bit_32x4.
"transpose" works for the 4x4 case (and other NxN cases), because the shuffle sequence does a matrix transpose on the input vectors, and the same code can be used for interleaving and de-interleaving. To handle the 32x8 load case, we would need a different code sequence than what you are currently generating in transposeChar_32x4. Presumably, we would use deinterleave8bit_32x4 for that.
================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:74
+ void transposeChar_32x4(ArrayRef<Instruction *> InputVectors,
+ SmallVectorImpl<Value *> &TrasposedVectors);
public:
----------------
Trasposed --> Transposed. Can you fix this at line 72 as well?
================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:195
+ SmallVectorImpl<uint32_t> &Mask, bool Low) {
+ int BeginIndex = Low ? 0 : NumElement / 2;
+ int EndIndex = BeginIndex + NumElement / 2;
----------------
Perhaps change "BeginIndex" to "CurrentIndex" since you are updating it as you go.
================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:260
+
+ IntrVec1Low = Builder.CreateBitCast(
+ IntrVec1Low,
----------------
I think it is unnecessary and undesirable to do bitcasts here. It complicates both the IR and the code in X86InterleavedAccessGroup::lowerIntoOptimizedSequence, which now has to account for the interleave function returning a different type in "TransposedVectors" than the original "DecomposedVectors".
Instead, you just need to "scale" your <16 x i32> masks to <32 x i32> like this:
<a, b, ..., p> --> <a*2, a*2+1, b*2, b*2+1, ..., p*2, p*2+1>
There is an existing routine to do this scaling in X86ISelLowering.cpp, scaleShuffleMask. Also see canWidenShuffleElements, which is how the 32xi8 shuffle gets automatically converted to a 16xi16 shuffle by CodeGen, ultimately causing the desired punpcklwd/punpckhwd instructions to be generated.
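For reference, the scaling itself is trivial; here is a minimal standalone sketch of the idea (using std::vector rather than the actual scaleShuffleMask interface in X86ISelLowering.cpp):

  #include <vector>

  // Expand each index of the original mask into Scale consecutive indices,
  // e.g. <a, b, ..., p> --> <a*2, a*2+1, ..., p*2, p*2+1> for Scale == 2.
  std::vector<int> scaleMask(const std::vector<int> &Mask, int Scale) {
    std::vector<int> Scaled;
    Scaled.reserve(Mask.size() * Scale);
    for (int Idx : Mask)
      for (int I = 0; I != Scale; ++I)
        Scaled.push_back(Idx * Scale + I);
    return Scaled;
  }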
================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:384
+ // VecInst contains the Ptr argument.
+ Value *VecInst = Inst->getOperand(1);
+ Type *IntOf16 = Type::getInt16Ty(Shuffles[0]->getContext());
----------------
Farhana wrote:
> Farhana wrote:
> > Value *VecInst = SI->getPointerOperand()
> What is the reason for generating four stores instead of 1 wide store? I would think codegen is smart enough to generate the expected 4 optimized stores here. Can you please keep it consistent with the other case, i.e. concatenate the transposed vectors and generate 1 wide store.
>
> StoreInst *SI = cast<StoreInst>(Inst);
> Value *BasePtr = SI->getPointerOperand();
>
> case 32: {
> transposeChar_32x4(DecomposedVectors, TransposedVectors);
> // VecInst contains the Ptr argument.
> Value *VecInst = SI->getPointerOperand();
> Type *IntOf16 = Type::getInt16Ty(Shuffles[0]->getContext());
> // From <128xi8>* to <64xi16>*
> Type *VecTran = VectorType::get(IntOf16, 64)->getPointerTo();
> BasePtr = Builder.CreateBitCast(VecInst, VecTran);
>
> And move the following two statements out of the switch.
>
> // 3. Concatenate the contiguous-vectors back into a wide vector.
> Value *WideVec = concatenateVectors(Builder, TransposedVectors);
>
> // 4. Generate a store instruction for wide-vec.
> // replace this with:
> Builder.CreateAlignedStore(WideVec, BasePtr, SI->getAlignment());
>
+1
================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:426
+// 1. stride4 x 64bit elements with vector factor 4 on AVX
+// 2. stride4 x 8bit elements with vector factor 32 on AVX2
bool X86TargetLowering::lowerInterleavedStore(StoreInst *SI,
----------------
This comment is no longer accurate since you enabled this for AVX. It is probably okay to simply delete this comment since it is redundant with lines 105-107.
================
Comment at: test/CodeGen/X86/x86-interleaved-access.ll:199
; AVX1: # BB#0:
-; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm4 = xmm2[0],xmm3[0],xmm2[1],xmm3[1],xmm2[2],xmm3[2],xmm2[3],xmm3[3],xmm2[4],xmm3[4],xmm2[5],xmm3[5],xmm2[6],xmm3[6],xmm2[7],xmm3[7]
-; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm5 = xmm0[4],xmm4[4],xmm0[5],xmm4[5],xmm0[6],xmm4[6],xmm0[7],xmm4[7]
-; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm4 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3]
-; AVX1-NEXT: vinsertf128 $1, %xmm5, %ymm4, %ymm5
-; AVX1-NEXT: vmovaps {{.*#+}} ymm4 = [65535,0,65535,0,65535,0,65535,0,65535,0,65535,0,65535,0,65535,0]
-; AVX1-NEXT: vandnps %ymm5, %ymm4, %ymm5
-; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm6 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
-; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm7 = xmm6[4],xmm0[4],xmm6[5],xmm0[5],xmm6[6],xmm0[6],xmm6[7],xmm0[7]
-; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm6 = xmm6[0],zero,xmm6[1],zero,xmm6[2],zero,xmm6[3],zero
-; AVX1-NEXT: vinsertf128 $1, %xmm7, %ymm6, %ymm6
-; AVX1-NEXT: vandps %ymm4, %ymm6, %ymm6
-; AVX1-NEXT: vorps %ymm5, %ymm6, %ymm8
-; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm6 = xmm2[8],xmm3[8],xmm2[9],xmm3[9],xmm2[10],xmm3[10],xmm2[11],xmm3[11],xmm2[12],xmm3[12],xmm2[13],xmm3[13],xmm2[14],xmm3[14],xmm2[15],xmm3[15]
-; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm7 = xmm0[4],xmm6[4],xmm0[5],xmm6[5],xmm0[6],xmm6[6],xmm0[7],xmm6[7]
-; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm6 = xmm0[0],xmm6[0],xmm0[1],xmm6[1],xmm0[2],xmm6[2],xmm0[3],xmm6[3]
-; AVX1-NEXT: vinsertf128 $1, %xmm7, %ymm6, %ymm6
-; AVX1-NEXT: vandnps %ymm6, %ymm4, %ymm6
-; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm7 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15]
-; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm5 = xmm7[4],xmm0[4],xmm7[5],xmm0[5],xmm7[6],xmm0[6],xmm7[7],xmm0[7]
-; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm7 = xmm7[0],zero,xmm7[1],zero,xmm7[2],zero,xmm7[3],zero
-; AVX1-NEXT: vinsertf128 $1, %xmm5, %ymm7, %ymm5
-; AVX1-NEXT: vandps %ymm4, %ymm5, %ymm5
-; AVX1-NEXT: vorps %ymm6, %ymm5, %ymm9
-; AVX1-NEXT: vextractf128 $1, %ymm3, %xmm3
-; AVX1-NEXT: vextractf128 $1, %ymm2, %xmm2
+; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm9 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
+; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm5
----------------
The codegen improvements here look great!
https://reviews.llvm.org/D34601