[PATCH] D34601: [X86][LLVM] Expanding support for lowerInterleavedStore() in X86InterleavedAccess.
Farhana Aleen via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Jul 5 17:07:20 PDT 2017
Farhana added inline comments.
================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:170
+/// Generate unpacklo/unpackhi shuffle mask.
+static void createUnpackShuffleMask(int NumElts, SmallVectorImpl<uint32_t> &Mask,
+ bool Lo, bool Unary) {
----------------
Why not use the createUnpackShuffleMask(..) defined in X86ISelLowering.cpp? I would recommend declaring a template version of it in X86ISelLowering.h, which would allow both int and uint32_t masks, and getting rid of this copy in X86InterleavedAccess.cpp.
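Roughly what I have in mind (a sketch only; it assumes the body of the existing static helper in X86ISelLowering.cpp can be lifted into the header as-is):

  template <typename T>
  void createUnpackShuffleMask(MVT VT, SmallVectorImpl<T> &Mask,
                               bool Lo, bool Unary) {
    unsigned NumElts = VT.getVectorNumElements();
    // Unpack operates within 128-bit lanes.
    unsigned NumEltsInLane = 128 / VT.getScalarSizeInBits();
    for (unsigned i = 0; i < NumElts; ++i) {
      unsigned LaneStart = (i / NumEltsInLane) * NumEltsInLane;
      unsigned Pos = (i % NumEltsInLane) / 2 + LaneStart;
      Pos += (Unary ? 0 : NumElts * (i % 2));
      Pos += (Lo ? 0 : NumEltsInLane / 2);
      Mask.push_back(Pos);
    }
  }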
================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:186
+// shufle(VEC1,VEC2,{NumElement/2, NumElement/2+1, NumElement/2+2...,
+// NumElement-1, NumElement-NumElement/2,
+// NumElement-NumElement/2+1..., 2*NumElement-1})
----------------
The comment does not match the coding logic; something is off here. Maybe it should be NumElement+NumElement/2?
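(For example, with NumElement = 8 the high-half mask would be {4, 5, 6, 7, 12, 13, 14, 15}: the second run starts at NumElement + NumElement/2 = 12 in the concatenated index space, whereas NumElement - NumElement/2 = 4 would just re-select VEC1's half.)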
================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:205
+
+void X86InterleavedAccessGroup::transposeChar_32x4(
+ ArrayRef<Instruction *> Matrix,
----------------
I would think you could write a more general transpose function: one written in terms of 8 elements would scale to 16 and 32. Why is it written only for a vector length of 32?
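For instance, something along these lines (purely a sketch with made-up names -- transposeChar_Nx4 and createChunkedUnpackMask are hypothetical; it also emits plain shufflevector masks and ignores the per-128-bit-lane behavior of the real unpck instructions, so the lane-aware masks in the patch would still be needed for optimal codegen):

  // Build a mask that interleaves Gran-element chunks of two vectors.
  static void createChunkedUnpackMask(unsigned NumElts, unsigned Gran,
                                      bool Lo,
                                      SmallVectorImpl<uint32_t> &Mask) {
    unsigned Base = Lo ? 0 : NumElts / 2;
    for (unsigned C = 0; C < NumElts / (2 * Gran); ++C)
      for (unsigned Op = 0; Op < 2; ++Op)       // alternate VEC1/VEC2 chunks
        for (unsigned E = 0; E < Gran; ++E)
          Mask.push_back(Base + C * Gran + E + Op * NumElts);
  }

  // Transpose 4 rows of <N x i8> (N = 8, 16 or 32) into interleaved order.
  static void transposeChar_Nx4(ArrayRef<Value *> Rows,
                                SmallVectorImpl<Value *> &Out,
                                IRBuilder<> &Builder) {
    unsigned N = cast<VectorType>(Rows[0]->getType())->getNumElements();
    SmallVector<uint32_t, 32> BLo, BHi, PLo, PHi;
    createChunkedUnpackMask(N, 1, true, BLo);   // element unpack, lo half
    createChunkedUnpackMask(N, 1, false, BHi);  // element unpack, hi half
    createChunkedUnpackMask(N, 2, true, PLo);   // pair unpack, lo half
    createChunkedUnpackMask(N, 2, false, PHi);  // pair unpack, hi half
    // Round 1: interleave rows 0/1 and 2/3 element-wise.
    Value *V01L = Builder.CreateShuffleVector(Rows[0], Rows[1], BLo);
    Value *V01H = Builder.CreateShuffleVector(Rows[0], Rows[1], BHi);
    Value *V23L = Builder.CreateShuffleVector(Rows[2], Rows[3], BLo);
    Value *V23H = Builder.CreateShuffleVector(Rows[2], Rows[3], BHi);
    // Round 2: interleave the pairs, yielding fully transposed vectors.
    Out.push_back(Builder.CreateShuffleVector(V01L, V23L, PLo));
    Out.push_back(Builder.CreateShuffleVector(V01L, V23L, PHi));
    Out.push_back(Builder.CreateShuffleVector(V01H, V23H, PLo));
    Out.push_back(Builder.CreateShuffleVector(V01H, V23H, PHi));
  }

The masks scale with N, so the same code covers the 8-, 16- and 32-element cases.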
================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:384
+ // VecInst contains the Ptr argument.
+ Value *VecInst = Inst->getOperand(1);
+ Type *IntOf16 = Type::getInt16Ty(Shuffles[0]->getContext());
----------------
Value *VecInst = SI->getPointerOperand();
================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:384
+ // VecInst contains the Ptr argument.
+ Value *VecInst = Inst->getOperand(1);
+ Type *IntOf16 = Type::getInt16Ty(Shuffles[0]->getContext());
----------------
Farhana wrote:
> Value *VecInst = SI->getPointerOperand()
What is the reason for generating four stores instead of 1 wide store? I would think codegen is smart enough to generate the expected 4 optimized stores from it. Can you please keep it consistent with the other case, i.e., concatenate the transposed vectors and generate 1 wide store:
  StoreInst *SI = cast<StoreInst>(Inst);
  Value *BasePtr = SI->getPointerOperand();

  case 32: {
    transposeChar_32x4(DecomposedVectors, TransposedVectors);
    // VecInst contains the Ptr argument.
    Value *VecInst = SI->getPointerOperand();
    Type *IntOf16 = Type::getInt16Ty(Shuffles[0]->getContext());
    // From <128 x i8>* to <64 x i16>*.
    Type *VecTran = VectorType::get(IntOf16, 64)->getPointerTo();
    BasePtr = Builder.CreateBitCast(VecInst, VecTran);
  }

And move the following two statements out of the switch, replacing the per-iteration store:

  // 3. Concatenate the contiguous-vectors back into a wide vector.
  Value *WideVec = concatenateVectors(Builder, TransposedVectors);
  // 4. Generate a store instruction for wide-vec.
  Builder.CreateAlignedStore(WideVec, BasePtr, SI->getAlignment());
================
Comment at: lib/Target/X86/X86InterleavedAccess.cpp:389
+ Value *VecBasePtr = Builder.CreateBitCast(VecInst, VecTran);
+ for (unsigned i = 0; i < 4; i++) {
+ Value *NewBasePtr = Builder.CreateGEP(VecBasePtr, Builder.getInt32(i));
----------------
Use ++i; the LLVM coding standards prefer pre-increment.
https://reviews.llvm.org/D34601