[PATCH] Optimization for certain shufflevector by using insertps.

Andrea Di Biagio Andrea_DiBiagio at sn.scee.net
Thu Apr 24 07:20:42 PDT 2014


Hi filipe,

I think your patch looks good in general, except for a few things (see the comments below).

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:3933
@@ +3932,3 @@
+  // TODO: Deal with AVX's VINSERTPS
+  if (!VT.is128BitVector() || VT != MVT::v4f32)
+    return false;
----------------
I would change this into:


    if (VT != MVT::v4i32 && VT != MVT::v4f32)



================
Comment at: lib/Target/X86/X86ISelLowering.cpp:7283
@@ -7259,1 +7282,3 @@
 
+static SDValue getINSERTPS(ShuffleVectorSDNode *SVOp, SDLoc &dl,
+                           SelectionDAG &DAG, bool HasAVX) {
----------------
It probably makes sense to add a comment explaining that it is safe to call this function only when the shuffle mask is a valid INSERTPS mask (according to your new isINSERTPSMask function); otherwise, DestIndex would be computed incorrectly.
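Something along these lines would do (just a sketch; the exact wording is up to you):

```cpp
/// getINSERTPS - Lower a shuffle to X86ISD::INSERTPS. Must only be called
/// when isINSERTPSMask returned true for the shuffle mask of SVOp;
/// otherwise DestIndex below would be computed incorrectly.
static SDValue getINSERTPS(ShuffleVectorSDNode *SVOp, SDLoc &dl,
                           SelectionDAG &DAG) {
  ...
}
```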

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:7284
@@ +7283,3 @@
+static SDValue getINSERTPS(ShuffleVectorSDNode *SVOp, SDLoc &dl,
+                           SelectionDAG &DAG, bool HasAVX) {
+  // Generate an insertps instruction when inserting an f32 from memory onto a
----------------
I would remove the bool 'HasAVX' since it is not used.
That extra argument can be added in future once we decide to optimize for AVX cases too.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:7293
@@ +7292,3 @@
+  auto Mask = SVOp->getMask();
+  assert(VT == MVT::v4f32 && "unsupported vector type for insertps");
+
----------------
You should also support MVT::v4i32.
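That would mean relaxing the assertion too, e.g. (sketch):

```cpp
assert((VT == MVT::v4f32 || VT == MVT::v4i32) &&
       "unsupported vector type for insertps");
```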

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:7315-7336
@@ +7314,24 @@
+
+  if (MayFoldLoad(From)) {
+    // Trivial case, when From comes from a load and is only used by the
+    // shuffle. Make it use insertps from the vector that we need from that
+    // load.
+    SDValue Addr = From.getOperand(1);
+    SDValue NewAddr =
+        DAG.getNode(ISD::ADD, dl, Addr.getSimpleValueType(), Addr,
+                    DAG.getConstant(DestIndex * EVT.getStoreSize(),
+                                    Addr.getSimpleValueType()));
+
+    LoadSDNode *Load = cast<LoadSDNode>(From);
+    SDValue Ld = DAG.getLoad(EVT, dl, Load->getChain(), NewAddr,
+                             DAG.getMachineFunction().getMachineMemOperand(
+                                 Load->getMemOperand(), 0, EVT.getStoreSize()));
+
+    // Create this as a scalar to vector to match the instruction pattern.
+    SDValue LoadScalarToVector =
+        DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, MVT::v4f32, Ld);
+    SDValue InsertpsMask = DAG.getIntPtrConstant(DestIndex << 4);
+    return DAG.getNode(X86ISD::INSERTPS, dl, VT, To, LoadScalarToVector,
+                       InsertpsMask);
+  }
+
----------------
As Elena pointed out, it is unsafe to transform the load+insertps into an add+insertps with a memory operand: if you force that transformation, the alignment constraint would not be honored.

According to the instruction set reference, no alignment is required for the insertps instruction (unless alignment checking is enabled). If alignment checking is enabled, the check would only ensure that the address is 4-byte aligned (not 16-byte aligned).

The good thing is that, even without that if-stmt, your transformation would still improve all the test cases you added (i.e. for the load+shuffle case, you would get `movaps+insertps` instead of `movaps+shufps+shufps`).
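For reference, the `DestIndex << 4` in the hunk above matches the INSERTPS immediate layout (CountS in bits 7:6, CountD in bits 5:4, ZMask in bits 3:0). A quick sketch of the encoding (the helper name is mine, not from the patch):

```python
def insertps_imm(count_s, count_d, zmask=0):
    """Pack the INSERTPS immediate: bits 7:6 select the source element,
    bits 5:4 select the destination element, bits 3:0 are the zero mask."""
    assert 0 <= count_s < 4 and 0 <= count_d < 4 and 0 <= zmask < 16
    return (count_s << 6) | (count_d << 4) | zmask

# The patch's DestIndex << 4 corresponds to count_s = 0 (scalar_to_vector
# puts the loaded element in lane 0) and zmask = 0:
print(insertps_imm(0, 3))  # 48
```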

================
Comment at: test/CodeGen/X86/sse41.ll:252-265
@@ -251,1 +251,16 @@
 
+define <4 x float> @insertps_from_shufflevector_1(<4 x float> %a, <4 x float>* nocapture readonly %pb) {
+entry:
+  %0 = load <4 x float>* %pb, align 16
+  %vecinit6 = shufflevector <4 x float> %a, <4 x float> %0, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
+  ret <4 x float> %vecinit6
+; X32-LABEL: insertps_from_shufflevector_1:
+; X32-NOT: shufps
+; X32: insertps
+; X32: ret
+; X64-LABEL: insertps_from_shufflevector_1:
+; X64-NOT: shufps
+; X64: insertps
+; X64: ret
+}
+
----------------
Andrea Di Biagio wrote:
> Elena is right.
> Since you are matching the same code for X64 and X32, I would suggest adding a common check-prefix (example: --check-prefix=CHECK) to both RUN lines.
> 
> You can use a single CHECK-LABEL in each test function (rather than having two identical checks [X32-LABEL and X64-LABEL] for every function).
> Your new tests would only use CHECK (and not X32/X64).
> 
> I would add extra tests to verify that your lowering rule works fine with <4 x i32> vectors.
> 
Without the 'load+insertps -> add+insertps' transformation, this test would still be improved by your patch.

With your patch (excluding the if-stmt between lines 7315:7336), the generated code is better than it was before: we now get a movaps+insertps, whereas before we produced a movaps followed by two shufps instructions.

I would change the test to add explicit checks for the aligned load (movaps) and the insertps.
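For example (the register pattern and the immediate value are just what I'd expect here; please double-check them against the actual output):

```llvm
; CHECK-LABEL: insertps_from_shufflevector_1:
; CHECK-NOT: shufps
; CHECK: movaps (%{{.*}}), %[[R:xmm.]]
; CHECK: insertps $48, %[[R]], %xmm0
; CHECK: ret
```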

================
Comment at: test/CodeGen/X86/sse41.ll:252-279
@@ -251,1 +251,29 @@
 
+define <4 x float> @insertps_from_shufflevector_1(<4 x float> %a, <4 x float>* nocapture readonly %pb) {
+entry:
+  %0 = load <4 x float>* %pb, align 16
+  %vecinit6 = shufflevector <4 x float> %a, <4 x float> %0, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
+  ret <4 x float> %vecinit6
+; X32-LABEL: insertps_from_shufflevector_1:
+; X32-NOT: shufps
+; X32: insertps
+; X32: ret
+; X64-LABEL: insertps_from_shufflevector_1:
+; X64-NOT: shufps
+; X64: insertps
+; X64: ret
+}
+
+define <4 x float> @insertps_from_shufflevector_2(<4 x float> %a, <4 x float> %b) {
+entry:
+  %vecinit6 = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 0, i32 1, i32 5, i32 3>
+  ret <4 x float> %vecinit6
+; X32-LABEL: insertps_from_shufflevector_2:
+; X32: insertps
+; X32-NOT: shufps
+; X32: ret
+; X64-LABEL: insertps_from_shufflevector_2:
+; X64: insertps
+; X64-NOT: shufps
+; X64: ret
+}
----------------
Elena is right.
Since you are matching the same code for X64 and X32, I would suggest adding a common check-prefix (example: --check-prefix=CHECK) to both RUN lines.

You can use a single CHECK-LABEL in each test function (rather than having two identical checks [X32-LABEL and X64-LABEL] for every function).
Your new tests would only use CHECK (and not X32/X64).

I would add extra tests to verify that your lowering rule works fine with <4 x i32> vectors.
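That is, something like the following in the RUN lines (the triples and attrs here are only a guess; please copy them from the actual RUN lines in sse41.ll):

```llvm
; RUN: llc < %s -mtriple=i686-apple-darwin9 -mattr=sse4.1 | FileCheck %s -check-prefix=X32 -check-prefix=CHECK
; RUN: llc < %s -mtriple=x86_64-apple-darwin9 -mattr=sse4.1 | FileCheck %s -check-prefix=X64 -check-prefix=CHECK
```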


================
Comment at: test/CodeGen/X86/sse41.ll:252-279
@@ -251,1 +251,29 @@
 
+define <4 x float> @insertps_from_shufflevector_1(<4 x float> %a, <4 x float>* nocapture readonly %pb) {
+entry:
+  %0 = load <4 x float>* %pb, align 16
+  %vecinit6 = shufflevector <4 x float> %a, <4 x float> %0, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
+  ret <4 x float> %vecinit6
+; X32-LABEL: insertps_from_shufflevector_1:
+; X32-NOT: shufps
+; X32: insertps
+; X32: ret
+; X64-LABEL: insertps_from_shufflevector_1:
+; X64-NOT: shufps
+; X64: insertps
+; X64: ret
+}
+
+define <4 x float> @insertps_from_shufflevector_2(<4 x float> %a, <4 x float> %b) {
+entry:
+  %vecinit6 = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 0, i32 1, i32 5, i32 3>
+  ret <4 x float> %vecinit6
+; X32-LABEL: insertps_from_shufflevector_2:
+; X32: insertps
+; X32-NOT: shufps
+; X32: ret
+; X64-LABEL: insertps_from_shufflevector_2:
+; X64: insertps
+; X64-NOT: shufps
+; X64: ret
+}
----------------
Andrea Di Biagio wrote:
> Andrea Di Biagio wrote:
> > Elena is right.
> > Since you are matching the same code for X64 and X32, I would suggest adding a common check-prefix (example: --check-prefix=CHECK) to both RUN lines.
> > 
> > You can use a single CHECK-LABEL in each test function (rather than having two identical checks [X32-LABEL and X64-LABEL] for every function).
> > Your new tests would only use CHECK (and not X32/X64).
> > 
> > I would add extra tests to verify that your lowering rule works fine with <4 x i32> vectors.
> > 
> Without the 'load+insertps -> add+insertps' transformation, this test would still be improved by your patch.
> 
> With your patch (excluding the if-stmt between lines 7315:7336), the generated code is better than it was before: we now get a movaps+insertps, whereas before we produced a movaps followed by two shufps instructions.
> 
> I would change the test to add explicit checks for the aligned load (movaps) and the insertps.
Your patch also improves code generation in the following case:



  define <4 x float> @foo(<4 x float> %a, float* %b) {
    %1 = load float* %b, align 4
    %2 = insertelement <4 x float> undef, float %1, i32 0
    %result = shufflevector <4 x float> %a, <4 x float> %2, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
    ret <4 x float> %result
  }
  


That is because your fix helps match the pattern defined for `def rm : SS4AI8` that selects an INSERTPS with a memory operand (see X86InstrSSE.td - multiclass SS41I_insertf32).

With your patch, the backend would produce a single insertps for the test above.  Before (i.e. without your patch), we produced the following long sequence of instructions (on X64):


    movss  (%rdi), %xmm1
    movaps %xmm0, %xmm2
    movss %xmm1, %xmm2
    shufps $36, %xmm2, %xmm0
  
I suggest adding that test as well (and another test to verify that the code is also simplified for <4 x i32> vector types).
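For instance (a sketch only, using the same typed-pointer load style as the rest of the file; the exact CHECK lines are up to you):

```llvm
define <4 x i32> @insertps_from_shufflevector_i32(<4 x i32> %a, <4 x i32>* nocapture readonly %pb) {
entry:
  %0 = load <4 x i32>* %pb, align 16
  %vecinit6 = shufflevector <4 x i32> %a, <4 x i32> %0, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
  ret <4 x i32> %vecinit6
; CHECK-LABEL: insertps_from_shufflevector_i32:
; CHECK-NOT: shufps
; CHECK: insertps
; CHECK: ret
}
```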

http://reviews.llvm.org/D3475





