[PATCH] [X86] transform insertps to blendps when possible for better performance

Chandler Carruth chandlerc at gmail.com
Fri Feb 27 14:01:49 PST 2015


I'm not really sold on doing this as a target combine. Is there some reason we can't just produce the desired insertps or blendps when lowering? It doesn't seem like this pattern would only come up after some other shuffle lowering, but maybe I'm not seeing why.


================
Comment at: lib/Target/X86/X86ISelLowering.cpp:22913
@@ -22906,4 +22912,3 @@
     // countS and just gets an f32 from that address.
-    unsigned DestIndex =
-        cast<ConstantSDNode>(N->getOperand(2))->getZExtValue() >> 6;
+    auto DestIndex = Imm8 >> 6;
     
----------------
Please use an explicit type. It's short and removes questions.
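
For example, keeping the type the removed line used:

    unsigned DestIndex = Imm8 >> 6;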

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:22915
@@ -22909,3 +22914,3 @@
     
-    Ld = NarrowVectorLoadToElement(cast<LoadSDNode>(Ld), DestIndex, DAG);
+    auto Ld = NarrowVectorLoadToElement(cast<LoadSDNode>(N1), DestIndex, DAG);
 
----------------
This use of auto hurts readability IMO.
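
Spelling it out, assuming NarrowVectorLoadToElement still returns an SDValue:

    SDValue Ld = NarrowVectorLoadToElement(cast<LoadSDNode>(N1), DestIndex, DAG);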

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:22932-22935
@@ +22931,6 @@
+
+  // FIXME: If optimizing for size and there is a load folding opportunity,
+  // we should either not do this transform or we should undo it in
+  // PerformBLENDICombine. The above check for "MayFoldLoad" doesn't work
+  // because it doesn't look through a SCALAR_TO_VECTOR node.
+  
----------------
I think we need to fix this always, and I think we should just handle it here in the "may fold load" case.

The primary reason to use insertps is to fold a scalar load into the shuffle. Switching to blendps is a huge mistake there because now we have to do a scalar -> vector move first. In many cases, we may end up emitting insertps+blendps or movss+blendps, which doesn't seem like the right lowering to me.
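
A rough sketch of handling it here (untested; the helper name is mine, but MayFoldLoad and ISD::SCALAR_TO_VECTOR are the existing pieces):

    // Look through a SCALAR_TO_VECTOR node so we can see a scalar load
    // feeding the insertps source operand.
    static bool mayFoldLoadThroughScalarToVector(SDValue Op) {
      if (Op.getOpcode() == ISD::SCALAR_TO_VECTOR)
        Op = Op.getOperand(0);
      return MayFoldLoad(Op);
    }

If that returns true for the source operand, we should keep the insertps so the load can be folded.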

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:22937-22959
@@ -22920,2 +22936,25 @@
+  
+  switch (Imm8) {
+    default: return SDValue();
+
+    // The insertps immediate for the register/register variant is:
+    // Bits [7:6] - select exactly one of four 32-bit source lanes
+    // Bits [5:4] - select exactly one of four 32-bit destination lanes
+    // Bits [3:0] - zero mask bonus operation
+
+    // The blendps immediate is:
+    // Bits [3:0] - if a bit is set, copy the 32-bit source lane to the
+    // corresponding destination lane.
+
+    // To do this transform, we need the source select bits [7:6] to match
+    // the destination select bits [5:4] and zero mask bits [3:0] must be off.
+
+    case 0x00: Imm8 = 0x01; break; // copy src bits [31:0] to dest
+    case 0x50: Imm8 = 0x02; break; // copy src bits [63:32] to dest
+    case 0xA0: Imm8 = 0x04; break; // copy src bits [95:64] to dest
+    case 0xF0: Imm8 = 0x08; break; // copy src bits [127:96] to dest
+  }
+  SDValue NewMask = DAG.getConstant(Imm8, MVT::i8);
+  return DAG.getNode(X86ISD::BLENDI, dl, VT, N0, N1, NewMask);
 }
 
----------------
Can you use something like getTargetShuffleMask to handle all of this?
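
A rough sketch of what I mean (untested, and it assumes getTargetShuffleMask decodes INSERTPS nodes with the usual 0-3 / 4-7 operand-lane numbering):

    // Decode the shuffle generically instead of switching on raw
    // immediates, then derive the blendps immediate from the mask.
    SmallVector<int, 4> Mask;
    bool IsUnary;
    if (!getTargetShuffleMask(N, MVT::v4f32, Mask, IsUnary))
      return SDValue();

    unsigned BlendMask = 0;
    for (int i = 0; i != 4; ++i) {
      if (Mask[i] == i)
        continue;           // Lane i is kept from the first source.
      if (Mask[i] != i + 4)
        return SDValue();   // Zeroed or cross-lane element; blendps can't do it.
      BlendMask |= 1u << i; // Lane i is copied from the second source.
    }
    SDValue NewMask = DAG.getConstant(BlendMask, MVT::i8);
    return DAG.getNode(X86ISD::BLENDI, dl, VT, N0, N1, NewMask);

The zero-mask cases then bail out naturally instead of needing to be excluded by a switch.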

http://reviews.llvm.org/D7866
