[PATCH] [x86] Begin a significant overhaul of how vector lowering is done in the x86 backend.

Fri Jun 20 12:55:24 PDT 2014

This is fantastic. Some comments inline. Mostly of the shed color choice variety that you can ignore freely if you prefer.

I really like the overall direction of this patch. It takes a very systematic approach to breaking down the problem.

In the commit message, you mention needing a later peephole to combine some combinations of instructions. Would it make sense to call some of those sequences out explicitly in the code comments as well? Since this code is going to be relying on other things in order to get performant results, I'd like to see that called out so that anyone later reading this code can see where else they should look to fully understand how things are happening.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6761
@@ +6760,3 @@
+// This is a new code path for lowering vector shuffles on x86. It is designed
+// to handle arbitrary vector shuffles and blends, gracefully degrading
+// performance as necessary. It works hard to recognize idiomatic shuffles and
----------------
Style nit. I prefer not to refer to things as "new" in code comments, as the comments tend to live on that way long after the new code becomes the old code.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6769
@@ +6768,3 @@
+  for (int i = 0, Size = Mask.size(); i < Size; ++i)
+    if (Mask[i] != -1 && Mask[i] != i)
+      return false;
----------------
You comment on this below, but IMO it's worth calling out explicitly here that UNDEF masks are required to be implemented as preserving the input lane and thus are considered a NOP.

This and isSingleInputShuffleMask() should probably be static methods of ShuffleVectorSDNode instead. That would be similar to how isSplatMask() is implemented (side note, that should probably switch to use an ArrayRef instead of an EVT). Then add simple non-static wrappers isNoopShuffle() and isSingleInputShuff() to the class which call down to the static versions using the instance's mask.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6781
@@ +6780,3 @@
+
+// NB: We rely heavily on "undef" masks preserving the input lane.
+static SDValue getV4ShuffleImmForMask(ArrayRef<int> Mask, SelectionDAG &DAG) {
----------------
"NB?"

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6803
@@ +6802,3 @@
+  ArrayRef<int> Mask = SVOp->getMask();
+  assert(Mask.size() == 2 && "Unexpected mask size for v2 shuffle!");
+
----------------
Likewise, it may be useful to expose a direct method on ShuffleVectorSDNode for getting the lane count.

int getNumElements() const { return getValueType(0).getVectorNumElements(); }

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6805
@@ +6804,3 @@
+
+  if (isSingleInputShuffleMask(Mask)) {
+    // Straight shuffle of a single input vector. Simulate this by passing
----------------
Example of the above. This would be a bit clearer if written as SVOp->isSingleInput().

Ditto on similar constructs later in the patch.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6807
@@ +6806,3 @@
+    // Straight shuffle of a single input vector. Simulate this by passing
+    // sharing the input and destination.
+    unsigned SHUFPDMask = (Mask[0] & 1u) | ((Mask[1] & 1u) << 1u);
----------------
does not parse: "by passing sharing the"?

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6813
@@ +6812,3 @@
+  assert(Mask[0] >= 0 && Mask[0] < 2);
+  assert(Mask[1] >= 2);
+
----------------
Comment strings on the asserts.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6831
@@ +6830,3 @@
+    // onward this has a single fast instruction with no scary immediates. But
+    // we have to map the mask as it is actually a v4i32 shuffle instruction.
+    V1 = DAG.getNode(ISD::BITCAST, DL, MVT::v4i32, V1);
----------------
s/But we have to map the mask as it/We have to map the mask, as it/

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6859
@@ +6858,3 @@
+  SDValue LowV = V1, HighV = V2;
+  int NewMask[4] = {Mask[0], Mask[1], Mask[2], Mask[3]};
+
----------------
space after "{" and before "}"

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6864
@@ +6863,3 @@
+
+  if (NumV2Elements == 0)
+    // Straight shuffle of a single input vector. We pass the input vector to
----------------
Why not use the helper function you defined above to check for the shuffle being single input?

I guess because you want to use the element count directly in the following code?

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6873
@@ +6872,3 @@
+    while (Mask[V2Index] < 4)
+      ++V2Index;
+    int V2AdjIndex = V2Index ^ 1;
----------------
std::find_if instead?

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6874
@@ +6873,3 @@
+      ++V2Index;
+    int V2AdjIndex = V2Index ^ 1;
+
----------------
This is a neat trick for getting the other lane of the high half, but it isn't immediately obvious that's what it's doing. Either a ?: to make the selection explicit or a comment explaining the bitwise trick would be good.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6886
@@ +6885,3 @@
+      // blended with a V1 element. Regardless of how many other V1 elements we
+      // have, this will always require a two-step blend.
+      int V1Index = V2AdjIndex;
----------------
Big ask, but I'd love some ASCII art type examples of the transforms, in particular for those like this one that need more than one instruction. Not the complete set of possibilities; just a single example that's illustrative showing arrows for which lanes get moved where by which insn.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6917
@@ +6916,3 @@
+                          (Mask[0] >= 4 ? Mask[0] : Mask[1]) - 4,
+                          (Mask[2] >= 4 ? Mask[2] : Mask[3]) - 4};
+      V1 = DAG.getNode(X86ISD::SHUFP, DL, MVT::v4f32, V1, V2,
----------------
Spacing for "{" "}".

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6965
@@ +6964,3 @@
+    SDLoc DL, SDValue V, MutableArrayRef<int> Mask,
+    const X86Subtarget *Subtarget, SelectionDAG &DAG) {
+  MutableArrayRef<int> LoMask = Mask.slice(0, 4);
----------------
This is a really tricky part of the patch. Can you add a comment describing the approach and algorithm to the function? Without a lot of very explicit exposition, I'm concerned about future "improvements" to this code path inadvertently breaking things.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6996
@@ +6995,3 @@
+  // with <=2 inputs to each half in each half. Once there, we can fall through
+  // to the generic code below.
+  auto balanceSides = [&](ArrayRef<int> ThreeInputs, int OneInput,
----------------
An example in the comment would really help here.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:7219
@@ +7218,3 @@
+///
+/// This essentially test whether viewing the mask as an interleaving of two
+/// sub-sequences reduces the cross-input traffic of a blend operation. If so,
----------------
s/test/tests/

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:7254
@@ +7253,3 @@
+
+static SDValue lowerV8I16BasicBlendVectorShuffle(SDLoc DL, SDValue V1,
+                                                 SDValue V2,
----------------
Same thing as above. Function comment describing the approach will go a long way.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:7579
@@ +7578,3 @@
+
+  // For integer vector shuffles, try to collapse try to collapse them into
+  // a shuffle of fewer lanes but wider integers. We cap this to not form
----------------
Only need to "try to collapse" once. ;)

http://reviews.llvm.org/D4225