[PATCH] [x86] Begin a significant overhaul of how vector lowering is done in the x86 backend.

Fri Jun 20 18:49:44 PDT 2014

Bike-sheddy comments addressed below. ;] I seem to like a somewhat different shade than you. Lemme know if its too different.

To the combine stuff, I'm not sure where would be a great place to put it. The point of the combines being needed is that there *isn't* one place where we make a potentially bad decision, but that two distant decisions could (in theory) collude to save an operation. Hard to comment that. The test cases should cover it very well though...

Will attach an updated patch momentarily...

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6761
@@ +6760,3 @@
+// This is a new code path for lowering vector shuffles on x86. It is designed
+// to handle arbitrary vector shuffles and blends, gracefully degrading
+// performance as necessary. It works hard to recognize idiomatic shuffles and
----------------
Jim Grosbach wrote:
> Style nit. I prefer not to refer to things as "new" in code comments, as the comments tend to live on that way long after the new code becomes the old code.
OK, "experimental" maybe?

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6769
@@ +6768,3 @@
+  for (int i = 0, Size = Mask.size(); i < Size; ++i)
+    if (Mask[i] != -1 && Mask[i] != i)
+      return false;
----------------
Jim Grosbach wrote:
> You comment on this below, but IMO it's worth calling out explicitly here that UNDEF masks are required to be implemented as preserving the input lane and thus are considered a NOP.
> 
> This and isSingleInputShuffleMask() should probably be static methods of ShuffleVectorSDNode instead. That would be similar to how isSplatMask() is implemented (side note, that should probably switch to use an ArrayRef instead of an EVT). Then add simple non-static wrappers isNoopShuffle() and isSingleInputShuff() to the class which call down to the static versions using the instance's mask.
Regarding the undef mask contract below, it doesn't impact this at all. An undef lane, by its nature, is a NOP?

I really dislike static methods for these at the moment. These have nothing to do with an SDNode and everything to do with the specific ways in which masks are used in this lowering code. I'm hesitant to over-generalize here.

Also, static methods are horrible to actually use.

If there is useful common logic to refactor into common locations, I'd rather do it if and when it is actually needed to implement common functionality. As it stands, it wouldn't save anything?

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6781
@@ +6780,3 @@
+
+// NB: We rely heavily on "undef" masks preserving the input lane.
+static SDValue getV4ShuffleImmForMask(ArrayRef<int> Mask, SelectionDAG &DAG) {
----------------
David Majnemer wrote:
> Jim Grosbach wrote:
> > "NB?"
> It is probably an abbreviation for [[ http://en.wikipedia.org/wiki/Nota_bene | nota bene ]]
Correct. I don't care particularly about comment style; I'll probably clean this up along with adding the comment above.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6803
@@ +6802,3 @@
+  ArrayRef<int> Mask = SVOp->getMask();
+  assert(Mask.size() == 2 && "Unexpected mask size for v2 shuffle!");
+
----------------
Jim Grosbach wrote:
> Likewise, it may be useful to expose a direct method on ShuffleVectorSDNode for getting the lane count.
> 
> int getNumElements() const { return getValueType(0).getVectorNumElements(); }
I don't know that this makes the code any better. By directly asserting properties of the Mask, to me, it makes the following code that assumes the properties on the mask much more clear.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6805
@@ +6804,3 @@
+
+  if (isSingleInputShuffleMask(Mask)) {
+    // Straight shuffle of a single input vector. Simulate this by passing
----------------
Jim Grosbach wrote:
> Example of the above. This would be a bit clearer if written as SVOp->isSingleInput().
> 
> Ditto on similar constructs later in the patch.
I think this method might make sense on SVOp as it really isn't used anywhere on raw / hypothetical masks.

But at the same time, it was actually rather convenient to write all of this code in terms of the mask. I don't find the ShuffleVectorSDNode to add any utility. :: shrug ::

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6807
@@ +6806,3 @@
+    // Straight shuffle of a single input vector. Simulate this by passing
+    // sharing the input and destination.
+    unsigned SHUFPDMask = (Mask[0] & 1u) | ((Mask[1] & 1u) << 1u);
----------------
Jim Grosbach wrote:
> does not parse: "by passing sharing the"?
English is hard. Will reword.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6813
@@ +6812,3 @@
+  assert(Mask[0] >= 0 && Mask[0] < 2);
+  assert(Mask[1] >= 2);
+
----------------
Jim Grosbach wrote:
> Comment strings on the asserts.
When there is a generally useful bit of information that is being asserted, I try to do so. However, these asserts are just documenting the mathematical invariants established by the above condition. There isn't any useful comment to give here IMO.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6831
@@ +6830,3 @@
+    // onward this has a single fast instruction with no scary immediates. But
+    // we have to map the mask as it is actually a v4i32 shuffle instruction.
+    V1 = DAG.getNode(ISD::BITCAST, DL, MVT::v4i32, V1);
----------------
Jim Grosbach wrote:
> s/But we have to map the mask as it/We have to map the mask, as it/
Done.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6859
@@ +6858,3 @@
+  SDValue LowV = V1, HighV = V2;
+  int NewMask[4] = {Mask[0], Mask[1], Mask[2], Mask[3]};
+
----------------
Jim Grosbach wrote:
> space after "{" and before "}"
See the coding guidelines and clang-format -- this is how it formats braces specifically to interoperate cleanly with braced init lists in C++11.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6864
@@ +6863,3 @@
+
+  if (NumV2Elements == 0)
+    // Straight shuffle of a single input vector. We pass the input vector to
----------------
Jim Grosbach wrote:
> Why not use the helper function you defined above to check for the shuffle being single input?
> 
> I guess because you want to use the element count directly in the following code?
Exactly. When I have to count anyways, I may as well re-use that.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6873
@@ +6872,3 @@
+    while (Mask[V2Index] < 4)
+      ++V2Index;
+    int V2AdjIndex = V2Index ^ 1;
----------------
Jim Grosbach wrote:
> std::find_if instead?
Sure. The first time I wrote this I was annoyed by re-computing the index from find_if and switched it to a while loop, but maybe the loop is too high cost.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6874
@@ +6873,3 @@
+      ++V2Index;
+    int V2AdjIndex = V2Index ^ 1;
+
----------------
Jim Grosbach wrote:
> This is a neat trick for getting the other lane of the high half, but it isn't immediately obvious that's what it's doing. Either a ?: to make the selection explicit or a comment explaining the bitwise trick would be good.
I have a distaste for comments which merely explain what the standard says the C++ in question does.

I tried to help this by naming the variable after the "adjacent" index. Is there a better variable name that would give the context? If not I can do something else...

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6886
@@ +6885,3 @@
+      // blended with a V1 element. Regardless of how many other V1 elements we
+      // have, this will always require a two-step blend.
+      int V1Index = V2AdjIndex;
----------------
Jim Grosbach wrote:
> Big ask, but I'd love some ASCII art type examples of the transforms, in particular for those like this one that need more than one instruction. Not the complete set of possibilities; just a single example that's illustrative showing arrows for which lanes get moved where by which insn.
I'm hesitant to do this. Every time I drew an ascii art picture (and I drew several) or something on a white board, it actually messed me up because I started to silently assume that the one example I had was actually representative of the other ways to hit the same scenario.

Unless this is breaking your ability to reason about the code, I'm going to hold off investing a ton of time in crafting these...

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6917
@@ +6916,3 @@
+                          (Mask[0] >= 4 ? Mask[0] : Mask[1]) - 4,
+                          (Mask[2] >= 4 ? Mask[2] : Mask[3]) - 4};
+      V1 = DAG.getNode(X86ISD::SHUFP, DL, MVT::v4f32, V1, V2,
----------------
Jim Grosbach wrote:
> Spacing for "{" "}".
Same comment, this is the clang-format enforced style and matches the coding standards.

I'll give a citation to help: http://llvm.org/docs/CodingStandards.html#braced-initializer-lists

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6965
@@ +6964,3 @@
+    SDLoc DL, SDValue V, MutableArrayRef<int> Mask,
+    const X86Subtarget *Subtarget, SelectionDAG &DAG) {
+  MutableArrayRef<int> LoMask = Mask.slice(0, 4);
----------------
Jim Grosbach wrote:
> This is a really tricky part of the patch. Can you add a comment describing the approach and algorithm to the function? Without a lot of very explicit exposition, I'm concerned about future "improvements" to this code path inadvertently breaking things.
Yea, I didn't have that comment because there were many tens of different approaches tried before one worked. I'm also still hoping to simplify some parts of it.

Also, all the details are already documented in each branch of the hybrid approach.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6996
@@ +6995,3 @@
+  // with <=2 inputs to each half in each half. Once there, we can fall through
+  // to the generic code below.
+  auto balanceSides = [&](ArrayRef<int> ThreeInputs, int OneInput,
----------------
Jim Grosbach wrote:
> An example in the comment would really help here.
Hmm, ok. Not sure it'll actually help, but done. =] This trick took me weeks to figure out.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:7219
@@ +7218,3 @@
+///
+/// This essentially test whether viewing the mask as an interleaving of two
+/// sub-sequences reduces the cross-input traffic of a blend operation. If so,
----------------
Jim Grosbach wrote:
> s/test/tests/
Don.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:7254
@@ +7253,3 @@
+
+static SDValue lowerV8I16BasicBlendVectorShuffle(SDLoc DL, SDValue V1,
+                                                 SDValue V2,
----------------
Jim Grosbach wrote:
> Same thing as above. Function comment describing the approach will go a long way.
I've added them to most.

Note that these comments will only get more confusing as time goes on. =/ I'm not sure of a good way to document these. Each new ISA extension or trick I add will require updates. Hopefully I won't miss any. =]

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:7579
@@ +7578,3 @@
+
+  // For integer vector shuffles, try to collapse try to collapse them into
+  // a shuffle of fewer lanes but wider integers. We cap this to not form
----------------
Jim Grosbach wrote:
> Only need to "try to collapse" once. ;)
Done.

http://reviews.llvm.org/D4225