[PATCH] D41794: [X86] Improve AVX1 shuffle lowering for v8f32 shuffles where the low half comes from V1 and the high half comes from V2 and the halves do the same operation

Simon Pilgrim via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Jul 31 04:42:19 PDT 2018


RKSimon added a comment.

Some minor comments, and I'm a bit concerned about the regression in prefer-avx256-mask-shuffle.ll - not sure why it regresses so much.



================
Comment at: lib/Target/X86/X86ISelLowering.cpp:13224
 /// in x86 only floating point has interesting non-repeating shuffles, and even
 /// those are still *marginally* more expensive.
 static SDValue lowerVectorShuffleByMerging128BitLanes(
----------------
This comment + FIXME need updating


================
Comment at: lib/Target/X86/X86ISelLowering.cpp:13269
 
-  // See if we can build a hypothetical 128-bit lane-fixing shuffle mask. Also
-  // check whether the in-128-bit lane shuffles share a repeating pattern.
-  SmallVector<int, 4> Lanes((unsigned)NumLanes, -1);
-  SmallVector<int, 4> InLaneMask((unsigned)LaneSize, -1);
-  for (int i = 0; i < Size; ++i) {
-    if (Mask[i] < 0)
+      InLaneMask[i] = (M % LaneSize) + Src * Size;
+    }
----------------
This seems easier to grok (at least to me):
```
int Src;
if (Srcs[0] < 0 || Srcs[0] == LaneSrc)
  Src = 0;
else if (Srcs[1] < 0 || Srcs[1] == LaneSrc)
  Src = 1;
else
  return SDValue();

Srcs[Src] = LaneSrc;
InLaneMask[i] = (M % LaneSize) + Src * Size;
```
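To show how that suggested control flow might sit inside the loop over the mask, here's a minimal standalone sketch in plain C++ (std::vector instead of SmallVector, and the helper name `assignLaneSources` plus the surrounding loop structure are assumptions for illustration, not the patch's actual code):
```
#include <vector>

// Illustrative sketch only: mask elements in [0, 2*Size) index the
// concatenation of both inputs, negative elements are "undef", and
// LaneSize is the number of elements per 128-bit lane. For each defined
// element, assign the 128-bit lane it reads from to one of at most two
// source slots and rewrite it as an in-lane index tagged with that slot.
static bool assignLaneSources(const std::vector<int> &Mask, int LaneSize,
                              std::vector<int> &InLaneMask) {
  int Size = (int)Mask.size();
  int Srcs[2] = {-1, -1};
  InLaneMask.assign(Size, -1);
  for (int i = 0; i != Size; ++i) {
    int M = Mask[i];
    if (M < 0)
      continue;
    int LaneSrc = M / LaneSize; // which 128-bit lane this element comes from
    int Src;
    if (Srcs[0] < 0 || Srcs[0] == LaneSrc)
      Src = 0;          // reuse or claim the first slot
    else if (Srcs[1] < 0 || Srcs[1] == LaneSrc)
      Src = 1;          // reuse or claim the second slot
    else
      return false;     // a third distinct lane is needed: give up
    Srcs[Src] = LaneSrc;
    InLaneMask[i] = (M % LaneSize) + Src * Size;
  }
  return true;
}
```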


================
Comment at: lib/Target/X86/X86ISelLowering.cpp:13281
+      assert(M1.size() == M2.size() && "Unexpected mask size");
+      for (int i = 0; i != (int)M1.size(); ++i)
+        if (M1[i] >= 0 && M2[i] >= 0 && M1[i] != M2[i])
----------------
for (int i = 0, e = M1.size(); i != e; ++i)
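For reference, the check this nit touches boils down to: two masks are compatible if they never disagree on a defined element. A standalone sketch using the suggested loop idiom (the helper name and std::vector types are assumptions, not the patch's code):
```
#include <cassert>
#include <vector>

// Two shuffle masks are compatible if, wherever both define an element,
// they agree on it (undef elements, encoded as -1, match anything).
static bool masksAreCompatible(const std::vector<int> &M1,
                               const std::vector<int> &M2) {
  assert(M1.size() == M2.size() && "Unexpected mask size");
  for (int i = 0, e = (int)M1.size(); i != e; ++i)
    if (M1[i] >= 0 && M2[i] >= 0 && M1[i] != M2[i])
      return false;
  return true;
}
```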


================
Comment at: lib/Target/X86/X86ISelLowering.cpp:13289
+      assert(Mask.size() == MergedMask.size() && "Unexpected mask size");
+      for (int i = 0; i != (int)MergedMask.size(); ++i) {
+        int M = Mask[i];
----------------
for (int i = 0, e = MergedMask.size(); i != e; ++i) {
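Similarly, the merge step here can be sketched as copying every defined element of Mask over the corresponding slot of MergedMask, assuming the caller has already checked compatibility (helper name and types again assumed, not taken from the patch):
```
#include <cassert>
#include <vector>

// Merge the defined elements of Mask into MergedMask in place; undef (-1)
// elements of Mask leave the existing MergedMask entry untouched.
static void mergeMask(const std::vector<int> &Mask,
                      std::vector<int> &MergedMask) {
  assert(Mask.size() == MergedMask.size() && "Unexpected mask size");
  for (int i = 0, e = (int)MergedMask.size(); i != e; ++i) {
    int M = Mask[i];
    if (M >= 0)
      MergedMask[i] = M;
  }
}
```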


================
Comment at: test/CodeGen/X86/prefer-avx256-mask-shuffle.ll:210
 ; AVX256VLBW-NEXT:    kmovd %eax, %k1
-; AVX256VLBW-NEXT:    vpshufb {{.*#+}} ymm0 {%k1} = ymm1[u,u,6,u,u,u,u,u,u,u,u,u,u,5,u,u,19,22,u,28,19,23,23,16,19,22,17,29,19,u,23,16]
+; AVX256VLBW-NEXT:    vmovdqu8 %ymm2, %ymm0 {%k1}
 ; AVX256VLBW-NEXT:    vpmovb2m %ymm0, %k0
----------------
This is the only notable regression - any idea why it breaks so badly?


https://reviews.llvm.org/D41794




