[PATCH] D11518: [DAGCombiner] Convert constant AND masks to shuffle clear masks down to the byte level

Simon Pilgrim llvm-dev at redking.me.uk
Mon Jul 27 04:23:27 PDT 2015


RKSimon added a comment.

Thanks Chandler - I've replied to your comments and I'll get the 'sub-patches' done first.


================
Comment at: lib/CodeGen/SelectionDAG/DAGCombiner.cpp:12991-12994
@@ -12990,8 +12990,6 @@
 
-    for (unsigned i = 0; i != NumElts; ++i) {
-      SDValue Elt = RHS.getOperand(i);
-      if (isAllOnesConstant(Elt))
-        Indices.push_back(i);
-      else if (isNullConstant(Elt))
-        Indices.push_back(NumElts+i);
-      else
+    // Attempt to create a valid clear mask, splitting the mask into
+    // sub elements and checking to see if each is
+    // all zeros or all ones - suitable for shuffle masking.
+    auto BuildClearMask = [&](unsigned Split) {
+      unsigned NumSubElts = NumElts * Split;
----------------
chandlerc wrote:
> How annoying would it be to fully split this out into a helper function? The indentation is already reasonably significant here.
Moderately annoying - the clear mask test is the only test in XformToShuffleWithZero. Instead, I could replace the if (RHS.getOpcode() == ISD::BUILD_VECTOR) block with an if (RHS.getOpcode() != ISD::BUILD_VECTOR) early out to remove one level of indentation?
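For reference, a minimal sketch of what the split test does (a hypothetical standalone helper - the real lambda also bitcasts the inputs and accounts for endianness, which is omitted here):

  // Split each constant mask element into 'Split' sub-elements and check
  // that each sub-element is all ones (keep the source lane) or all
  // zeros (take the lane from an all-zeros vector).
  static bool buildClearMask(ArrayRef<APInt> MaskElts, unsigned Split,
                             SmallVectorImpl<int> &Indices) {
    unsigned NumElts = MaskElts.size();
    unsigned NumSubElts = NumElts * Split;
    unsigned SubBits = MaskElts[0].getBitWidth() / Split;
    for (unsigned i = 0; i != NumElts; ++i)
      for (unsigned j = 0; j != Split; ++j) {
        APInt Sub = MaskElts[i].lshr(j * SubBits).trunc(SubBits);
        if (Sub.isAllOnesValue())
          Indices.push_back(i * Split + j);              // source lane
        else if (Sub == 0)
          Indices.push_back(NumSubElts + i * Split + j); // zero lane
        else
          return false; // mixed bits - not usable as a clear mask
      }
    return true;
  }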

================
Comment at: lib/CodeGen/SelectionDAG/DAGCombiner.cpp:13118-13121
@@ -13079,2 +13117,6 @@
 
+  // Try to convert a constant mask AND into a shuffle clear mask.
+  if (SDValue Shuffle = XformToShuffleWithZero(N))
+    return Shuffle;
+
   // Type legalization might introduce new shuffles in the DAG.
----------------
chandlerc wrote:
> Is there a particular reason to need to adjust this in this way?
We carry on constant folding binops - in this case (AND c1, c2) - a lot later into the lowering stages than we do constant folding of bitcasts and shuffles, so a number of tests started failing as constant ANDs had been converted to bitcasted shuffles, resulting in a lot of unnecessary code.
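Roughly, the ordering this relies on (a simplified, hypothetical outline of visitAND, not the actual code):

  SDValue DAGCombiner::visitAND(SDNode *N) {
    // Fold (and c1, c2) -> c1 & c2 and the other cheap combines first,
    // so constant ANDs never reach the shuffle transform below.
    // ...

    // Try to convert a constant mask AND into a shuffle clear mask.
    if (SDValue Shuffle = XformToShuffleWithZero(N))
      return Shuffle;

    // Remaining AND combines follow, including the shuffles that type
    // legalization might introduce.
    // ...
    return SDValue();
  }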

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:6884-6885
@@ +6883,4 @@
+
+    if (SDValue Masked = lowerVectorShuffleAsBitMask(DL, VT, V1, V2, Mask, DAG))
+      return Masked;
+
----------------
chandlerc wrote:
> Please send a separate review with just this (the bit mask lowering) change?
> 
> It seems likely it should be testable independently.
Easily done (and understandable...) - I'll do an NFC commit beforehand to move the unaltered function so that the diff for the sub-patch is as clear as possible.
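For anyone following along, the gist of the bit mask lowering (a simplified single-input sketch under a hypothetical name, not the actual lowerVectorShuffleAsBitMask implementation):

  // If every output lane either keeps its input lane in place or is
  // zeroable, the shuffle can be lowered as an AND with a constant mask.
  static SDValue lowerShuffleAsBitMaskSketch(SDLoc DL, MVT VT, SDValue V,
                                             ArrayRef<int> Mask,
                                             SelectionDAG &DAG) {
    MVT EltVT = VT.getVectorElementType();
    SDValue Zero = DAG.getConstant(0, DL, EltVT);
    SDValue AllOnes = DAG.getConstant(-1, DL, EltVT);
    SmallVector<SDValue, 16> MaskOps;
    for (int i = 0, Size = Mask.size(); i != Size; ++i) {
      if (Mask[i] < 0)
        MaskOps.push_back(Zero);    // treat as zeroable for simplicity
      else if (Mask[i] == i)
        MaskOps.push_back(AllOnes); // lane kept in place
      else
        return SDValue();           // lane moves - needs a real shuffle
    }
    SDValue BitMask = DAG.getNode(ISD::BUILD_VECTOR, DL, VT, MaskOps);
    return DAG.getNode(ISD::AND, DL, VT, V, BitMask);
  }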

================
Comment at: test/CodeGen/X86/sse3.ll:272-275
@@ -271,4 +271,6 @@
 ; X64:       ## BB#0: ## %entry
-; X64-NEXT:    movddup {{.*#+}} xmm0 = mem[0,0]
-; X64-NEXT:    andpd {{.*}}(%rip), %xmm0
+; X64-NEXT:    movaps (%rax), %xmm0
+; X64-NEXT:    unpcklps {{.*#+}} xmm0 = xmm0[0,0,1,1]
+; X64-NEXT:    pxor %xmm1, %xmm1
+; X64-NEXT:    punpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
 ; X64-NEXT:    retq
----------------
chandlerc wrote:
> This seems like a pretty bad regression. Any hope of recovering it?
Yes - it requires DAGCombiner::visitVECTOR_SHUFFLE to be able to peek through bitcasts on both shuffle inputs; so far it only handles the single-input case. That should fix both the domain crossing and the number of instructions. I've moved it up my todo list.
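Sketching the missing piece (a hypothetical mask-rescaling helper; undef lanes omitted for brevity): shuffle (bitcast X), (bitcast Y), M could become bitcast (shuffle X, Y, M') whenever M rescales cleanly to the source element width.

  // Rescale a shuffle mask from narrow lanes to wide lanes, where Scale
  // narrow lanes make up one wide lane. Every aligned group of Scale
  // entries must select one consecutive, aligned run of narrow lanes.
  // Second-input indices (>= NumNarrowElts) divide down consistently,
  // so the same check covers the two-input case.
  static bool scaleShuffleMask(ArrayRef<int> Mask, int Scale,
                               SmallVectorImpl<int> &ScaledMask) {
    for (int i = 0, e = Mask.size(); i != e; i += Scale) {
      if (Mask[i] % Scale != 0)
        return false;
      for (int j = 1; j != Scale; ++j)
        if (Mask[i + j] != Mask[i] + j)
          return false;
      ScaledMask.push_back(Mask[i] / Scale);
    }
    return true;
  }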

================
Comment at: test/CodeGen/X86/vec_cast2.ll:49-54
@@ -48,6 +48,8 @@
 ; CHECK:       ## BB#0:
-; CHECK-NEXT:    vpmovzxwd {{.*#+}} xmm1 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
-; CHECK-NEXT:    vpunpckhwd {{.*#+}} xmm0 = xmm0[4,4,5,5,6,6,7,7]
-; CHECK-NEXT:    vinsertf128 $1, %xmm0, %ymm1, %ymm0
-; CHECK-NEXT:    vandps LCPI2_0, %ymm0, %ymm0
+; CHECK-NEXT:    vpunpckhwd {{.*#+}} xmm1 = xmm0[4,4,5,5,6,6,7,7]
+; CHECK-NEXT:    vmovdqa {{.*#+}} xmm2 = [255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0]
+; CHECK-NEXT:    vpand %xmm2, %xmm1, %xmm1
+; CHECK-NEXT:    vpmovzxwd {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
+; CHECK-NEXT:    vpand %xmm2, %xmm0, %xmm0
+; CHECK-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
 ; CHECK-NEXT:    vcvtdq2ps %ymm0, %ymm0
----------------
chandlerc wrote:
> Same question as above...
This is the zero-extend issue that I mentioned in the summary. Yak shaving......

================
Comment at: test/CodeGen/X86/vec_int_to_fp.ll:1762-1765
@@ -1771,2 +1761,6 @@
 ; AVX1-NEXT:    vpunpckhwd {{.*#+}} xmm1 = xmm1[4,4,5,5,6,6,7,7]
+; AVX1-NEXT:    vmovdqa {{.*#+}} xmm2 = [255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0]
+; AVX1-NEXT:    vpand %xmm2, %xmm1, %xmm1
+; AVX1-NEXT:    vpmovzxbd {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero
+; AVX1-NEXT:    vpand %xmm2, %xmm0, %xmm0
 ; AVX1-NEXT:    vinsertf128 $1, %xmm1, %ymm0, %ymm0
----------------
chandlerc wrote:
> and here, wow!
This is the zero-extend issue that I mentioned in the summary.


Repository:
  rL LLVM

http://reviews.llvm.org/D11518