[PATCH] D22814: [X86][SSE] Optimize the truncation of vector comparison results with PACKSS

Thu Jul 28 07:28:18 PDT 2016

RKSimon added inline comments.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:14150
@@ +14149,3 @@
+  // No truncation required.
+  if (SrcVT == DstVT)
+    return In;
----------------
mkuper wrote:
> Is this valid? If not, maybe assert instead?
> For IR-level truncs, we explicitly disallow equal types, and I'd be slightly surprised if we allowed them for the ISD.
> 
I used this to make the recursion calls simpler (otherwise they all need equivalent tests). I can add a comment to explain this.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:14185
@@ +14184,3 @@
+
+    // 256-bit PACKSS(ARG0, ARG1) leaves us with ((LO0,LO1),(HI0,HI1)),
+    // so we need to shuffle to get ((LO0,HI0),(LO1,HI1)).
----------------
mkuper wrote:
> Sorry, I'm still confused about this.
> 
> We just truncated, say, a v8i64 into a v8i32. The high 4 values should have ended up in the high 4 lanes of the ymm, and the low 4 values should have ended up in the 4 low lanes. Why isn't this what we want?
We're truncating a 8i64 which will be 2 YMMs of 4i64 each:

YMM0 = ((i64[0], i64[1]), (i64[2], i64[3]))
YMM1 = ((i64[4], i64[5]), (i64[6], i64[7]))

PACKSS acts on each 128-bit vector lane separately:

PACKSS(((i64[0], i64[1]), (i64[2], i64[3])), ((i64[4], i64[5]), (i64[6], i64[7])))
-->
(i32[0], i32[1], i32[4], i32[5]),(i32[2], i32[3], i32[6], i32[7])
--> PERMQ
(i32[0], i32[1], i32[2], i32[3]),(i32[4], i32[5], i32[6], i32[7])

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:14200
@@ +14199,3 @@
+  // Recursively pack lower/upper subvectors, concat result and pack again.
+  assert(SrcVT.getSizeInBits() >= 512 && "Expected 512-bit vector or greater");
+  EVT PackedVT = EVT::getVectorVT(*DAG.getContext(), PackedSVT, NumElems / 2);
----------------
mkuper wrote:
> I may have missed it, but I didn't see this is checked explicitly anywhere in the callers.
> What happens if you hit the combine version of this, pre-legalization, truncating from v32i64 to v32i8, for instance? The destination is a 256-bit type, and the source is simple.
> 
> (Also, greater -> smaller)
Near the start of the function we check that the destination value type is a 128-bit vector or more, and combineVectorCompareTruncation sanitizes the inputs. I'll beef up the tests/asserts at the start of truncateVectorCompareWithPACKSS just to be safe though.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:14280
@@ +14279,3 @@
+    unsigned Opcode = V.getOpcode();
+    return (Opcode == X86ISD::PCMPGT || Opcode == X86ISD::PCMPEQ ||
+            Opcode == X86ISD::CMPP);
----------------
mkuper wrote:
> How about VPCOM and VPCOMU?
Yes I can add those, I just hadn't made any test cases for them yet. I'd prefer to do all that in a  post-commit.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:29869
@@ +29868,3 @@
+  SDValue In = N->getOperand(0);
+  if (In.getOpcode() != ISD::SETCC || !In.getValueType().isSimple())
+    return SDValue();
----------------
delena wrote:
> It should be another function that says that all bits in each element all-ones or all-zeroes. It is not only compare. SIGN_EXTEND_IN_REG has the same effect, for example.
Yes, although it quickly gets more complicated with regard to acceptable PACKSS input sizes - which is why I simplified a lot of this code just to call PACKSS with i16 inputs (which only compares can easily conform to). I'll add a TODO comment.

Repository:
  rL LLVM

https://reviews.llvm.org/D22814