[PATCH] D31290: [x86] use PMOVMSK to replace memcmp libcalls for 16-byte equality

Thu Mar 23 13:40:59 PDT 2017

spatel added inline comments.

================
Comment at: lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp:6098
+  LHSVal = DAG.getBitcast(CmpVT, LHSVal);
+  RHSVal = DAG.getBitcast(CmpVT, RHSVal);
+
----------------
efriedma wrote:
> What's the point of performing the load in a vector type if you're going to immediately bitcast the result to an integer type?  IIRC DAGCombine will fold this away.
I actually had it loading i128 to start, but I saw 2 problems:
1. The i128 loads+bitcasts weren't converted to vector loads directly. Legalization for x86-64 split this into i64 loads, and then we had to rely on the combiner to merge the loads. At the least, I think this would be slower to compile since it caused more nodes to be created and folded. At worst, we might not put the loads back together properly and that would lead to poor code. It wasn't clear to me that I could add a generic combine to do that either since some targets might not want that.
2. It wasn't honest to use i128 loads and bypass the isTypeLegal() check. We could make the TLI hook more specialized to account for that - have it confirm that loads of a given type/size are fast, so it's truly just a memcmp hook. But given the first problem, I got scared away.

================
Comment at: test/CodeGen/X86/memcmp.ll:104
+; CHECK-NEXT:    pmovmskb %xmm1, %eax
+; CHECK-NEXT:    cmpl $65535, %eax # imm = 0xFFFF
 ; CHECK-NEXT:    setne %al
----------------
efriedma wrote:
> What's the performance of this compared to using integer registers?  (movq+xorq+movq+xorq+orq).
Hmm...didn't consider that option since movmsk has been fast for a long time and scalar always needs more ops. We'd need to separate x86-32 from x86-64 too. I'll try to get some real numbers.

https://reviews.llvm.org/D31290