[PATCH] D23446: [X86] Enable setcc to srl(ctlz) transformation on btver2 architectures.

Wed Aug 17 11:01:07 PDT 2016

pgousseau added a comment.

Yes sounds like I should disable this if optForSize is enabled.

For example in openssl, out of 89 srl(lzcnt) optimisations, around 30 cases cause larger code, for example I see 25 of those:

  3055:	31 ed                	xor    %ebp,%ebp
  3057:	ff c2                	inc    %edx
  3059:	40 0f 94 c5          	sete   %bpl
  305d:	01 e9                	add    %ebp,%ecx

-> 10 bytes

  309d:	ff c2                	inc    %edx
  309f:	f3 0f bd ea          	lzcnt  %edx,%ebp
  30a3:	c1 ed 05             	shr    $0x5,%ebp
  30a6:	01 e9                	add    %ebp,%ecx

-> 12 bytes

But this should not affect performances.

Luckily openssl's libcrypto total size remains smaller as the other remaining matches result in fewer and as fast instructions.


================
Comment at: lib/CodeGen/SelectionDAG/TargetLowering.cpp:3566-3567
@@ -3565,3 +3565,4 @@
 
-SDValue TargetLowering::lowerCmpEqZeroToCtlzSrl(SDValue Op,
-                                                SelectionDAG &DAG) const {
+llvm::SDValue
+llvm::TargetLowering::lowerCmpEqZeroToCtlzSrl(SDValue Op, EVT ExtTy,
+                                              SelectionDAG &DAG) const {
----------------
spatel wrote:
> Remove "llvm::"
Sounds good, will do.

================
Comment at: lib/Target/X86/X86.td:265-266
@@ -264,1 +264,4 @@
                        "true", "Vector SQRT is fast (disable Newton-Raphson)">;
+// On some architectures, such as AMD's Jaguar, LZCNT is fast (as in as fast as
+// equivalent instructions).
+def FeatureFastLZCNT
----------------
spatel wrote:
> The description still seems too vague. The key point is that lzcnt must have the same latency and throughput as test/set, right?
> 
> How about: "If lzcnt has equivalent latency/throughput to most simple integer ops, it can be used to replace test/set sequences."
Sounds good thanks, will do.

================
Comment at: test/CodeGen/X86/lzcnt-zext-cmp.ll:85-92
@@ +84,10 @@
+; Test 16-bit input, 16-bit output.
+define i16 @foo5(i16 %a) {
+; CHECK-LABEL: foo5:
+; CHECK:       # BB#0:
+; CHECK-NEXT:    xorl %eax, %eax
+; CHECK-NEXT:    testw %di, %di
+; CHECK-NEXT:    sete %al
+; CHECK-NEXT:    # kill: %AX<def> %AX<kill> %EAX<kill>
+; CHECK-NEXT:    retq
+;
----------------
spatel wrote:
> lzcnt has a 16-bit variant in the ISA. Is there some reason not to use it here?
I disabled the 16-bit case because this seems to lead to bigger code in general, for this test case we would get:

  lzcntw  %di, %ax
  andl    $16, %eax
  shrl    $4, %eax
  retq

instead of:

  xorl    %eax, %eax
  testw   %di, %di
  sete    %al
  retq

which seems bigger code for the same result.


https://reviews.llvm.org/D23446