[PATCH] D23446: [X86] Enable setcc to srl(ctlz) transformation on btver2 architectures.

Thu Aug 18 11:34:45 PDT 2016

pgousseau added a comment.

In https://reviews.llvm.org/D23446#519626, @spatel wrote:

> In https://reviews.llvm.org/D23446#519225, @pgousseau wrote:
>
> > - Disable transform if optForSize is true
>
>
> optForSize is a very gray area: we allow speed optimizations even if they increase size if the speed vs. size trade-off is "large" for some definition of "large".
>
> Can you post the detailed perf and size differences you're seeing with this change? I don't think the size change can be that big from what you've posted: lzcnt+shr is 7 bytes; {test/inc}+set is 5/6 bytes, but if there's an xor leading into it, that's 7/8 bytes.
>
> Are there other size-increasing changes happening as side effects that I'm not accounting for? It's also not clear why this is a perf win for Jaguar. Sorry for taking this long to ask, but why is test+set slower?

With this change the total size of openssl is slightly smaller, by less than 0.1%.
This is because, while some matches led to bigger code, the majority led to smaller code by removing 1 or 2 instructions per match. So maybe we should not guard this with optForSize, since the size can go either way?

Here are the text sizes from openssl with and without the change.

  12458547 libcrypto.lzcnt.a.txt
  12460372 libcrypto.nolzcnt.a.txt
  -> 0.01% size decrease with change enabled
  2453571 libssl.lzcnt.a.txt
  2454996 libssl.nolzcnt.a.txt
  -> 0.05% size decrease with change enabled

Here is an example from libcrypto where 2 instructions are saved (new lzcnt/shr sequence first, old xor/test/sete/mov sequence second):

  f3 45 0f bd f6       	lzcnt  %r14d,%r14d
  41 c1 ee 05          	shr    $0x5,%r14d

  31 c0                	xor    %eax,%eax
  45 85 f6             	test   %r14d,%r14d
  0f 94 c0             	sete   %al
  41 89 c6             	mov    %eax,%r14d

For speed measurements I am running a microbenchmark built with Google's benchmark library.
Let me know if you want me to email you the source.
Here are the results on a Jaguar CPU.
"old " means without the optimisation.
"new " means with the optimisation.
f1 is for the icmp pattern
f2 is for the icmp/icmp/or pattern
The numbers 8/512/8k are the number of iterations for which the function is run.
Each function contains a block of 100 inline assembly sequences corresponding to the test cases in the patch.
My understanding is that the perf win observed in those results comes from executing fewer instructions, all of which have the same latency/throughput of 1/0.5.

  Run on (6 X 1593.74 MHz CPU s)
  2016-08-18 19:00:36
  Benchmark                  Time           CPU Iterations
  --------------------------------------------------------
  BM_f1_old/8              784 ns        784 ns     893325
  BM_f1_old/512          49911 ns      49911 ns      14025
  BM_f1_old/8k          798898 ns     798898 ns        876
  BM_f1_new/8              585 ns        585 ns    1196970
  BM_f1_new/512          37170 ns      37170 ns      18830
  BM_f1_new/8k          595136 ns     595135 ns       1175
  BM_f2_old/8/8          13573 ns      13574 ns      51548
  BM_f2_old/512/512   55446038 ns   55446001 ns         13
  BM_f2_old/8k/8k   14212025166 ns 14212028980 ns          1
  BM_f2_new/8/8           9126 ns       9127 ns      76692
  BM_f2_new/512/512   37212798 ns   37212874 ns         19
  BM_f2_new/8k/8k   9533737898 ns 9533742905 ns          1

Let me know if more details are required.

================
Comment at: test/CodeGen/X86/lzcnt-zext-cmp.ll:87-94
@@ +86,10 @@
+; CHECK-LABEL: foo5:
+; CHECK:       # BB#0:
+; CHECK-NEXT:    xorl %eax, %eax
+; CHECK-NEXT:    testw %di, %di
+; CHECK-NEXT:    sete %al
+; CHECK-NEXT:    # kill: %AX<def> %AX<kill> %EAX<kill>
+; CHECK-NEXT:    retq
+;
+; NOFASTLZCNT-LABEL: foo5:
+; NOFASTLZCNT:       # BB#0:
----------------
Sounds good, will do, thanks.

https://reviews.llvm.org/D23446