[PATCH] D23446: [X86] Enable setcc to srl(ctlz) transformation on btver2 architectures.
pierre gousseau via llvm-commits
llvm-commits at lists.llvm.org
Thu Aug 18 11:34:45 PDT 2016
pgousseau added a comment.
In https://reviews.llvm.org/D23446#519626, @spatel wrote:
> In https://reviews.llvm.org/D23446#519225, @pgousseau wrote:
>
> > - Disable transform if optForSize is true
>
>
> optForSize is a very gray area: we allow speed optimizations even if they increase size if the speed vs. size trade-off is "large" for some definition of "large".
>
> Can you post the detailed perf and size differences you're seeing with this change? I don't think the size change can be that big from what you've posted: lzcnt+shr is 7 bytes; {test/inc}+set is 5/6 bytes, but if there's an xor leading into it, that's 7/8 bytes.
>
> Are there other size-increasing changes happening as side effects that I'm not accounting for? It's also not clear why this is a perf win for Jaguar. Sorry for taking this long to ask, but why is test+set slower?
With this change the total text size of openssl is slightly smaller, by less than 0.1%.
This is because, while some matches led to bigger code, the majority led to smaller code by removing 1 or 2 instructions per match. So maybe we should not guard this with optForSize, since the size can go either way?
Here are the text sizes from openssl with and without the change.
12458547 libcrypto.lzcnt.a.txt
12460372 libcrypto.nolzcnt.a.txt
-> 0.01% size decrease with change enabled
2453571 libssl.lzcnt.a.txt
2454996 libssl.nolzcnt.a.txt
-> 0.06% size decrease with change enabled
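For reference, the percentages can be recomputed from the byte counts above. This is only a minimal sketch; the helper name percent_decrease is made up for illustration and is not part of the patch or of any tool used here. With the numbers above it gives roughly 0.015% for libcrypto and 0.058% for libssl.

```cpp
// Hypothetical helper: relative size decrease, in percent, when going
// from `without_change` bytes to `with_change` bytes of text.
inline double percent_decrease(double without_change, double with_change) {
    return (without_change - with_change) / without_change * 100.0;
}
```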
Here is an example from libcrypto where 2 instructions are saved (the first two lines are the lzcnt sequence, the last four are the sequence it replaces):
f3 45 0f bd f6 lzcnt %r14d,%r14d
41 c1 ee 05 shr $0x5,%r14d
31 c0 xor %eax,%eax
45 85 f6 test %r14d,%r14d
0f 94 c0 sete %al
41 89 c6 mov %eax,%r14d
For speed measurements I am running a microbenchmark using Google's benchmark library (libbenchmark).
Let me know if you want me to email you the source.
Here are the results on a Jaguar CPU.
"old" means without the optimisation.
"new" means with the optimisation.
f1 is for the icmp pattern.
f2 is for the icmp/icmp/or pattern.
The numbers 8/512/8k are the number of iterations each function is run.
Each function contains a block of 100 inline assembly sequences corresponding to the test cases in the patch.
My understanding is that the perf win observed in these results comes from having fewer instructions, all of which have the same latency/throughput of 1/0.5.
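To make the two patterns concrete, here is a minimal C++ sketch of why the rewrite is valid for 32-bit values. It assumes lzcnt semantics (the count is 32 for a zero input, so bit 5 of the result is set only when the input is zero). The function names are hypothetical, and lzcnt32 is a portable stand-in written for this example, not the intrinsic or the code in the patch.

```cpp
#include <cstdint>

// Portable stand-in for a 32-bit lzcnt: counts leading zero bits and,
// unlike __builtin_clz, is defined for x == 0 (returns 32).
inline uint32_t lzcnt32(uint32_t x) {
    uint32_t n = 0;
    for (uint32_t mask = 0x80000000u; mask != 0 && (x & mask) == 0; mask >>= 1)
        ++n;
    return n;
}

// f1, old form: zext(icmp eq x, 0), i.e. xor/test/sete.
inline uint32_t is_zero_setcc(uint32_t x) { return x == 0 ? 1u : 0u; }

// f1, new form: the count is in 0..32, so shifting right by 5 leaves 1
// only when the count is exactly 32, i.e. when x == 0.
inline uint32_t is_zero_lzcnt(uint32_t x) { return lzcnt32(x) >> 5; }

// f2, old form: or of two compares against zero.
inline uint32_t either_zero_setcc(uint32_t a, uint32_t b) {
    return (a == 0 || b == 0) ? 1u : 0u;
}

// f2, new form: bit 5 of the or is set iff at least one count is 32.
inline uint32_t either_zero_lzcnt(uint32_t a, uint32_t b) {
    return (lzcnt32(a) | lzcnt32(b)) >> 5;
}
```

The f2 form needs only one shift for both operands, which is where the extra instruction savings for that pattern would come from.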
Run on (6 X 1593.74 MHz CPU s)
2016-08-18 19:00:36
Benchmark                        Time              CPU  Iterations
------------------------------------------------------------------
BM_f1_old/8                    784 ns           784 ns      893325
BM_f1_old/512                49911 ns         49911 ns       14025
BM_f1_old/8k                798898 ns        798898 ns         876
BM_f1_new/8                    585 ns           585 ns     1196970
BM_f1_new/512                37170 ns         37170 ns       18830
BM_f1_new/8k                595136 ns        595135 ns        1175
BM_f2_old/8/8                13573 ns         13574 ns       51548
BM_f2_old/512/512         55446038 ns      55446001 ns          13
BM_f2_old/8k/8k        14212025166 ns   14212028980 ns           1
BM_f2_new/8/8                 9126 ns          9127 ns       76692
BM_f2_new/512/512         37212798 ns      37212874 ns          19
BM_f2_new/8k/8k         9533737898 ns    9533742905 ns           1
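Read as old/new time ratios, these results work out to a speedup of about 1.34x for f1 and about 1.49x for f2 at every iteration count. A minimal sketch of that arithmetic follows; the helper name speedup is made up for illustration and is not part of the benchmark source.

```cpp
// Hypothetical helper: ratio of old to new time. Values above 1.0 mean
// the new code is faster.
inline double speedup(double old_ns, double new_ns) {
    return old_ns / new_ns;
}
```

For example, speedup(784, 585) covers BM_f1 at 8 iterations and speedup(14212025166, 9533737898) covers BM_f2 at 8k/8k.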
Let me know if more details are required.
================
Comment at: test/CodeGen/X86/lzcnt-zext-cmp.ll:87-94
@@ +86,10 @@
+; CHECK-LABEL: foo5:
+; CHECK: # BB#0:
+; CHECK-NEXT: xorl %eax, %eax
+; CHECK-NEXT: testw %di, %di
+; CHECK-NEXT: sete %al
+; CHECK-NEXT: # kill: %AX<def> %AX<kill> %EAX<kill>
+; CHECK-NEXT: retq
+;
+; NOFASTLZCNT-LABEL: foo5:
+; NOFASTLZCNT: # BB#0:
----------------
Sounds good, will do, thanks.
https://reviews.llvm.org/D23446