[PATCH] D23446: [X86] Enable setcc to srl(ctlz) transformation on btver2 architectures.
Sanjay Patel via llvm-commits
llvm-commits at lists.llvm.org
Fri Aug 19 06:59:35 PDT 2016
spatel added a comment.
In https://reviews.llvm.org/D23446#519879, @pgousseau wrote:
> With this change the total size of openssl is smaller by at most 0.05%.
> This is because, while some matches led to bigger code, the majority led to smaller code by removing 1 or 2 instructions per match. So maybe we should not protect this with optForSize, because it seems the size can go both ways?
>
> Here are the text size from openssl with and without the change.
>
> 12458547 libcrypto.lzcnt.a.txt
> 12460372 libcrypto.nolzcnt.a.txt
> -> 0.01% size decrease with change enabled
> 2453571 libssl.lzcnt.a.txt
> 2454996 libssl.nolzcnt.a.txt
> -> 0.05% size decrease with change enabled
>
>
> Here is an example from libcrypto where 2 instructions are saved:
>
> f3 45 0f bd f6 lzcnt %r14d,%r14d
> 41 c1 ee 05 shr $0x5,%r14d
>
> 31 c0 xor %eax,%eax
> 45 85 f6 test %r14d,%r14d
> 0f 94 c0 sete %al
> 41 89 c6 mov %eax,%r14d
This is the size savings that I was imagining based on the test cases. Given that the transform may or may not *decrease* code size, we should not guard this with optForSize. Please remove that check from the code. You can leave the additional test cases for minsize/optsize (and add a comment to explain) or remove them too.
> For speed measurements I am running a microbenchmark using google's libbenchmark.
> Let me know if you want me to email you the source.
> Here are the results on a jaguar cpu.
> "old " means without the optimisation.
> "new " means with the optimisation.
> f1 is for the icmp pattern
> f2 is for the icmp/icmp/or pattern
> The numbers 8/512/8k are the number of iterations the function is run for.
> The functions contain 100 blocks of inline assembly corresponding to the test cases in the patch.
> My understanding is that the perf win observed in those results comes from executing fewer instructions, all of which have the same latency/throughput of 1/0.5.
>
> Run on (6 X 1593.74 MHz CPU s)
> 2016-08-18 19:00:36
> Benchmark Time CPU Iterations
> --------------------------------------------------------
> BM_f1_old/8 784 ns 784 ns 893325
> BM_f1_old/512 49911 ns 49911 ns 14025
> BM_f1_old/8k 798898 ns 798898 ns 876
> BM_f1_new/8 585 ns 585 ns 1196970
> BM_f1_new/512 37170 ns 37170 ns 18830
> BM_f1_new/8k 595136 ns 595135 ns 1175
> BM_f2_old/8/8 13573 ns 13574 ns 51548
> BM_f2_old/512/512 55446038 ns 55446001 ns 13
> BM_f2_old/8k/8k 14212025166 ns 14212028980 ns 1
> BM_f2_new/8/8 9126 ns 9127 ns 76692
> BM_f2_new/512/512 37212798 ns 37212874 ns 19
> BM_f2_new/8k/8k 9533737898 ns 9533742905 ns 1
Please correct me if I'm not understanding: this ctlz patch gives us 34% better perf; the planned follow-on will raise that to +49%.
This is much bigger than I expected. I see that we can save a few xor and mov instructions, but my mental model was that those are nearly free. Is it possible that we've managed to shrink the decode-limited inner loop past some HW limit? Is code alignment a factor?
@andreadb / @RKSimon, do you have any insight/explanation for the perf difference?
https://reviews.llvm.org/D23446