[PATCH] D23446: [X86] Enable setcc to srl(ctlz) transformation on btver2 architectures.
Sanjay Patel via llvm-commits
llvm-commits at lists.llvm.org
Fri Aug 19 06:59:35 PDT 2016
spatel added a comment.
In https://reviews.llvm.org/D23446#519879, @pgousseau wrote:
> With this change the total size of openssl is smaller by at most 0.05%.
> This is because, while some matches led to bigger code, the majority led to smaller code by removing 1 or 2 instructions per match. So maybe we should not protect this with optForSize, because it seems the size can go both ways?
>
> Here are the text size from openssl with and without the change.
>
> 12458547 libcrypto.lzcnt.a.txt
> 12460372 libcrypto.nolzcnt.a.txt
> -> 0.01% size decrease with change enabled
> 2453571 libssl.lzcnt.a.txt
> 2454996 libssl.nolzcnt.a.txt
> -> 0.05% size decrease with change enabled
>
>
> Here is an example from libcrypto where 2 instructions are saved:
>
> f3 45 0f bd f6 lzcnt %r14d,%r14d
> 41 c1 ee 05 shr $0x5,%r14d
>
> 31 c0 xor %eax,%eax
> 45 85 f6 test %r14d,%r14d
> 0f 94 c0 sete %al
> 41 89 c6 mov %eax,%r14d
This is the size savings that I was imagining based on the test cases. Given that the transform may or may not *decrease* code size, we should not guard this with optForSize. Please remove that check from the code. You can leave the additional test cases for minsize/optsize (and add a comment to explain) or remove them too.
> For speed measurements I am running a microbenchmark using google's libbenchmark.
> Let me know if you want me to email you the source.
> Here are the results on a jaguar cpu.
> "old " means without the optimisation.
> "new " means with the optimisation.
> f1 is for the icmp pattern
> f2 is for the icmp/icmp/or pattern
> The numbers 8/512/8k are the number of iterations the function is run for.
> The functions contain 100 blocks of inline assembly corresponding to the test cases in the patch.
> My understanding is that the perf win observed in those results comes from executing fewer instructions, all of which have the same latency/throughput of 1/0.5.
>
> Run on (6 X 1593.74 MHz CPU s)
> 2016-08-18 19:00:36
> Benchmark Time CPU Iterations
> --------------------------------------------------------
> BM_f1_old/8 784 ns 784 ns 893325
> BM_f1_old/512 49911 ns 49911 ns 14025
> BM_f1_old/8k 798898 ns 798898 ns 876
> BM_f1_new/8 585 ns 585 ns 1196970
> BM_f1_new/512 37170 ns 37170 ns 18830
> BM_f1_new/8k 595136 ns 595135 ns 1175
> BM_f2_old/8/8 13573 ns 13574 ns 51548
> BM_f2_old/512/512 55446038 ns 55446001 ns 13
> BM_f2_old/8k/8k 14212025166 ns 14212028980 ns 1
> BM_f2_new/8/8 9126 ns 9127 ns 76692
> BM_f2_new/512/512 37212798 ns 37212874 ns 19
> BM_f2_new/8k/8k 9533737898 ns 9533742905 ns 1
Please correct me if I'm not understanding: this ctlz patch gives us 34% better perf; the planned follow-on will raise that to +49%.
This is much bigger than I expected. I see that we can save a few xor and mov instructions, but my mental model was that those are nearly free. Is it possible that we've managed to shrink the decode-limited inner loop past some HW limit? Is code alignment a factor?
@andreadb / @RKSimon, do you have any insight/explanation for the perf difference?
https://reviews.llvm.org/D23446