[PATCH] D23446: [X86] Enable setcc to srl(ctlz) transformation on btver2 architectures.
pierre gousseau via llvm-commits
llvm-commits at lists.llvm.org
Fri Aug 19 09:39:57 PDT 2016
pgousseau added a comment.
In https://reviews.llvm.org/D23446#520684, @spatel wrote:
> In https://reviews.llvm.org/D23446#519879, @pgousseau wrote:
>
> > With this change the total size of openssl is smaller by at most 0.05%.
> > This is because, while some matches led to bigger code, the majority led to smaller code by removing 1 or 2 instructions per match. So maybe we should not protect this with optForSize, since the size can go both ways?
> >
> > Here are the text size from openssl with and without the change.
> >
> > 12458547 libcrypto.lzcnt.a.txt
> > 12460372 libcrypto.nolzcnt.a.txt
> > -> 0.01% size decrease with change enabled
> > 2453571 libssl.lzcnt.a.txt
> > 2454996 libssl.nolzcnt.a.txt
> > -> 0.05% size decrease with change enabled
> >
> >
> > Here is an example from libcrypto where two instructions are saved:
> >
> > f3 45 0f bd f6 lzcnt %r14d,%r14d
> > 41 c1 ee 05 shr $0x5,%r14d
> >
> > 31 c0 xor %eax,%eax
> > 45 85 f6 test %r14d,%r14d
> > 0f 94 c0 sete %al
> > 41 89 c6 mov %eax,%r14d
>
>
> This is the size savings that I was imagining based on the test cases. Given that the transform may or may not *decrease* code size, we should not guard this with optForSize. Please remove that check from the code. You can leave the additional test cases for minsize/optsize (and add a comment to explain) or remove them too.
Sounds good, I will remove the optForSize check.
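The equivalence behind the transform can be checked directly: for a 32-bit value, lzcnt returns 32 only when the input is zero, so shifting the count right by 5 yields exactly the 0/1 result that the xor/test/sete sequence materialises. A minimal Python sketch (the helpers `lzcnt32` and `setcc_eq_zero` are illustrative names, not part of the patch):

```python
def lzcnt32(x):
    """Leading-zero count of a 32-bit value; lzcnt of 0 is 32 (unlike bsr)."""
    assert 0 <= x < 2**32
    return 32 - x.bit_length()

def setcc_eq_zero(x):
    """What the xor/test/sete sequence computes: 1 if x == 0, else 0."""
    return 1 if x == 0 else 0

# lzcnt32(x) is 32 only for x == 0, so shifting right by 5 (dividing by 32)
# produces 1 exactly when x == 0 and 0 otherwise.
for x in (0, 1, 5, 0x80000000, 0xFFFFFFFF):
    assert lzcnt32(x) >> 5 == setcc_eq_zero(x)
print("equivalence holds")
```

This is also why the transform is only safe on targets with a true lzcnt instruction (such as btver2): the older bsr instruction leaves its result undefined for a zero input.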
>
> > For speed measurements I am running a microbenchmark using google's libbenchmark.
> > Let me know if you want me to email you the source.
> > Here are the results on a jaguar cpu.
> > "old" means without the optimisation; "new" means with the optimisation.
> > f1 is for the icmp pattern; f2 is for the icmp/icmp/or pattern.
> > The numbers 8/512/8k are the number of iterations the function is run for.
> > The functions contain 100 blocks of inline assembly corresponding to the test cases in the patch.
> > My understanding is that the perf win observed in these results comes from executing fewer instructions, all of which have the same latency/throughput of 1/0.5.
> >
> > Run on (6 X 1593.74 MHz CPU s)
> > 2016-08-18 19:00:36
> > Benchmark                       Time            CPU    Iterations
> > -----------------------------------------------------------------
> > BM_f1_old/8                   784 ns         784 ns        893325
> > BM_f1_old/512               49911 ns       49911 ns         14025
> > BM_f1_old/8k               798898 ns      798898 ns           876
> > BM_f1_new/8                   585 ns         585 ns       1196970
> > BM_f1_new/512               37170 ns       37170 ns         18830
> > BM_f1_new/8k               595136 ns      595135 ns          1175
> > BM_f2_old/8/8               13573 ns       13574 ns         51548
> > BM_f2_old/512/512        55446038 ns    55446001 ns            13
> > BM_f2_old/8k/8k       14212025166 ns 14212028980 ns             1
> > BM_f2_new/8/8                9126 ns        9127 ns         76692
> > BM_f2_new/512/512        37212798 ns    37212874 ns            19
> > BM_f2_new/8k/8k        9533737898 ns  9533742905 ns             1
>
>
> Please correct me if I'm not understanding: this ctlz patch gives us 34% better perf; the planned follow-on will raise that to +49%.
> This is much bigger than I expected. I see that we can save a few xor and mov instructions, but my mental model was that those are nearly free. Is it possible that we've managed to shrink the decode-limited inner loop past some HW limit? Is code alignment a factor?
Yes, this benchmark does show improvements of that magnitude; for example, BM_f1_new/8 is ~25% faster than BM_f1_old/8.
However, I have now rerun the SPEC h264ref benchmark and, unfortunately, I am seeing a small (less than 1%) but consistent performance degradation with h264ref that I was not seeing with the initial tablegen patch.
I am not sure why the micro-benchmark does not show this. One hypothesis is that the micro-benchmark only exercises the 32-bit "register, register" variant of the change, so I may need to restrict the transformation to that pattern.
Alternatively, the micro-benchmark may simply not be representative of real performance, in which case this change is probably not worth pursuing. I will come back with my findings.
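For reference, the percentage figures quoted in this thread can be recomputed from the benchmark rows above; the two formulas below are one plausible reading of "faster" (relative throughput gain vs. time reduction), using the 8k-iteration rows for the headline numbers:

```python
# Benchmark times (ns) taken from the table above.
f1_old_8k, f1_new_8k = 798898, 595136            # BM_f1_old/8k, BM_f1_new/8k
f2_old, f2_new = 14212025166, 9533737898         # BM_f2_old/8k/8k, BM_f2_new/8k/8k
f1_old_8, f1_new_8 = 784, 585                    # BM_f1_old/8, BM_f1_new/8

def speedup_pct(old, new):
    """Relative speedup: how much more work per unit time the new code does."""
    return (old / new - 1) * 100

def time_reduction_pct(old, new):
    """Fractional reduction in wall time."""
    return (old - new) / old * 100

print(round(speedup_pct(f1_old_8k, f1_new_8k)))      # 34  (the "+34%" figure)
print(round(speedup_pct(f2_old, f2_new)))            # 49  (the "+49%" figure)
print(round(time_reduction_pct(f1_old_8, f1_new_8))) # 25  (the "~25% faster")
```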
> @andreadb / @RKSimon, do you have any insight/explanation for the perf difference?
https://reviews.llvm.org/D23446