[PATCH] D23446: [X86] Enable setcc to srl(ctlz) transformation on btver2 architectures.
pierre gousseau via llvm-commits
llvm-commits at lists.llvm.org
Fri Aug 19 09:39:57 PDT 2016
pgousseau added a comment.
In https://reviews.llvm.org/D23446#520684, @spatel wrote:
> In https://reviews.llvm.org/D23446#519879, @pgousseau wrote:
>
> > With this change the total size of openssl is smaller by at most 0.05%.
> > This is because, while some matches led to bigger code, the majority led to smaller code by removing 1 or 2 instructions per match. So maybe we should not protect this with optForSize, since the size can go both ways?
> >
> > Here are the text size from openssl with and without the change.
> >
> > 12458547 libcrypto.lzcnt.a.txt
> > 12460372 libcrypto.nolzcnt.a.txt
> > -> 0.01% size decrease with change enabled
> > 2453571 libssl.lzcnt.a.txt
> > 2454996 libssl.nolzcnt.a.txt
> > -> 0.05% size decrease with change enabled
> >
> >
> > Here is an example from libcrypto where two instructions are saved:
> >
> > f3 45 0f bd f6 lzcnt %r14d,%r14d
> > 41 c1 ee 05 shr $0x5,%r14d
> >
> > 31 c0 xor %eax,%eax
> > 45 85 f6 test %r14d,%r14d
> > 0f 94 c0 sete %al
> > 41 89 c6 mov %eax,%r14d
>
>
> This is the size savings that I was imagining based on the test cases. Given that the transform may or may not *decrease* code size, we should not guard this with optForSize. Please remove that check from the code. You can leave the additional test cases for minsize/optsize (and add a comment to explain) or remove them too.
Sounds good, I will remove the optForSize check.
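The equivalence behind the transform can be checked directly: for a 32-bit value, lzcnt returns 32 only when the input is zero, so shifting the count right by 5 yields exactly the 0/1 result that the xor/test/sete sequence materialises. A minimal Python sketch (the helpers `lzcnt32` and `setcc_eq_zero` are illustrative names, not part of the patch):

```python
def lzcnt32(x):
    """Leading-zero count of a 32-bit value; lzcnt of 0 is 32 (unlike bsr)."""
    assert 0 <= x < 2**32
    return 32 - x.bit_length()

def setcc_eq_zero(x):
    """What the xor/test/sete sequence computes: 1 if x == 0, else 0."""
    return 1 if x == 0 else 0

# lzcnt32(x) is 32 only for x == 0, so shifting right by 5 (dividing by 32)
# produces 1 exactly when x == 0 and 0 otherwise.
for x in (0, 1, 5, 0x80000000, 0xFFFFFFFF):
    assert lzcnt32(x) >> 5 == setcc_eq_zero(x)
print("equivalence holds")
```

This is also why the transform is only safe on targets with a true lzcnt instruction (such as btver2): the older bsr instruction leaves its result undefined for a zero input.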
>
> > For speed measurements I am running a microbenchmark using google's libbenchmark.
> > Let me know if you want me to email you the source.
> > Here are the results on a jaguar cpu.
> > "old" means without the optimisation; "new" means with the optimisation.
> > f1 is for the icmp pattern; f2 is for the icmp/icmp/or pattern.
> > The numbers 8/512/8k are the number of iterations the function is run for.
> > The functions contain 100 blocks of inline assembly corresponding to the test cases in the patch.
> > My understanding is that the perf win observed in these results comes from executing fewer instructions, all of which have the same latency/throughput of 1/0.5.
> >
> > Run on (6 X 1593.74 MHz CPU s)
> > 2016-08-18 19:00:36
> > Benchmark                       Time            CPU    Iterations
> > -----------------------------------------------------------------
> > BM_f1_old/8                   784 ns         784 ns        893325
> > BM_f1_old/512               49911 ns       49911 ns         14025
> > BM_f1_old/8k               798898 ns      798898 ns           876
> > BM_f1_new/8                   585 ns         585 ns       1196970
> > BM_f1_new/512               37170 ns       37170 ns         18830
> > BM_f1_new/8k               595136 ns      595135 ns          1175
> > BM_f2_old/8/8               13573 ns       13574 ns         51548
> > BM_f2_old/512/512        55446038 ns    55446001 ns            13
> > BM_f2_old/8k/8k       14212025166 ns 14212028980 ns             1
> > BM_f2_new/8/8                9126 ns        9127 ns         76692
> > BM_f2_new/512/512        37212798 ns    37212874 ns            19
> > BM_f2_new/8k/8k        9533737898 ns  9533742905 ns             1
>
>
> Please correct me if I'm not understanding: this ctlz patch gives us 34% better perf; the planned follow-on will raise that to +49%.
> This is much bigger than I expected. I see that we can save a few xor and mov instructions, but my mental model was that those are nearly free. Is it possible that we've managed to shrink the decode-limited inner loop past some HW limit? Is code alignment a factor?
Yes, this benchmark does show improvements of that magnitude; for example, BM_f1_new/8 is ~25% faster than BM_f1_old/8.
However, I have now rerun the SPEC h264ref benchmark and, unfortunately, I am seeing a small (less than 1%) but consistent performance degradation with h264ref that I was not seeing with the initial tablegen patch.
I am not sure why the micro-benchmark does not show this. One hypothesis is that the micro-benchmark only exercises the 32-bit "register, register" variant of the change, so I may need to restrict the transformation to that pattern.
Alternatively, the micro-benchmark may simply not be representative of real performance, in which case this change is probably not worth pursuing. I will come back with my findings.
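For reference, the percentage figures quoted in this thread can be recomputed from the benchmark rows above; the two formulas below are one plausible reading of "faster" (relative throughput gain vs. time reduction), using the 8k-iteration rows for the headline numbers:

```python
# Benchmark times (ns) taken from the table above.
f1_old_8k, f1_new_8k = 798898, 595136            # BM_f1_old/8k, BM_f1_new/8k
f2_old, f2_new = 14212025166, 9533737898         # BM_f2_old/8k/8k, BM_f2_new/8k/8k
f1_old_8, f1_new_8 = 784, 585                    # BM_f1_old/8, BM_f1_new/8

def speedup_pct(old, new):
    """Relative speedup: how much more work per unit time the new code does."""
    return (old / new - 1) * 100

def time_reduction_pct(old, new):
    """Fractional reduction in wall time."""
    return (old - new) / old * 100

print(round(speedup_pct(f1_old_8k, f1_new_8k)))      # 34  (the "+34%" figure)
print(round(speedup_pct(f2_old, f2_new)))            # 49  (the "+49%" figure)
print(round(time_reduction_pct(f1_old_8, f1_new_8))) # 25  (the "~25% faster")
```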
> @andreadb / @RKSimon, do you have any insight/explanation for the perf difference?
https://reviews.llvm.org/D23446