[PATCH] D15408: [AArch64/LoopUnrollRuntime] Don't avoid high-cost trip count computation on the AArch64

Fri Dec 11 09:49:02 PST 2015

Hi Junmo,

I tried out your patch on top of r254864 on a Juno board, running on
Cortex-A57, and I see the following results:

Performance Regressions - Execution Time                                      Δ
lnt.MultiSource/Benchmarks/Ptrdist/yacr2/yacr2                            9.17%
  <http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.170=3>
lnt.SingleSource/Benchmarks/Shootout-C++/ackermann                        8.02%
  <http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.264=3>
lnt.MultiSource/Benchmarks/Trimaran/enc-pc1/enc-pc1                       4.78%
  <http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.149=3>
spec.cpu2006.ref.445_gobmk                                                1.84%
  <http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.176=3>
spec.cpu2006.ref.483_xalancbmk                                            1.75%
  <http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.94=3>
spec.cpu2006.ref.471_omnetpp                                              1.43%
  <http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.294=3>
spec.cpu2000.ref.253_perlbmk                                              1.22%
  <http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.337=3>
lnt.SingleSource/Benchmarks/Polybench/linear-algebra/kernels/symm/symm    1.10%
  <http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.135=3>

Performance Improvements - Execution Time                                     Δ
lnt.MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan    -23.07%
  <http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.15=3>
lnt.SingleSource/Benchmarks/Shootout/sieve                               -9.50%
  <http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.40=3>
lnt.SingleSource/Benchmarks/BenchmarkGame/nsieve-bits                    -7.26%
  <http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.9=3>
lnt.SingleSource/Benchmarks/BenchmarkGame/recursive                      -3.42%
  <http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.316=3>
spec.cpu2006.ref.433_milc                                                -1.12%
  <http://llvm-test.cambridge.arm.com:8000/db_default/v4/nts/3523/graph?test.235=3>

While there are a few big jumps in the test-suite, I think the
regressions show this is not uniformly an improvement for performance.
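
For anyone following the thread, here is a rough hand-written sketch of the
trade-off being measured. It is not code from D15408 or from LLVM; the
function names and the unroll factor of 4 are made up for illustration, and
where the remainder loop is placed is an arbitrary choice. When the stride
is only known at run time, the unrolled version has to compute the trip
count with a division before it can start, and that up-front cost only pays
off if the loop runs long enough.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical example for illustration only.
// Precondition for both functions: stride >= 1.

// Reference loop: the stride is only known at run time.
std::uint64_t sum_strided(const std::uint32_t *a, std::size_t n,
                          std::size_t stride) {
  std::uint64_t s = 0;
  for (std::size_t i = 0; i < n; i += stride)
    s += a[i];
  return s;
}

// The same loop, runtime-unrolled by 4 by hand.
std::uint64_t sum_strided_unrolled(const std::uint32_t *a, std::size_t n,
                                   std::size_t stride) {
  // Trip count = ceil(n / stride). This division is the expensive part
  // of the setup, since stride is not a compile-time constant.
  std::size_t trips = (n == 0) ? 0 : (n - 1) / stride + 1;
  // Remainder by the unroll factor is cheap (a mask for a power of two).
  std::size_t rem = trips % 4;

  std::uint64_t s = 0;
  std::size_t i = 0;
  // Remainder loop: run the leftover iterations one at a time.
  for (std::size_t k = 0; k < rem; ++k, i += stride)
    s += a[i];
  // Unrolled body: four original iterations per back-edge.
  for (std::size_t k = rem; k < trips; k += 4, i += 4 * stride) {
    s += a[i];
    s += a[i + stride];
    s += a[i + 2 * stride];
    s += a[i + 3 * stride];
  }
  return s;
}
```

For short loops the division dominates and the unrolled form is slower; for
long loops the saved branches win. That would be consistent with seeing both
improvements and regressions in the numbers above.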

Thanks,

Kristof

On 11/12/2015 07:43, Junmo Park via llvm-commits wrote:
> flyingforyou added a comment.
>
> Thanks Zhaoshi.
>
> I've just run a set of benchmarks, including the test-suite, on Juno (Cortex-A57); there were many improvements and some regressions.
> The test-suite results show a 1.33% improvement and a 0.78% regression.
> The composite benchmark values are computed using the geometric mean.
>
> Actually, I found some regressions after r234846 was merged.
> url: http://reviews.llvm.org/D8994
>
> After that commit was merged, @hfinkel uploaded a new commit, r237947.
>
>> On X86 (and similar OOO cores) unrolling is very limited, and even if the runtime unrolling is otherwise profitable, the expense of a division to compute the trip count could greatly outweigh the benefits. On the A2, we unroll a lot, and the benefits of unrolling are more significant (seeing a 5x or 6x speedup is not uncommon), so we're more able to tolerate the expense, on average, of a division to compute the trip count.
>
> I totally agree with this comment. Most AArch64 cores have a hardware divider, including for floating point, so I think we can allow more unrolling opportunities.
>
>
> http://reviews.llvm.org/D15408