[PATCH] D15408: [AArch64/LoopUnrollRuntime] Don't avoid high-cost trip count computation on the AArch64

Thu Dec 10 22:43:16 PST 2015

flyingforyou added a comment.

Thanks Zhaoshi.

I've just run a set of benchmarks, including the test-suite, on Juno (Cortex-A57). There were many improvements and some regressions.
The test-suite results show a 1.33% improvement and a 0.78% regression.
The composite benchmark value is computed as a geometric mean.
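
For reference, a minimal sketch of how such a composite value can be computed. The function name `geometricMean` and the use of per-benchmark ratios (for example new time divided by old time) are assumptions for illustration, not the actual test-suite scripts.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: geometric mean of per-benchmark ratios.
// Summing logarithms avoids overflow/underflow of a running product.
double geometricMean(const std::vector<double> &ratios) {
  if (ratios.empty())
    return 1.0; // assumption: an empty set is treated as "no change"
  double logSum = 0.0;
  for (double r : ratios) {
    assert(r > 0.0 && "ratios must be positive");
    logSum += std::log(r);
  }
  return std::exp(logSum / static_cast<double>(ratios.size()));
}
```

A result below 1.0 would then mean the benchmarks got faster on average, if the ratios are new time over old time.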

Actually, I found some regressions after r234846 was merged.
url: http://reviews.llvm.org/D8994

After that commit was merged, @hfinkel committed r237947.

> On X86 (and similar OOO cores) unrolling is very limited, and even if the runtime unrolling is otherwise profitable, the expense of a division to compute the trip count could greatly outweigh the benefits. On the A2, we unroll a lot, and the benefits of unrolling are more significant (seeing a 5x or 6x speedup is not uncommon), so we're more able to tolerate the expense, on average, of a division to compute the trip count.

I totally agree with this comment. Most AArch64 cores have a hardware divider, including for floating point, so I think we can afford more unrolling opportunities.
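
To make the trade-off concrete, here is a hand-written sketch of what runtime unrolling looks like at the source level when the stride is only known at run time. It is not the output of LoopUnrollRuntime; the function name `sumStrided`, the unroll factor of 4, and the placement of the remainder loop before the unrolled loop are assumptions for illustration. The point is that the trip count needs one integer division up front, and that cost is paid once per loop entry.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical example: sum a[0], a[step], a[2*step], ... for indices < n.
std::uint64_t sumStrided(const std::uint64_t *a, std::uint64_t n,
                         std::uint64_t step) {
  assert(step != 0 && "stride must be non-zero");
  if (n == 0)
    return 0;

  // "High-cost" trip count: ceil(n / step), written to avoid overflow.
  // With a run-time stride this is a real division instruction.
  std::uint64_t tripCount = (n - 1) / step + 1;

  std::uint64_t sum = 0;
  std::uint64_t i = 0;

  // Remainder loop: runs tripCount % 4 iterations.
  for (std::uint64_t r = tripCount % 4; r != 0; --r, i += step)
    sum += a[i];

  // Unrolled loop: runs tripCount / 4 times, four elements per iteration.
  for (std::uint64_t k = tripCount / 4; k != 0; --k, i += 4 * step)
    sum += a[i] + a[i + step] + a[i + 2 * step] + a[i + 3 * step];

  return sum;
}
```

If the division is cheap relative to the loop body, as with a hardware divider, the unrolled form should win for all but very short loops.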

http://reviews.llvm.org/D15408