[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

Thu Feb 16 02:46:15 PST 2017

On 15 Feb 2017, at 19:10, Chandler Carruth <chandlerc at gmail.com> wrote:

Thanks for running these Kristof!

I'd still like to hear from Apple, and if we can get a few more x86 micro-architectures covered that'd be great, but it looks like -O3 is uncontroversial, and the question is whether this makes sense at O2...

To me, it would help a lot to know the actual breakdown of benchmarks such as yours, Kristof (as they seem to have more code size impact than others have mentioned). Specifically, are the runtime improvements correlated with the code size increases? And what are the absolute size deltas? For *very* small benchmarks, a 5% code size fluctuation seems less concerning than for a larger benchmark. If the larger code size changes are mostly in smaller benchmarks and reasonably correlated with the ones likely to see improvement from the change (this seemed to be the case w/ Dehao's data on x86, for example), that would to me indicate this makes sense at O2.

Note that I'm fine if you have to list the benchmarks as "1, 2, 3, ..." or whatever, much like we did for Google-internal benchmarks. It's still useful to know the shape of the change.

With this being data from a few hundred programs, I don't think listing the data in a long table really helps in getting a feel for the overall structure of the data.
Instead, I created a few scatter plots that hopefully help in getting a better feel for the overall effect of the patch. The charts below are for the Cortex-A57 numbers. I decided not to produce a chart for Cortex-A53 as the shape of the data didn't seem very different. The optimization level used is -O3 -fomit-frame-pointer, targeting AArch64 Linux.

The first chart shows relative code size increase (vertical axis) vs absolute code size:
The biggest relative code size increases indeed didn't happen for the biggest programs, but instead for a few programs weighing in at about 100KB.
I'm assuming the Google benchmark set covers much bigger programs than the ones displayed here.
FWIW, the cluster of programs of about 100KB whose code size increases by between 60% and 80% all come from MultiSource/Benchmarks/TSVC. Interestingly, these programs seem to have float and double variants (e.g. MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt and MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl), and the code size bloat only happens for the double variants. I think it may still be worthwhile to check whether this also happens on other architectures, and why it happens only for the double variants and not the float variants.

[Chart: unroll_codesize_absolute_vs_relative.png]
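
A minimal sketch of how the points of such a chart could be derived, assuming a per-program table of code sizes before and after the patch. The program names and byte counts below are made-up placeholders, not data from this thread, and the helper names are hypothetical:

```python
# Hypothetical (size_before, size_after) in bytes per program.
# These values are illustrative only; they are NOT measurements from this thread.
sizes = {
    "prog_a": (100_000, 170_000),
    "prog_b": (2_000_000, 2_040_000),
    "prog_c": (50_000, 50_500),
}


def relative_increase(before, after):
    """Relative code size increase in percent."""
    return (after - before) / before * 100.0


def scatter_points(table):
    """Map each program to (absolute size before, relative increase in %)."""
    return {
        name: (before, relative_increase(before, after))
        for name, (before, after) in table.items()
    }


points = scatter_points(sizes)
# List programs from largest to smallest relative increase.
for name, (absolute, rel) in sorted(points.items(), key=lambda kv: -kv[1][1]):
    print(f"{name}: {absolute} bytes, {rel:+.1f}%")
```

Sorting by relative increase makes it easy to see whether the biggest relative changes belong to the biggest programs or to small ones.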

The second chart shows relative code size increase (vertical axis) vs relative performance improvement (horizontal axis):
I manually checked the cause of the 3 biggest performance regressions (proprietary benchmark1: -13.70%; MultiSource/Applications/hexxagon/hexxagon: -10.10%; MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow: -5.23%).
For the proprietary benchmark and hexxagon, the code generation didn't change for the hottest parts, so the regression is probably caused by micro-architectural effects of code layout changes.
For fourinarow, there seemed to be a lot more spill/fill code, so the regression is probably due to suboptimal register allocation.

[Chart: unroll_codesize_vs_performance.png]
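
One way to put a number on "are the runtime improvements correlated with the code size increases?" is a Pearson correlation coefficient over the per-program pairs. The following is only a sketch under that assumption; the thread itself relies on visual inspection of the scatter plot, and the sample values are invented for illustration:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Invented per-program values (percent), NOT measurements from this thread.
size_increase = [0.5, 2.0, 5.0, 12.0, 70.0]
perf_improvement = [0.1, -0.3, 1.2, 3.0, 4.5]
print(f"r = {pearson(size_increase, perf_improvement):.2f}")
```

A value near 1 would suggest that the programs paying the code size cost are also the ones getting faster; a value near 0 would suggest the cost and the benefit land on different programs. A few outliers can dominate this statistic, so it complements the chart rather than replacing it.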

The third chart below just zooms in on the above chart to the -5% to 5% performance improvement range:
[Chart: unroll_codesize_vs_performance_zoom.png]

Whether to enable the increase in unroll threshold only at O3 or also at O2: I don't have a strong opinion based on the above data.
Maybe the compile time impact is what should be driving that discussion the most? I'm afraid I don't have compile time numbers.
Ultimately, I guess this boils down to what exactly the difference is in intent between O2 and O3, which seems like a never-ending discussion...

Hoping you find this useful,

Kristof

On Tue, Feb 14, 2017 at 1:06 PM Kristof Beyls via llvm-dev <llvm-dev at lists.llvm.org> wrote:
I've run the patch at https://reviews.llvm.org/D28368 on the test-suite and other benchmarks, for AArch64 -O3 -fomit-frame-pointer, both for Cortex-A53 and Cortex-A57.

The geomean over the few hundred programs in there is roughly the same for Cortex-A53 and Cortex-A57: a bit over 1% improvement in execution speed for a bit over 5% increase in code size.
Obviously I wouldn't want this for optimization levels where code size is of any concern, like -Os or -Oz, but I don't have a problem with this going in for other optimization levels where code size isn't a concern.
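
For readers unfamiliar with the geomean figure quoted above, here is a minimal sketch of how such a summary is commonly computed. It assumes per-program ratios (old execution time / new execution time for speed, new size / old size for code size); the exact procedure used for the numbers in this mail is not stated, and the ratios below are placeholders:

```python
import math


def geomean(ratios):
    """Geometric mean of strictly positive ratios."""
    if not ratios:
        raise ValueError("need at least one ratio")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be strictly positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Placeholder ratios, NOT measurements from this thread.
speedups = [1.00, 1.03, 0.98, 1.05]      # old_time / new_time per program
size_ratios = [1.02, 1.10, 1.00, 1.08]   # new_size / old_size per program

print(f"speed geomean: {(geomean(speedups) - 1) * 100:+.2f}%")
print(f"size geomean:  {(geomean(size_ratios) - 1) * 100:+.2f}%")
```

The geometric mean is the usual choice for averaging ratios because a 2x speedup and a 2x slowdown cancel out exactly, which an arithmetic mean would not do.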

Thanks,

Kristof

-------------- next part --------------
A non-text attachment was scrubbed...
Name: unroll_codesize_absolute_vs_relative.png
Type: image/png
Size: 86966 bytes
Desc: unroll_codesize_absolute_vs_relative.png
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170216/b7a68c98/attachment-0003.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: unroll_codesize_vs_performance.png
Type: image/png
Size: 84065 bytes
Desc: unroll_codesize_vs_performance.png
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170216/b7a68c98/attachment-0004.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: unroll_codesize_vs_performance_zoom.png
Type: image/png
Size: 103095 bytes
Desc: unroll_codesize_vs_performance_zoom.png
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170216/b7a68c98/attachment-0005.png>