[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

Thu Feb 16 19:06:06 PST 2017

On Thu, Feb 16, 2017 at 5:43 PM Mehdi Amini <mehdi.amini at apple.com> wrote:

> On Feb 16, 2017, at 4:41 PM, Xinliang David Li via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> On Thu, Feb 16, 2017 at 3:45 PM, Chandler Carruth via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> First off, I just want to say wow and thank you. This kind of data is
> amazing. =D
>
> On Thu, Feb 16, 2017 at 2:46 AM Kristof Beyls <Kristof.Beyls at arm.com>
> wrote:
>
> The biggest relative code size increases indeed didn't happen for the
> biggest programs, but instead for a few programs weighing in at about 100KB.
> I'm assuming the Google benchmark set covers much bigger programs than the
> ones displayed here.
> FWIW, the cluster of programs where code size increases by between 60% and
> 80%, at a size of about 100KB, all come from MultiSource/Benchmarks/TSVC.
> Interestingly, these programs seem to have float and double variants, e.g.
> (MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt and
> MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl), and the code size
> bloat only happens for the double variants.
>
>
> I think we should definitely look at this (as it seems likely to be a bug
> somewhere), but I'm also not overly concerned with size regressions in the
> TSVC benchmarks, which are unusually loop heavy and small. We've had
> several other changes that caused big fluctuations here.
>
>
>
> I think it may still be worthwhile to check if this also happens on other
> architectures, and why it happens only for the double-variants, not the
> float-variants.
>
>
> +1
>
> The second chart shows relative code size increase (vertical axis) vs
> relative performance improvement (horizontal axis):
> I manually checked the cause of the 3 biggest performance regressions
> (proprietary benchmark1: -13.70%;
> MultiSource/Applications/hexxagon/hexxagon: -10.10%;
> MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow -5.23%).
> For the proprietary benchmark and hexxagon, the code generation didn't
> change for the hottest parts, so the regression is probably caused by
> micro-architectural effects of code layout changes.
>
>
> This is always good to know, even though it is frustrating. =]
>
>
> For fourinarow, there seemed to be a lot more spill/fill code, so it is
> probably due to suboptimal register allocation.
>
>
> This is something we should probably look at. If you have the output lying
> around, maybe file a PR about it?
>
> The third chart below just zooms in on the above chart to the -5% to 5%
> performance improvement range:
> <unroll_codesize_vs_performance_zoom.png>
>
>
> Whether to enable the increase in unroll threshold only at O3 or also at
> O2: I don't have a strong opinion based on the above data.
>
>
> FWIW, this data seems to clearly indicate that we don't get performance
> wins with any consistency when the code size goes up (and thus the change
> has impact). As a consequence, I pretty strongly suspect that this should
> be *just* used at O3 at least for now.
>
>
> The correlation is there -- when there is performance improvement, there
> is size increase.
>
>
> I didn’t quite get this impression from the graph; the highest improvement
> didn’t come with a code size increase:
>
>
>
>
> And on the other hand there were many code-size increases without any
> runtime improvement.
>

The zoomed-in graph indeed does not show the trend, but the original graph
does 😀

>
>
> The opposite is not true -- but that is expected. If the speedup is on a
> cold path, there won't be a visible performance improvement, only a size
> increase.
>
> Put another way: if we reduced the threshold, there would be a sizable size
> improvement for many benchmarks without regressing performance. Should we
> use the reduced threshold for O2 instead?
>
>
> Yes, all the ones here IIUC:
>
>
>
>
> However, we could probably argue that these “small” benchmarks should use
> -Os if they're sensitive to size, so O2 would be fine with the more
> aggressive threshold (as larger programs aren’t affected).
>
> With a good heuristic we’d have every dot forming a straight line
> code_size_increase = m * runtime_perf (with m as small as possible). The
> current lack of shape (or the exact opposite distribution to the ideal I
> imagine above) seems to show that our “profitability” heuristics are pretty
> bad and that the current threshold knob is a bad predictor of runtime
> performance.
>

To get that, profile data is needed. Downstream component bugs also make
it hard to achieve.
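
As a rough illustration of the ideal shape described above
(code_size_increase = m * runtime_perf), the sketch below fits a line through
the origin to per-benchmark deltas and reports how linear the points are. The
function names and the sample values are placeholders made up for
illustration; they are not measurements from this thread.

```python
from math import sqrt


def slope_through_origin(perf, size):
    """Least-squares m for size = m * perf, with the line forced through 0."""
    denom = sum(p * p for p in perf)
    if denom == 0:
        raise ValueError("all performance deltas are zero")
    return sum(p * s for p, s in zip(perf, size)) / denom


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Placeholder data (NOT from the thread): percentage deltas per benchmark.
perf_gain_pct = [0.0, 1.0, 2.0, 4.0]
size_growth_pct = [0.0, 0.5, 1.0, 2.0]

m = slope_through_origin(perf_gain_pct, size_growth_pct)
r = pearson(perf_gain_pct, size_growth_pct)
print(f"m = {m:.3f}, r = {r:.3f}")
```

A small m with r close to 1 would correspond to the ideal case; a scatter
with no shape would show up as r close to 0.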

David

>
> —
> Mehdi
>
>
>
> It is usually tiny programs whose size is sensitive to this change. The
> size vs size increase chart confirms that point. There is basically no
> large size increase for programs > 1MB (clang release build size is 78M).
> In other words, I believe the actual size impact on real world applications
> should be negligible. This behavior is very different from, for instance,
> increasing the inline threshold, which has a size impact across the board.
> The latter is certainly more limited to higher optimization levels.
>
> thanks,
>
> David
>
>
>
>
>
>
> I see two further directions for Dehao that make sense here (at least to
> me):
> 1) I suspect we should investigate *why* the size increases are happening
> without helping speed. I can imagine some reasons that this would of course
> happen (cold loops getting unrolled), but especially in light of the
> oddities you point out above, I suspect there may be issues where more
> unrolling is uncovering other problems and if we fix those other problems
> the shape of things will be different. We should at least address the
> issues you uncovered above.
>
> 2) If this turns out to be architecture specific (it seems that way at
> least initially, but hard to tell for sure with different benchmark sets)
> we might make AArch64 and x86 use different thresholds here. I'm skeptical
> about this though. I suspect we should do #1, and we'll either get a
> different shape, or just decide that O3 is more appropriate.
>
>
> Maybe the compile time impact is what should be driving that discussion
> the most? I'm afraid I don't have compile time numbers.
>
>
> FWIW, I strongly suspect that for *this* change, compile time and code
> size will be pretty precisely correlated. Dehao's data shows that to be
> true in several cases certainly.
>
>
> Ultimately, I guess this boils down to what exactly the difference is in
> intent between O2 and O3, which seems like a never-ending discussion...
>
>
> The definitions I am working from are here:
>
> https://github.com/llvm-project/llvm-project/blob/master/llvm/include/llvm/Passes/PassBuilder.h#L81-L90
>
> I've highlighted the part that makes me think O3 is better here: the code
> size increases (and thus compile time increases) don't seem to correspond
> to runtime improvements.
>
>
>
> Hoping you find this useful,
>
>
> Very. Once again, this kind of data and analysis is awesome. =D
>
>
> Kristof
>
>
> On Tue, Feb 14, 2017 at 1:06 PM Kristof Beyls via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> I've run the patch on https://reviews.llvm.org/D28368 on the test-suite
> and other benchmarks, for AArch64 -O3 -fomit-frame-pointer, both for
> Cortex-A53 and Cortex-A57.
>
> The geomean over the few hundred programs in there is roughly the same for
> Cortex-A53 and Cortex-A57: a bit over 1% improvement in execution speed for
> a bit over 5% increase in code size.
> Obviously I wouldn't want this for optimization levels where code size is
> of any concern, like -Os or -Oz, but don't have a problem with this going
> in for other optimization levels where this isn't a concern.
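
For readers unfamiliar with the summary statistic quoted above, here is a
minimal sketch of a geometric mean over per-program ratios (new value divided
by baseline value). The helper name and the sample ratios are hypothetical;
the actual scripts used to produce the numbers in this thread are not shown
here.

```python
from math import exp, log


def geomean(ratios):
    """Geometric mean of strictly positive ratios (new / baseline)."""
    if not ratios:
        raise ValueError("need at least one ratio")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be strictly positive")
    # Averaging in log space avoids overflow when multiplying many ratios.
    return exp(sum(log(r) for r in ratios) / len(ratios))


# Placeholder ratios (NOT measurements): one program twice as fast, one
# twice as slow, one unchanged. The geomean treats these symmetrically.
exec_time_ratios = [0.5, 2.0, 1.0]
summary = geomean(exec_time_ratios)
print(f"geomean execution time ratio: {summary:.3f}")
```

Unlike an arithmetic mean, the geometric mean is not dominated by a few
programs with large ratios, which is why it is commonly used to summarize
benchmark suites.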
>
> Thanks,
>
> Kristof
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2017-02-16 at 5.35.53 PM.png
Type: image/png
Size: 19584 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170217/53f51135/attachment-0001.png>