[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

Chandler Carruth via llvm-dev llvm-dev at lists.llvm.org
Thu Feb 16 19:57:40 PST 2017


On Thu, Feb 16, 2017 at 7:06 PM Xinliang David Li <xinliangli at gmail.com>
wrote:

> On Thu, Feb 16, 2017 at 5:43 PM Mehdi Amini <mehdi.amini at apple.com> wrote:
>
> On Feb 16, 2017, at 4:41 PM, Xinliang David Li via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> On Thu, Feb 16, 2017 at 3:45 PM, Chandler Carruth via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> First off, I just want to say wow and thank you. This kind of data is
> amazing. =D
>
> On Thu, Feb 16, 2017 at 2:46 AM Kristof Beyls <Kristof.Beyls at arm.com>
> wrote:
>
> The biggest relative code size increases indeed didn't happen for the
> biggest programs, but instead for a few programs weighing in at about 100KB.
> I'm assuming the Google benchmark set covers much bigger programs than the
> ones displayed here.
> FWIW, the cluster of programs where code size increases between 60% to 80%
> with a size of about 100KB, all come from MultiSource/Benchmarks/TSVC.
> Interestingly, these programs seem to have float and double variants,  e.g.
> (MultiSource/Benchmarks/TSVC/Searching-flt/Searching-flt and
> MultiSource/Benchmarks/TSVC/Searching-dbl/Searching-dbl), and the code size
> bloat only happens for the double variants.
>
>
> I think we should definitely look at this (as it seems likely to be a bug
> somewhere), but I'm also not overly concerned with size regressions in the
> TSVC benchmarks, which are unusually loop-heavy and small. We've had
> several other changes that caused big fluctuations here.
>
>
>
> I think it may still be worthwhile to check if this also happens on other
> architectures, and why it happens only for the double-variants, not the
> float-variants.
>
>
> +1
>
> The second chart shows relative code size increase (vertical axis) vs
> relative performance improvement (horizontal axis):
> I manually checked the cause of the 3 biggest performance regressions
> (proprietary benchmark1: -13.70%;
> MultiSource/Applications/hexxagon/hexxagon: -10.10%;
> MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow -5.23%).
> For the proprietary benchmark and hexxagon, the code generation didn't
> change for the hottest parts, so the regression is probably caused by
> micro-architectural effects of code layout changes.
>
>
> This is always good to know, even though it is frustrating. =]
>
>
> For fourinarow, there seemed to be a lot more spill/fill code, so the
> regression is probably due to suboptimal register allocation.
>
>
> This is something we should probably look at. If you have the output lying
> around, maybe file a PR about it?
>
> The third chart below just zooms in on the above chart to the -5% to 5%
> performance improvement range:
> <unroll_codesize_vs_performance_zoom.png>
>
>
> Whether to enable the increase in unroll threshold only at O3 or also at
> O2: I don't have a strong opinion based on the above data.
>
>
> FWIW, this data seems to clearly indicate that we don't get performance
> wins with any consistency when the code size goes up (and thus when the
> change has impact). As a consequence, I pretty strongly suspect that this
> should be used *just* at O3, at least for now.
>
>
> The correlation is there -- when there is performance improvement, there
> is size increase.
>
>
> I didn’t quite get this impression from the graph: the highest improvement
> didn’t come with a code size increase:
>
>
>
>
> And on the other hand there were many code-size increases without any
> runtime improvement.
>
>
>
> The zoomed-in graph indeed does not show the trend, but the original
> graph does 😀
>

I kind of agree with how Mehdi is interpreting this data. =]
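
For what it's worth, the disagreement here could be settled numerically rather than by eyeballing the scatter plot. A minimal sketch of how one might quantify it (the benchmark numbers below are made up for illustration; they are not Kristof's data, and the helper name is mine):

```python
# Sketch: quantify the size-vs-performance relationship being debated.
# The (size increase %, perf improvement %) pairs are hypothetical.

def slope_and_correlation(size_deltas, perf_deltas):
    """Least-squares slope m (size = m * perf) and Pearson r."""
    n = len(size_deltas)
    mean_s = sum(size_deltas) / n
    mean_p = sum(perf_deltas) / n
    cov = sum((s - mean_s) * (p - mean_p)
              for s, p in zip(size_deltas, perf_deltas)) / n
    var_p = sum((p - mean_p) ** 2 for p in perf_deltas) / n
    var_s = sum((s - mean_s) ** 2 for s in size_deltas) / n
    m = cov / var_p                           # slope of size on perf
    r = cov / (var_p ** 0.5 * var_s ** 0.5)   # Pearson correlation
    return m, r

# Hypothetical per-benchmark deltas, loosely shaped like the charts:
sizes = [60.0, 70.0, 2.0, 0.5, 30.0, 0.0]
perfs = [0.5, 1.0, 8.0, 0.0, 0.2, 4.0]

m, r = slope_and_correlation(sizes, perfs)
print(f"slope m = {m:.2f}, correlation r = {r:.2f}")
```

An r near zero (or negative) on the real data would support Mehdi's reading; an r near one with small m would support David's ideal straight line.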


>
>
> The opposite is not true -- but that is expected. If the speedup is in the
> cold path, there will be a size increase without any visible performance
> improvement.
>
> Put it another way: if we reduce the threshold, there will be a sizable
> size improvement for many benchmarks without regressing performance.
> Shall we use the reduced threshold for O2 instead?
>
>
> Yes, all the ones here IIUC:
>
>
>
>
> However, it is likely that these “small” benchmarks should use -Os if
> they're sensitive to size, so O2 would be fine with the more aggressive
> threshold (as larger programs aren’t affected).
>
> With a good heuristic, every dot would fall on a straight line
> code_size_increase = m * runtime_perf (with m as small as possible). The
> current lack of shape (or the exact opposite of the ideal distribution I
> imagine above) seems to show that our “profitability” heuristics are
> pretty bad and that the current threshold knob is a bad predictor of
> runtime performance.
>
>
> To get that, profile data is needed. Downstream component bugs also make
> it hard to achieve.
>

Sure. I think what I'm saying (don't want to put words in Mehdi's mouth) is
that until the bugs are fixed, this probably belongs at O3 rather than O2
based on this data.

FWIW, I suspect there might be some bug fixes that would change this even
without profile data, but it's of course impossible to know until the bugs
are analyzed.


I don't think this should be a show stopper; I think the change is still a
really good one at O3.
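
For anyone who wants to run this experiment on their own benchmarks, the knob under discussion is exposed as an LLVM cl::opt. A command-line sketch (flag names are the cl::opts I believe exist in trees of this era; check yours, and the threshold value 300 is just an example):

```shell
# Raise the unroll threshold for one compile via clang's -mllvm passthrough:
clang -O2 -mllvm -unroll-threshold=300 -c foo.c -o foo_unrolled.o

# Or apply it directly on IR with opt to inspect the unroller's output:
opt -O2 -unroll-threshold=300 -S foo.ll -o foo.unrolled.ll

# Compare section sizes against a baseline build:
llvm-size foo_baseline.o foo_unrolled.o
```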

>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170217/1b2c34c9/attachment.html>


More information about the llvm-dev mailing list