[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

Fri Feb 10 15:23:22 PST 2017

On 02/10/2017 05:21 PM, Dehao Chen wrote:
> Thanks everyone for the comments.
>
> Do we have a decision here?

You're good to go as far as I'm concerned.

  -Hal

>
> Dehao
>
> On Tue, Feb 7, 2017 at 10:24 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
>
>     On 02/07/2017 05:29 PM, Sanjay Patel via llvm-dev wrote:
>>     Sorry if I missed it, but what machine/CPU are you using to
>>     collect the perf numbers?
>>
>>     I am concerned that what may be a win on a CPU that keeps a
>>     couple of hundred instructions in-flight and has many MB of
>>     caches will not hold for a small core.
>
>     In my experience, unrolling tends to help weaker cores even more
>     than stronger ones because it allows the instruction scheduler
>     more opportunities to hide latency. Obviously, instruction-cache
>     pressure is an important consideration, but the code size changes
>     here seem small.
>
>>
>>     Is the proposed change universal? Is there a way to undo it?
>
>     All of the unrolling thresholds should be target-adjustable using
>     the TTI::getUnrollingPreferences hook.
>
>      -Hal
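To illustrate what "target-adjustable" means here, the following is a small Python model of the idea, not LLVM code. The real hook is the C++ TTI::getUnrollingPreferences named above; the class and field names below (UnrollingPreferences, full_unroll_threshold, SmallCoreTarget) are hypothetical stand-ins, and the numbers 150 and 300 are simply the values discussed in this thread.

```python
from dataclasses import dataclass


@dataclass
class UnrollingPreferences:
    # Hypothetical fields for illustration only.
    # 150 is the existing threshold mentioned in the thread,
    # 300 is the proposed full-unroll threshold.
    partial_threshold: int = 150
    full_unroll_threshold: int = 300


class GenericTarget:
    """Stands in for a target that accepts the proposed defaults."""

    def get_unrolling_preferences(self) -> UnrollingPreferences:
        return UnrollingPreferences()


class SmallCoreTarget(GenericTarget):
    """Hypothetical target that opts out of the higher full-unroll threshold."""

    def get_unrolling_preferences(self) -> UnrollingPreferences:
        prefs = super().get_unrolling_preferences()
        # Undo the proposed change for this target only.
        prefs.full_unroll_threshold = 150
        return prefs
```

In this model a target that is worried about instruction-cache pressure restores the old value locally, while every other target picks up the new default.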
>
>
>>
>>     On Tue, Feb 7, 2017 at 3:26 PM, Dehao Chen via llvm-dev
>>     <llvm-dev at lists.llvm.org> wrote:
>>
>>         Ping... with the updated code size impact data, any more
>>         comments? Any more data that would be interesting to collect?
>>
>>         Thanks,
>>         Dehao
>>
>>         On Thu, Feb 2, 2017 at 2:07 PM, Dehao Chen <dehao at google.com>
>>         wrote:
>>
>>             Here is the code size impact for clang, chrome and 24
>>             google internal benchmarks (names omitted; 14, 15 and 16
>>             are encoding/decoding benchmarks similar to h264). There
>>             are 2 columns, for thresholds 300 and 450 respectively.
>>
>>             I also tested the llvm test suite. Changing the threshold
>>             to 300/450 does not affect code gen for any binary in the
>>             test suite.
>>
>>
>>
>>             benchmark   300      450
>>             clang       0.30%    0.63%
>>             chrome      0.00%    0.00%
>>             1           0.27%    0.67%
>>             2           0.44%    0.93%
>>             3           0.44%    0.93%
>>             4           0.26%    0.53%
>>             5           0.74%    2.21%
>>             6           0.74%    2.21%
>>             7           0.74%    2.21%
>>             8           0.46%    1.05%
>>             9           0.35%    0.86%
>>             10          0.35%    0.86%
>>             11          0.40%    0.83%
>>             12          0.32%    0.65%
>>             13          0.31%    0.64%
>>             14          4.52%    8.23%
>>             15          9.90%    19.38%
>>             16          9.90%    19.38%
>>             17          0.68%    1.97%
>>             18          0.21%    0.48%
>>             19          0.99%    3.44%
>>             20          0.19%    0.46%
>>             21          0.57%    1.62%
>>             22          0.37%    1.05%
>>             23          0.78%    1.30%
>>             24          0.51%    1.54%
>>
>>
>>             On Wed, Feb 1, 2017 at 6:08 PM, Mikhail Zolotukhin via
>>             llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>>>                 On Feb 1, 2017, at 4:57 PM, Xinliang David Li via
>>>                 llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>
>>>                 clang, chrome, and some internal large apps are good
>>>                 candidates for size metrics.
>>                 I'd also add the standard LLVM testsuite just because
>>                 it's the suite everyone in the community can use.
>>
>>                 Michael
>>>
>>>                 David
>>>
>>>                 On Wed, Feb 1, 2017 at 4:47 PM, Chandler Carruth via
>>>                 llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>
>>>                     I had suggested having size metrics from
>>>                     somewhat larger applications such as Chrome,
>>>                     Webkit, or Firefox; clang itself; and maybe some
>>>                     of our internal binaries with rough size brackets?
>>>
>>>                     On Wed, Feb 1, 2017 at 4:33 PM Dehao Chen
>>>                     <dehao at google.com> wrote:
>>>
>>>                         With the new data points, any comments on
>>>                         whether this can justify setting the full
>>>                         unroll threshold to 300 (or any other
>>>                         number) in O2? I can collect more data
>>>                         points if it's helpful.
>>>
>>>                         Thanks,
>>>                         Dehao
>>>
>>>                         On Tue, Jan 31, 2017 at 3:20 PM, Dehao Chen
>>>                         <dehao at google.com> wrote:
>>>
>>>                             Recollected the data from trunk head
>>>                             with stddev data and more threshold data
>>>                             points attached:
>>>
>>>                             Performance:
>>>
>>>                             benchmark   stddev/mean   300      450      600      750
>>>                             403         0.37%         0.11%    0.11%    0.09%    0.79%
>>>                             433         0.14%         0.51%    0.25%    -0.63%   -0.29%
>>>                             445         0.08%         0.48%    0.89%    0.12%    0.83%
>>>                             447         0.16%         3.50%    2.69%    3.66%    3.59%
>>>                             453         0.11%         1.49%    0.45%    -0.07%   0.78%
>>>                             464         0.17%         0.75%    1.80%    1.86%    1.54%
>>>
>>>
>>>                             Code size:
>>>
>>>                             benchmark   300      450      600      750
>>>                             403         0.56%    2.41%    2.74%    3.75%
>>>                             433         0.96%    2.84%    4.19%    4.87%
>>>                             445         2.16%    3.62%    4.48%    5.88%
>>>                             447         2.96%    5.09%    6.74%    8.89%
>>>                             453         0.94%    1.67%    2.73%    2.96%
>>>                             464         8.02%    13.50%   20.51%   26.59%
>>>
>>>
>>>                             Compile time is proportional in the
>>>                             experiments and more noisy, so I did not
>>>                             include it.
>>>
>>>                             We have >2% speedup on some google
>>>                             internal benchmarks when switching the
>>>                             threshold from 150 to 300.
>>>
>>>                             Dehao
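One way to read the two tables above side by side is to average each column. The sketch below does only that arithmetic; it assumes the performance figures are speedups (positive means faster) and that a plain arithmetic mean over the six benchmarks is an acceptable summary, neither of which the thread states explicitly. The names PERF, SIZE and summarize are made up for this example.

```python
# Percent changes copied from the tables above, one list per threshold,
# in benchmark order 403, 433, 445, 447, 453, 464.
PERF = {
    300: [0.11, 0.51, 0.48, 3.50, 1.49, 0.75],
    450: [0.11, 0.25, 0.89, 2.69, 0.45, 1.80],
    600: [0.09, -0.63, 0.12, 3.66, -0.07, 1.86],
    750: [0.79, -0.29, 0.83, 3.59, 0.78, 1.54],
}
SIZE = {
    300: [0.56, 0.96, 2.16, 2.96, 0.94, 8.02],
    450: [2.41, 2.84, 3.62, 5.09, 1.67, 13.50],
    600: [2.74, 4.19, 4.48, 6.74, 2.73, 20.51],
    750: [3.75, 4.87, 5.88, 8.89, 2.96, 26.59],
}


def mean(values):
    """Arithmetic mean of a non-empty list of percentages."""
    return sum(values) / len(values)


def summarize():
    """Map each threshold to (mean perf change %, mean code size change %)."""
    return {t: (mean(PERF[t]), mean(SIZE[t])) for t in sorted(PERF)}


if __name__ == "__main__":
    for threshold, (perf, size) in summarize().items():
        print(f"threshold {threshold}: perf {perf:+.2f}%, size {size:+.2f}%")
```

Averaging hides the per-benchmark spread (464 dominates the code size column), so this is only a coarse summary of the posted data.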
>>>
>>>                             On Mon, Jan 30, 2017 at 5:06 PM,
>>>                             Chandler Carruth <chandlerc at google.com>
>>>                             wrote:
>>>
>>>                                 On Mon, Jan 30, 2017 at 4:59 PM
>>>                                 Mehdi Amini <mehdi.amini at apple.com>
>>>                                 wrote:
>>>
>>>>
>>>>
>>>>                                             Another question is
>>>>                                             about PGO integration:
>>>>                                             is it already hooked
>>>>                                             there? Should we have a
>>>>                                             more aggressive
>>>>                                             threshold in a hot
>>>>                                             function? (Assuming
>>>>                                             we’re willing to spend
>>>>                                             some binary size there
>>>>                                             but not on the cold path).
>>>>
>>>>
>>>>                                         I would even wire the
>>>>                                         *unrolling* the other way:
>>>>                                         just suppress unrolling in
>>>>                                         cold paths to save binary
>>>>                                         size. Rolled loops seem
>>>>                                         like a generally good thing
>>>>                                         in cold code unless they
>>>>                                         are having some larger
>>>>                                         impact (i.e., the loop itself
>>>>                                         is more expensive than the
>>>>                                         unrolled form).
>>>>
>>>>
>>>>
>>>>                                     Agreed that we could suppress
>>>>                                     unrolling in cold paths to save
>>>>                                     code size. But that's
>>>>                                     orthogonal to the proposal
>>>>                                     here. This proposal focuses on
>>>>                                     O2 performance: should we have a
>>>>                                     different (higher) full-unroll
>>>>                                     threshold than for
>>>>                                     dynamic/partial unrolling?
>>>
>>>                                     I agree that this is (to some
>>>                                     extent) orthogonal, and it makes
>>>                                     sense to me to differentiate the
>>>                                     threshold for full unroll and
>>>                                     the dynamic/partial case.
>>>
>>>
>>>                                 There is one issue that makes these
>>>                                 not orthogonal.
>>>
>>>                                 If even *static* profile hints will
>>>                                 reduce some of the code size
>>>                                 increase caused by higher unrolling
>>>                                 thresholds for non-cold code, we
>>>                                 should factor that into the tradeoff
>>>                                 in picking where the threshold goes.
>>>
>>>                                 However, getting PGO into the full
>>>                                 unroller is currently challenging
>>>                                 outside of the new pass manager. We
>>>                                 already have some unfortunate hacks
>>>                                 around this in LoopUnswitch that are
>>>                                 making the port of it to the new PM
>>>                                 more annoying.
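The interaction discussed above (a separate, higher threshold for full unrolling, plus suppressing unrolling in cold code) can be sketched as a toy decision function. This is a simplified model and not how LLVM's unroller estimates cost: choose_unroll, its parameters, and the "body size times trip count" size estimate are all assumptions made for illustration, with 300 and 150 taken from the thread.

```python
def choose_unroll(body_size, trip_count=None, is_cold=False,
                  full_threshold=300, partial_threshold=150):
    """Return (kind, factor) for a loop in a toy cost model.

    body_size  -- estimated size of one iteration (must be > 0)
    trip_count -- compile-time trip count, or None if unknown
    is_cold    -- True if profile data marks the loop as cold
    """
    if body_size <= 0:
        raise ValueError("body_size must be positive")

    # Cold code: keep the loop rolled to save binary size.
    if is_cold:
        return ("none", 1)

    # Full unroll only applies when the trip count is known and the
    # fully unrolled body fits under the (higher) full-unroll threshold.
    if trip_count is not None and body_size * trip_count <= full_threshold:
        return ("full", trip_count)

    # Otherwise fall back to partial unrolling under the lower threshold.
    factor = partial_threshold // body_size
    if trip_count is not None:
        factor = min(factor, trip_count)
    if factor >= 2:
        return ("partial", factor)
    return ("none", 1)
```

For example, a 20-unit loop body with a known trip count of 12 is fully unrolled at a threshold of 300 but only partially unrolled if the full threshold is left at 150, and is left alone entirely once it is marked cold.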
>>>
>>>
>>>
>>>
>>>                     _______________________________________________
>>>                     LLVM Developers mailing list
>>>                     llvm-dev at lists.llvm.org
>>>                     <mailto:llvm-dev at lists.llvm.org>
>>>                     http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>                     <http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
>>>
>>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>     -- 
>     Hal Finkel
>     Lead, Compiler Technology and Programming Languages
>     Leadership Computing Facility
>     Argonne National Laboratory
>
-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory