[llvm-dev] (RFC) Adjusting default loop fully unroll threshold

Dehao Chen via llvm-dev llvm-dev at lists.llvm.org
Mon Jan 30 16:56:13 PST 2017


On Mon, Jan 30, 2017 at 3:56 PM, Chandler Carruth <chandlerc at google.com>
wrote:

> On Mon, Jan 30, 2017 at 3:51 PM Mehdi Amini via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> On Jan 30, 2017, at 10:49 AM, Dehao Chen via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>> Currently, the loop full unroller shares the same default threshold as the
>> loop dynamic and partial unrollers. This seems conservative because,
>> unlike dynamic/partial unrolling, full unrolling will not affect
>> LSD/ICache performance. In https://reviews.llvm.org/D28368, I propose to
>> double the threshold for the loop full unroller. This changes the codegen
>> of several SPEC CPU benchmarks:
>>
>> Code size:
>> 447.dealII 0.50%
>> 453.povray 0.42%
>> 433.milc 0.20%
>> 445.gobmk 0.32%
>> 403.gcc 0.05%
>> 464.h264ref 3.62%
>>
>> Compile Time:
>> 447.dealII 0.22%
>> 453.povray -0.16%
>> 433.milc 0.09%
>> 445.gobmk -2.43%
>> 403.gcc 0.06%
>> 464.h264ref 3.21%
>>
>> Performance (on Intel Sandy Bridge):
>> 447.dealII +0.07%
>> 453.povray +1.79%
>> 433.milc +1.02%
>> 445.gobmk +0.56%
>> 403.gcc -0.16%
>> 464.h264ref -0.41%
>>
>>
>> Can you clarify how to read these numbers? (I usually use +xx% to
>> indicate a slowdown; it seems you’re doing the opposite?)
>>
>
Since these compare SPEC scores rather than run times, +xx% here means a
speedup and -xx% means a slowdown.
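To make the sign convention concrete (a rough illustration, not the exact
SPEC reporting methodology): a SPEC ratio is the reference time divided by
the measured run time, so a shorter run time gives a higher score.

    ratio = reference_time / measured_time
    e.g. measured time 100s -> 99s  =>  ratio rises by about 1%  =>  shown above as +1%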


>
>> So considering 464.h264ref, does it mean it is 3.2% slower to compile,
>> gets 3.6% larger, and runs 0.4% slower?
>>
>
That is correct. The 0.4% slowdown is in the run-to-run noise range.


>
>> Another question is about PGO integration: is it already hooked up there?
>> Should we use a more aggressive threshold in a hot function? (Assuming
>> we’re willing to spend some binary size there but not on the cold path.)
>>
>
> I would even wire *unrolling* up the other way: just suppress unrolling in
> cold paths to save binary size. Rolled loops seem like a generally good
> thing in cold code unless they are having some larger impact (i.e., the
> loop itself is more expensive than the unrolled form).
>


Agreed that we could suppress unrolling on cold paths to save code size, but
that is orthogonal to the proposal here. This proposal focuses on O2
performance: should the full unroller have a different (higher) threshold
than the dynamic/partial unrollers? (A small example of the distinction
follows below.)
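To make the distinction concrete, here is a minimal sketch (the function
names, loop bounds, and flag value below are illustrative, not taken from
the patch): the first loop has a compile-time-constant trip count, so it is
a candidate for full unrolling and the loop can disappear entirely; the
second has a runtime trip count, so only dynamic/partial unrolling can
apply and the existing, lower threshold still governs it.

  // Constant trip count: eligible for full unrolling. If the estimated
  // unrolled size fits under the full-unroll threshold, no loop remains.
  int sum_fixed(const int *a) {
    int s = 0;
    for (int i = 0; i < 32; ++i)
      s += a[i];
    return s;
  }

  // Runtime trip count: full unrolling cannot apply; dynamic/partial
  // unrolling (with its own, unchanged threshold) handles this one.
  int sum_var(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i)
      s += a[i];
    return s;
  }

With something like clang -O2 -mllvm -unroll-threshold=N -S -emit-llvm, one
can watch sum_fixed lose its loop once N is large enough, while sum_var
keeps a rolled (or partially unrolled) loop.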

We can have a separate patch to further boost the threshold for hot loops and
suppress unrolling for cold loops. One concern is that, in order to check
whether a loop is hot or cold, we will need BFI (BlockFrequencyInfo) in the
loop pass. In the legacy loop pass manager, requiring it inserts a function
pass in the middle of a series of loop passes (see the sketch below).
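To illustrate that concern, here is a rough sketch (a hypothetical pass,
shown only to make the dependency visible; registration boilerplate
omitted):

  #include "llvm/Analysis/BlockFrequencyInfo.h"
  #include "llvm/Analysis/LoopPass.h"
  using namespace llvm;

  namespace {
  // Hypothetical legacy loop pass that wants to know if the loop is hot.
  struct ProfileAwareUnroll : public LoopPass {
    static char ID;
    ProfileAwareUnroll() : LoopPass(ID) {}

    void getAnalysisUsage(AnalysisUsage &AU) const override {
      // BFI is a function-level analysis. Requiring it from a loop pass is
      // what forces a function pass to be scheduled in the middle of the
      // loop pass pipeline under the legacy pass manager.
      AU.addRequired<BlockFrequencyInfoWrapperPass>();
    }

    bool runOnLoop(Loop *L, LPPassManager &) override {
      BlockFrequencyInfo &BFI =
          getAnalysis<BlockFrequencyInfoWrapperPass>().getBFI();
      BlockFrequency HeaderFreq = BFI.getBlockFreq(L->getHeader());
      (void)HeaderFreq; // decide hot/cold and pick an unroll threshold here
      return false;     // analysis only in this sketch; no IR changes
    }
  };
  }
  char ProfileAwareUnroll::ID = 0;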

Dehao


>
>
>>
>> Thanks,
>>
>> Mehdi
>>
>>
>> Looks like the change has an overall positive performance impact with very
>> small code size/compile time overhead. Now the question is: shall we make
>> this change the default at O2, or shall we leave it to O3? We would like
>> more input from the community to make the decision.
>>
>> Thanks,
>>
>> Dehao
>

