[PATCH] D28368: Give higher full-unroll boosting when the loop iteration is small.

Tue Jan 10 14:51:54 PST 2017

danielcdh added a comment.

In https://reviews.llvm.org/D28368#641747, @mzolotukhin wrote:

> > In full unrolling, if the loop can be fully unrolled, it is unlikely to trigger the LSD (not enough trip count), nor will it affect icache misses (a fully unrolled loop is straight-line code with no temporal locality; even if it's embedded in an outer loop, the backedge of the outer loop should be easy to predict). So if we assume all backend optimizations are sane (e.g. SLP performs as well as the loop vectorizer, RA does a good job in large BBs, etc.), larger code size should always lead to better performance for full unrolling. So a threshold here is purely limiting the size of the text.
>
> This is not exactly true in practice. If we just bump up the threshold, we'll see both performance improvements and regressions.

From our limited experiments, bumping up the full-unroll threshold by 2X only improves performance, for both speccpu and our internal benchmarks. If we boost it by 10X, we do see perf regressions on some coding/decoding benchmarks. We root-caused the problem: SLP cannot vectorize fully unrolled code while the loop vectorizer can. @mkuper is working on SLP to solve it. Other than that, even boosting the threshold by 10X appears to be a pure win for performance.
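
To make the "2X" concrete, here is a minimal sketch of the kind of size check a full-unroll threshold gates. It is not the actual LoopUnrollPass code: the struct, the function name, and the numbers are hypothetical, and the real cost model is richer than trip count times body size.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified description of a loop with a known trip count.
struct LoopSketch {
  uint64_t TripCount; // compile-time constant trip count
  uint64_t BodySize;  // estimated size of one iteration, in abstract cost units
};

// Returns true if the fully unrolled body fits under the size threshold.
// Assumption: unrolled size is modeled as TripCount * BodySize; the real
// unroller also accounts for instructions that simplify away after unrolling.
bool fitsFullUnrollThreshold(const LoopSketch &L, uint64_t Threshold) {
  assert(Threshold > 0 && "threshold must be positive");
  if (L.TripCount == 0)
    return false; // nothing to unroll
  uint64_t UnrolledSize = L.TripCount * L.BodySize;
  return UnrolledSize <= Threshold;
}
```

With an illustrative base threshold of 150 units, a loop with 16 iterations of 12 units each (192 units) is rejected, while the same loop fits once the threshold is doubled to 300. That is the only effect of the boost in this model: more loops pass the size check, at the cost of larger text.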

Could you point us to the benchmarks where you observed regressions after boosting the full-unroll threshold? We would be happy to take a look, learn why performance gets worse, and possibly improve it. Thanks!

> 
> 
>> I think probably two types of unroller should not share the same threshold?
> 
> This makes sense. However, I prefer not to bloat our army of thresholds without a guaranteed benefit.

What do you mean by "guaranteed benefit"?

If it means "positive speedup with no code size/compile time increase", this seems impossible, since any threshold boost will increase code size.

If it means "positive speedup" only, it seems to be already satisfied.

> 
> 
>> How about we bump the threshold in O3, so that people who do not have profiler can still choose to fully unroll more aggressively?
> 
> For the change like this please submit a separate patch and include as much testing data as you can (including but not limited to SPEC, LLVM-testsuite, etc.). Please include runtime performance, compile time, and binary sizes.

I'll send out a new patch for this if we decide to put this in O3. In the meantime, I collected more performance data:

- Updated the data: removed the trip count logic and merely boosted the fully unroll tripcount by 2X.

| benchmark   | code size | compile time | performance |
| 447.dealII  | 0.52%     | -0.24%       | -0.94%      |
| 453.povray  | 0.45%     | -0.65%       | 3.00%       |
| 433.milc    | 0.20%     | 2.01%        | 0.47%       |
| 445.gobmk   | 0.32%     | -1.12%       | 0.32%       |
| 403.gcc     | 0.05%     | 0.58%        | 0.25%       |
| 464.h264ref | 4.04%     | 4.62%        | 0.28%       |

- Built the LLVM test-suite with and without the change; it only affects the following 3 binaries. No noticeable compile-time or run-time change was observed.

| binary                                                   | code size change |
| CMakeFiles/CheckTypeSize/CMAKE_SIZEOF_UNSIGNED_SHORT.bin | 0.1%             |
| CMakeFiles/feature_tests.bin                             | 0%               |
| CMakeFiles/TestEndianess.bin                             | 0.1%             |

Thanks,
Dehao

> Thanks,
> Michael

https://reviews.llvm.org/D28368