[PATCH] D28368: Give higher full-unroll boosting when the loop iteration is small.

Fri Jan 6 13:38:35 PST 2017

danielcdh added a comment.

In https://reviews.llvm.org/D28368#638153, @mzolotukhin wrote:

> > Agree that we need a more accurate model, but the problem is that even if the model is 100% accurate, a linear boosting factor cannot raise the threshold enough for our case.
>
> One problem with the current cost model is that it's used for estimating both code size and runtime performance. It might be worth checking if we can gain anything from separating these two metrics more clearly - I think it was discussed in the past, but no decision has been made.

The relationship between code size and runtime performance differs between the different unrollers.

In dynamic and partial unrolling, performance initially improves as code size increases (because dynamic branches are reduced), but past a certain point it starts to degrade as code size grows further (due to more i-cache misses, the loop body no longer fitting into the LSD, etc.). So a fixed threshold is usually helpful for finding the performance sweet spot.

In full unrolling, a loop that can be fully unrolled is unlikely to trigger the LSD (not enough trip count), nor will unrolling it affect i-cache misses: a fully unrolled loop is straight-line code with no temporal locality, and even if it is embedded in an outer loop, the backedge of the outer loop should be easy to predict correctly. So if we assume all backend optimizations are sane (e.g. SLP performs as well as the loop vectorizer, RA does a good job in large BBs, etc.), larger code size should always lead to better performance for full unrolling. The threshold here therefore purely limits the size of the text.

If the above analysis is reasonable, then the two types of unroller probably should not share the same threshold, and the full unroller might be better off with a larger one?
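
To make the idea concrete, here is a rough sketch of what separate budgets could look like. It is only an illustration under my own assumptions: the names `UnrollKind`, `UnrollLimits`, `unrolledSize` and `fitsBudget` are hypothetical and are not the unroller's actual interface, and the size estimate is a plain product rather than the real cost model.

```cpp
#include <cstdint>

// Hypothetical names throughout; this is not LLVM's actual unroller interface.
enum class UnrollKind { Full, Partial, Runtime };

struct UnrollLimits {
  std::uint64_t FullThreshold;    // only bounds the size of the text
  std::uint64_t PartialThreshold; // approximates the performance sweet spot
};

// Simplified size estimate: loop body size times the unroll count.
inline std::uint64_t unrolledSize(std::uint64_t LoopSize, std::uint64_t Count) {
  return LoopSize * Count;
}

// Full unrolling is checked against its own (possibly larger) budget;
// partial and runtime unrolling share the tighter one.
inline bool fitsBudget(UnrollKind Kind, std::uint64_t LoopSize,
                       std::uint64_t Count, const UnrollLimits &Limits) {
  const std::uint64_t Limit = Kind == UnrollKind::Full
                                  ? Limits.FullThreshold
                                  : Limits.PartialThreshold;
  return unrolledSize(LoopSize, Count) <= Limit;
}
```

With arbitrary example limits of 400 and 150, a body of size 30 unrolled 8 times would be accepted as a full unroll but rejected as a partial one.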

> 
> 
>> I agree a profile can help get a good balance here, but an O2 build cannot benefit from it.
> 
> There are always cases where we generate sub-optimal code. For users striving for the utmost performance we provide higher optimization levels (+LTO, +PGO) and pragmas. We cannot just bump thresholds for every case we want to unroll/inline/whatever.

Sounds reasonable. How about we bump the threshold at O3, so that people who do not have a profile can still choose to fully unroll more aggressively?

Thanks,
Dehao

> 
> 
>> Sorry, I meant the profile I proposed in this patch.
> 
> Adding `Constant/TripCount` looks like simply bumping the threshold to me, except it also adds complexity to the code, so I'm not convinced we want this.
> 
> Michael

https://reviews.llvm.org/D28368