[PATCH] D28368: Give higher full-unroll boosting when the loop iteration is small.

Fri Jan 6 12:09:00 PST 2017

danielcdh added a comment.

In https://reviews.llvm.org/D28368#637730, @mzolotukhin wrote:

> > I compared the profile between our internal benchmark and 464.h264ref, the only difference is that the fully unrolled loop showed high up in the profile of our benchmark, while the fully unrolled loop is cold in 464.h264ref, thus it has no performance impact.
>
> Could you tell what exactly happens to the loop after unrolling? Do we get the performance improvement from just removing branches, or does unrolling enable later optimizations (if so, which ones)?

It comes from fewer branches as well as less loop preparation code: dynamic instructions dropped from 179 to 167. The unroll size analysis already captures this (boosting = rolled_cost/unroll_cost = 179/167). However, for that specific case, we need a threshold of ~200 to make the full unroll happen.
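
To make the arithmetic concrete, here is a rough sketch of how a ratio-based boost scales a threshold. It is only an illustration, not the actual LoopUnrollPass code: the function names, the base threshold of 150, and the 400% cap are assumptions for the example. Only the 179/167 costs and the ~200 target come from the numbers above.

```python
# Hypothetical helpers for illustration; not LLVM's real implementation.

def boost_percent(rolled_cost, unrolled_cost, max_boost_percent=400):
    """Boost as an integer percentage: rolled_cost / unrolled_cost, capped.

    max_boost_percent is an assumed cap, not taken from the review.
    """
    if unrolled_cost <= 0:
        return max_boost_percent
    return min(rolled_cost * 100 // unrolled_cost, max_boost_percent)


def boosted_threshold(base_threshold, rolled_cost, unrolled_cost):
    """Scale the base threshold by the boost percentage."""
    return base_threshold * boost_percent(rolled_cost, unrolled_cost) // 100


ROLLED_COST = 179      # dynamic instructions before unrolling (from the comment)
UNROLLED_COST = 167    # dynamic instructions after unrolling (from the comment)
BASE_THRESHOLD = 150   # assumed base threshold, for illustration only
NEEDED_THRESHOLD = 200 # approximate threshold needed for this loop (from the comment)

boost = boost_percent(ROLLED_COST, UNROLLED_COST)
effective = boosted_threshold(BASE_THRESHOLD, ROLLED_COST, UNROLLED_COST)

print(f"boost = {boost}%, effective threshold = {effective}")
print("fully unrolls" if effective >= NEEDED_THRESHOLD else "does not fully unroll")
```

Under these assumptions a 179/167 ratio is only about a 7% boost, so the effective threshold stays well below ~200.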

> 
> 
>> We could use profile info to allow more aggressive threshold only for hot loops, so that code size increase can be avoided. But this requires BFI within the loop pass, which will be expensive (compile time overhead).
> 
> Using profile info in loop-unrolling is definitely worthwhile. Most of the loops we unroll are actually in the cold parts, so if we can avoid unrolling them, we can save some budget for more aggressive unrolling in hot regions (or just get smaller code and faster compilation).

I agree profile info can help strike a good balance here, but an O2 build cannot benefit from it.

> 
> 
>> OTOH, if the heuristic is generally helping performance...
> 
> What heuristic are you referring to here?

Sorry, I meant the profile I proposed in this patch.

Thanks,
Dehao

> Michael

https://reviews.llvm.org/D28368