[PATCH] D34373: [LV] Optimize for size when vectorizing loops with tiny trip count

Wed Jun 21 13:45:07 PDT 2017

Ayal added a comment.

Yes, we saw a couple of ~7% improvements running EEMBC benchmarks on x86.

This patch applies mostly to short static trip counts. For it to apply to short profile-based trip counts, they would need to be divisible by VF statically to avoid a remainder loop.
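To make the divisibility condition concrete, here is a minimal sketch. The helper names are hypothetical and are not part of the LoopVectorizer; it only assumes the trip count (TC) and vectorization factor (VF) are known integers.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only; not LLVM's API.

// Number of scalar iterations left over after running TC / VF vector
// iterations. VF <= 1 means "not vectorized", so nothing is left over.
constexpr std::uint64_t remainderIterations(std::uint64_t TC,
                                            std::uint64_t VF) {
  return VF > 1 ? TC % VF : 0;
}

// A remainder (epilogue) loop is needed only when TC is not a multiple of VF.
constexpr bool needsRemainderLoop(std::uint64_t TC, std::uint64_t VF) {
  return remainderIterations(TC, VF) != 0;
}
```

For example, TC = 8 with VF = 4 leaves no remainder, while TC = 10 with VF = 4 leaves two scalar iterations that would need a remainder loop.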

The current cost model aims to estimate the relative performance of the loop body only, vectorized vs. the original scalar version. The overheads of runtime guards and a remainder loop may certainly outweigh the gains of the vectorized body, especially if the trip count is small, unless we know that neither is needed at all. If the body is expected to run faster when vectorized with a large trip count, it seems reasonable to expect it would also do so with a small trip count, when all that's running is the body. Right?
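A rough sketch of that argument, with made-up numbers: the struct, function names, and cost units below are assumptions for illustration and do not reflect how the vectorizer's cost model is implemented. The body-only comparison is ScalarIterCost * VF vs. VectorIterCost; the totals add a one-time guard cost and the scalar remainder iterations.

```cpp
#include <cstdint>

// Hypothetical per-loop costs in arbitrary units; not LLVM's cost model.
struct LoopCosts {
  std::uint64_t ScalarIterCost; // one iteration of the scalar body
  std::uint64_t VectorIterCost; // one iteration of the vector body (VF lanes)
  std::uint64_t GuardCost;      // one-time cost of runtime guards, if emitted
};

// Total cost of running the original scalar loop for TC iterations.
constexpr std::uint64_t scalarTotal(std::uint64_t TC, LoopCosts C) {
  return TC * C.ScalarIterCost;
}

// Total cost of the vectorized loop: optional guards, TC / VF vector
// iterations, and TC % VF leftover scalar iterations. Requires VF >= 1.
constexpr std::uint64_t vectorTotal(std::uint64_t TC, std::uint64_t VF,
                                    LoopCosts C, bool NeedsGuards) {
  return (NeedsGuards ? C.GuardCost : 0) + (TC / VF) * C.VectorIterCost +
         (TC % VF) * C.ScalarIterCost;
}
```

With the assumed costs {4, 6, 24} and VF = 4, a trip count of 8 costs 32 scalar and 36 vectorized with guards, so the guards erase the gain. Without guards the same loop costs 12, and at a trip count of 1000 the guards are amortized (1524 vs. 4000).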

Regarding unrolling such small trip-count loops, note that the loop-vectorizer itself may decide to do so, with interleaving.

Sure, will add comments explaining the logic behind turning on OptForSize in this case.

https://reviews.llvm.org/D34373