[PATCH] D34373: [LV] Optimize for size when vectorizing loops with tiny trip count

Hal Finkel via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Jun 20 15:33:30 PDT 2017


hfinkel added a comment.

In https://reviews.llvm.org/D34373#785678, @twoh wrote:

> In https://reviews.llvm.org/D34373#785642, @hfinkel wrote:
>
> > In https://reviews.llvm.org/D34373#785485, @twoh wrote:
> >
> > > In https://reviews.llvm.org/D34373#785367, @hfinkel wrote:
> > >
> > > > In https://reviews.llvm.org/D34373#784975, @twoh wrote:
> > > >
> > > > > I think this is the right approach, but I'm concerned that the experimental results I shared on https://reviews.llvm.org/D32451 show that it is generally better not to vectorize low-trip-count loops. @Ayal, I wonder if you have any results showing that this patch actually improves performance. Thanks!
> > > >
> > > >
> > > > I know that we're currently missing opportunities for large vectorizable loops with low (static) trip counts. Smaller inner loops are also good candidates for unpredicated vectorization, but we may need to be a bit careful because of modeling inaccuracies and phase-ordering effects (e.g. if we don't vectorize a loop, then we'll end up unrolling it when the unroller runs).
> > >
> > >
> > > Got it. My concern was for small single-level loops with low trip counts, as I observe them pretty frequently. I have no objection to accepting this patch and improving the cost estimator separately.
> >
> >
> > Do you mean that you see such loops frequently with dynamically-small trip counts, or with static trip counts? I assume the small loops with (small) static trip counts will generally be unrolled.
>
>
> Actually, you're right. The case I observed was a loop with a small static trip count that was completely unrolled and then SLP-vectorized, which actually hurt performance; that case did not involve the LV. I think this patch should work if it effectively targets loops that are large enough not to be unrolled.


I agree. Given that this will only vectorize loops that don't need a remainder loop, that should be fine even if the loop would otherwise be unrolled. As you might be pointing out with your observation about SLP vectorization sometimes hurting performance, there certainly are cases where vectorizing a small number of instructions can harm performance on OOO cores (for example, because it introduces additional data dependencies that might be more harmful than the corresponding increase in parallelism). It seems possible that the code for a small loop that comes out of the LV might have the same issue (if, for example, we generate unaligned vector loads, or access strided data and then shuffle it together). However, it is not clear to me that this will be the case. The SLP vectorizer has a minimum tree height of three, and for the LV to produce a loop that unrolls to fewer than three instructions, I assume it would essentially need to be a memcpy. I suspect that we'll need to try it and see if we find regressions.
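
For concreteness, here is a rough sketch of the two kinds of loops being discussed. The function names, trip counts, and loop bodies below are made up for illustration and are not taken from the patch or its tests:

  // Illustrative only: a loop of the kind this patch targets. The trip count
  // is small and statically known, there is enough work per iteration that
  // leaving it to the unroller is not obviously better, and with a vector
  // factor that divides the trip count (e.g. VF=4, TC=8) no scalar remainder
  // loop is needed.
  void axpy8(float *__restrict a, const float *__restrict b,
             const float *__restrict c) {
    for (int i = 0; i < 8; ++i)
      a[i] = a[i] * b[i] + c[i];
  }

  // By contrast, a loop whose body unrolls to only a couple of instructions
  // per iteration is essentially a memcpy; that is the kind of case where
  // fully unrolled scalar code (or a memcpy idiom) may already be as good as,
  // or better than, what the LV would produce.
  void copy8(int *__restrict dst, const int *__restrict src) {
    for (int i = 0; i < 8; ++i)
      dst[i] = src[i];
  }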


https://reviews.llvm.org/D34373