[PATCH] D88819: [LV] Support for Remainder loop vectorization

Tue Oct 6 10:49:13 PDT 2020

mivnay added a comment.

In D88819#2314457 <https://reviews.llvm.org/D88819#2314457>, @bmahjour wrote:

> Thanks for working on epilogue vectorization. Incidentally I've also looked into this recently. There has been a long and detailed discussion on the mailing list from back in 2017 about this transformation here http://llvm.1065342.n5.nabble.com/llvm-dev-Proposal-RFC-Epilog-loop-vectorization-td106322.html. Your patch is able to vectorize epilogue loops with fairly small changes to the LV, however the generated CFG is not optimal. For example, while the SCEV and memory checks are not redundantly executed, they are statically duplicated in the code and increase code size unnecessarily. The trip count checks can also be generated in a way that shortens the critical path from the checks to the scalar loop, which is critical for loops that have a small trip count. Based on the followup discussions from the mentioned RFC, the optimal CFG should look more like what I've attached below.
>
> F13180829: Screen Shot 2020-10-06 at 10.16.04 AM.png <https://reviews.llvm.org/F13180829>

Thanks for looking into the patch. The idea is to avoid affecting the performance of the original vectorization too much, even when epilog vectorization has happened. The CFG you suggested seems to perform the epilog trip count check first, even when the trip count is large enough for the original vector loop.

I think the optimal CFG depends on profiling information. I ran the SPEC CPU2017 benchmarks with the current change and did not see any regression, even though many loops were transformed. One of the benchmarks showed a gain.

> The trip count checks can also be generated in a way that shortens the critical path from the checks to the scalar loop, which is critical for loops that have a small trip count.

This approach doesn't work well when most trip counts are large enough for the original vector loop. In fact, it even performs one additional trip count check when both the vector loop and the epilog vector loop are executed. For example, if the original loop has VF=16 and UF=2, and the epilog loop has VF=8 and UF=1, **a trip count as small as 40 requires 3 trip count checks**, whereas it requires 2 in the current implementation.
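
To make the arithmetic concrete, here is a rough sketch of how a trip count splits across the three loops, together with a deliberately simplified count of the minimum-iteration checks on each path. The function names and the check-counting model are assumptions made for illustration. They follow my reading of the two check orderings described above, ignore the SCEV/memory checks and the exit comparisons after each vector loop, and do not reproduce the CFG generated by either patch.

```python
def split_trip_count(tc, main_vf, main_uf, epi_vf, epi_uf):
    """Return (main vector iterations, epilog vector iterations, scalar iterations)."""
    main_step = main_vf * main_uf   # elements consumed per main vector iteration
    epi_step = epi_vf * epi_uf      # elements consumed per epilog vector iteration
    main_iters, rem = divmod(tc, main_step)
    epi_iters, scalar_iters = divmod(rem, epi_step)
    return main_iters, epi_iters, scalar_iters


def checks_main_first(tc, main_step, epi_step):
    """Assumed ordering: main-loop check first, epilog check afterwards."""
    checks = 1   # is tc >= main_step? (enter or skip the main vector loop)
    checks += 1  # is the remainder >= epi_step? (enter or skip the epilog vector loop)
    return checks


def checks_epilog_first(tc, main_step, epi_step):
    """Assumed ordering: epilog check first, then main-loop check."""
    checks = 1          # is tc >= epi_step?
    if tc < epi_step:
        return checks   # straight to the scalar loop
    checks += 1         # is tc >= main_step?
    if tc < main_step:
        return checks   # epilog vector loop only
    checks += 1         # after the main vector loop: is the remainder >= epi_step?
    return checks


if __name__ == "__main__":
    print(split_trip_count(40, 16, 2, 8, 1))   # iterations per loop for the example above
    for tc in (5, 20, 40):
        print(tc, checks_main_first(tc, 32, 8), checks_epilog_first(tc, 32, 8))
```

Under these assumptions, a trip count of 40 runs one main vector iteration (32 elements) and one epilog vector iteration (8 elements) with no scalar remainder, and it passes 2 checks with the main-first ordering versus 3 with the epilog-first ordering. A trip count of 5 shows the opposite effect: 2 checks main-first versus 1 epilog-first, which is the short-trip-count case the quoted comment is concerned with.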

> For example, while the SCEV and memory checks are not redundantly executed, they are statically duplicated in the code and increase code size unnecessarily.

This optimization is disabled for -Osize. Redundant runtime check blocks can only be avoided when the epilog vector loop trip count checks are done first, but this looks like a code size vs. performance trade-off.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D88819/new/

https://reviews.llvm.org/D88819