[PATCH] D115261: [LV] Disable runtime unrolling for vectorized loops.

Dave Green via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Sep 2 01:14:43 PDT 2022


dmgreen added a comment.

In D115261#3761027 <https://reviews.llvm.org/D115261#3761027>, @fhahn wrote:

> In D115261#3760460 <https://reviews.llvm.org/D115261#3760460>, @dmgreen wrote:
>
>> There is an example in https://godbolt.org/z/8YnsWfGGo where.. something is going wrong.
>
> Yeah this is a case where the current logic to eliminate single-iteration vector loops doesn't trigger. Should be fixed by D133017 <https://reviews.llvm.org/D133017>

Thanks

>> This looks like it disabled full unrolling after vectorization too? I think it is fairly important for performance to be able to simplify loops away by unrolling them when the runtime trip count is small enough - either where the loop count becomes constant through propagation during LTO, or where the trip count is simply low enough. As far as I understand, straight-line code will almost always be better than looping, and preventing full unrolling would lead to performance regressions.
>
> The above patch doesn't solve the issue where the loop needs to be unrolled two or more times to completely remove the loop overhead. This shouldn't be an issue for X86/most AArch64 cores, and full unrolling leads to excessive unrolling as in https://github.com/llvm/llvm-project/issues/42332. But it may still be an issue for some other architectures. So maybe targets should be able to opt in/out?

Thanks for the extra context. The reasoning given (micro-op buffers) does sound very (micro-)architectural. If the loop body is the bottleneck, I'd be surprised that decode couldn't keep up with a fully unrolled version. A (relatively small) number of vector operations in a single basic block will often be faster than a loop; just having the loop causes a certain amount of overhead.

Disabling extra runtime unrolling after vectorization makes sense - otherwise the loop body can get too big and end up never being executed. We already do the same thing on certain targets. There will be places where it is worse for performance, but the benefits are likely to be more common.

But full unrolling sounds like it should be mostly beneficial, due to the extra simplification it can provide. We are going from too much unrolling to too little. I don't think I have anything that shows it in the benchmarks I usually run, but the cases I've heard about from customers in the past were DSP routines like those from https://github.com/ARM-software/CMSIS-DSP/tree/main/Source. They can get compiled with LTO, so during the normal compile step they are vectorized with unknown trip counts. During LTO they get inlined or const-propagated and a lot of the loop bounds become constant. It is expected that the compiler can simplify the result nicely, including the predicated vector loops we might have produced (which is why patches like https://reviews.llvm.org/D94103 were useful).

So I have no evidence of full unrolling being a problem, but my gut says that if it is useful to unroll scalar loops then vector loops shouldn't really be treated any differently.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D115261/new/

https://reviews.llvm.org/D115261


