[llvm] [LoopVectorize] Don't discount instructions scalarized due to tail folding (PR #109289)
John Brawn via llvm-commits
llvm-commits at lists.llvm.org
Tue Oct 22 07:11:47 PDT 2024
john-brawn-arm wrote:
> For the lowtrip and optsize cases the "icmp ult" is being sunk by the CodeGenPrepare pass into the blocks below, which is why the resulting assembly ends up being terrible. However, if we calculated the icmp just once in the loop header and then did an extractelement for each predicated block we'd end up with the same efficient code as the conditional function. In summary, I think we should instead fix the codegen for the lowtrip case. If we do that, then I expect the performance to be better than a scalar loop.
I don't think this is correct. In the function "conditional", without vectorization the "i & 0x1" check results in 16 comparisons (one per iteration), and vectorizing reduces that to two vector comparisons, so vectorization reduces the number of comparison instructions. In "lowtrip", tail folding introduces two vector comparisons where there was no comparison before, so vectorization increases the number of comparison instructions.
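To make the comparison counts concrete, here's a minimal sketch of the shape of the two functions being discussed (the actual test functions live in the PR; the exact signatures and bodies here are assumptions):

```c
/* Hedged reconstruction of the shape of the functions under discussion,
   not the PR's actual test cases. */

/* "conditional": branching on (i & 0x1) costs one scalar compare per
   iteration (16 for 16 iterations); vectorizing by 8 needs only two
   vector compares, so vectorization removes compares. */
void conditional(char *p) {
  for (int i = 0; i < 16; i++)
    if (i & 0x1)
      p[i] += 1;
}

/* "lowtrip": the scalar body has no compare of its own, so the two
   lane-mask compares that tail folding introduces (one per vector
   iteration at VF=8) are pure overhead relative to the scalar loop. */
void lowtrip(char *p) {
  for (int i = 0; i < 16; i++)
    p[i] += 1;
}
```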
I went and measured this on a Cortex-A53; the total time for 128 iterations of the lowtrip function is:
no vectorization: 7531
vectorization by 8: 28750
vectorization by 8, with CodeGenPrepare hacked to not sink the cmp: 9663
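For reference, a minimal sketch of the kind of harness such a measurement might use; the unit of the numbers above isn't stated in the thread, and clock_gettime is just one assumed way of timing the calls:

```c
/* Hedged sketch of a possible timing harness; the actual measurement
   setup on the Cortex-A53 isn't described here. */
#include <stdio.h>
#include <time.h>

void lowtrip(char *p);  /* assumed signature, matching the sketch above */

int main(void) {
  static char buf[16];
  struct timespec start, end;
  clock_gettime(CLOCK_MONOTONIC, &start);
  for (int rep = 0; rep < 128; rep++)  /* "128 iterations of the lowtrip function" */
    lowtrip(buf);
  clock_gettime(CLOCK_MONOTONIC, &end);
  long ns = (end.tv_sec - start.tv_sec) * 1000000000L +
            (end.tv_nsec - start.tv_nsec);
  printf("total: %ld ns\n", ns);
  return 0;
}
```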
> In this case I agree with Florian that the best solution to the bug you're trying to fix is to limit the scope of the patch to just -Os and fix this before going through the cost model. The change in this PR is being applied not just to programs compiled with -Os, but also to lowtrip loops and targets that prefer tail-folding. It's not obvious to me that always preferring a scalar loop is the right way to go without further analysis of the impact.
This patch doesn't always prefer a scalar loop. It fixes an error in the cost calculation, and with the corrected cost we choose a scalar loop when the scalar loop is faster than the vector loop. My example in the comment above (https://github.com/llvm/llvm-project/pull/109289#discussion_r1804735580) shows a case where we still vectorize with this patch, as the calculated cost shows that vectorizing is faster than a scalar loop.
https://github.com/llvm/llvm-project/pull/109289