[llvm] [LV] optimize VF for low TC, when tail-folding (PR #91253)

David Green via llvm-commits llvm-commits at lists.llvm.org
Thu May 23 12:33:33 PDT 2024


davemgreen wrote:

> See the following example from the tree for masked loads:
> 
> ```llvm
> ; CHECK-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v2f16 = call <2 x half> @llvm.masked.load.v2f16.p0(ptr undef, i32 8, <2 x i1> undef, <2 x half> undef)
> ; CHECK-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %v4f16 = call <4 x half> @llvm.masked.load.v4f16.p0(ptr undef, i32 8, <4 x i1> undef, <4 x half> undef)
> ; CHECK-NEXT:  Cost Model: Found an estimated cost of 39 for instruction: %v8f16 = call <8 x half> @llvm.masked.load.v8f16.p0(ptr undef, i32 8, <8 x i1> undef, <8 x half> undef)
> ```
> 
> VPlan makes the cost-modeling decisions which could return `CM_ScalarEpilogueAllowed` for scalar-epilogue, or `CM_ScalarEpilogueNotNeededUsePredicate` for tail-folding. Unless I'm very much mistaken, `getMaximizedVFForTarget` is a rule-of-thumb approximation, that does not query the CostModel. I think VPlan would have already determined that MVE with low TC is unprofitable to vectorize, and this code will never be called. ~If you have a suitable test case, that would be good.~ Let me just craft a test for MVE: will post it soon.

I'll try to add a test case. There are some examples in https://godbolt.org/z/3aa81Yaqj of things that I thought might be more expensive for smaller vector lengths, even if the costs for X86 don't always show it (to be fair, I think that might be more about the loads/stores than the instructions between them in that case). Those are unpredicated, and some of the codegen could definitely be better, but if you imagine predicated loads/stores too, small vectors could be difficult to codegen.
MVE certainly has some peculiarities from being low-power, but I don't think it is especially different otherwise, except that it makes heavy use of (tail-)predicated vectorization. Short vectors in general do not get looked at as much as longer ones, as they usually come up less often.
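To make the concern concrete, here is a hypothetical example (not taken from the PR or its tests): a loop with a runtime trip count. Tail-folding it turns every load and store into a predicated (masked) operation, and at a small VF the per-lane masking overhead (compare the v2f16 vs. v8f16 masked-load costs quoted above) can outweigh the benefit of vectorizing at all.

```c
#include <stddef.h>

/* Hypothetical sketch: runtime trip count, so a tail-folded
 * vectorization must use masked loads/stores for every memory
 * access, not just the final iteration's remainder. */
void scale_by_two(float *dst, const float *src, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[i] * 2.0f;
}
```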

My understanding is that the code in https://github.com/llvm/llvm-project/blob/a38f0157f2a9efcae13b691c63723426e8adc0ee/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L4885 handles the costs, and will account for the trip count so long as it is known (and the vectors are fixed-width). Perhaps that could be extended to scalable vectors in some way too?
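A hypothetical sketch (again, not from the PR) of the kind of loop where trip-count-aware costing matters: the bound is a compile-time constant, so the vectorizer can weigh the candidate VFs against the actual iteration count rather than a per-iteration estimate.

```c
/* Hypothetical sketch: with a known trip count of 7, the cost model
 * can compare, e.g., VF=4 (one vector iteration plus a 3-iteration
 * scalar remainder) against VF=8 with tail folding, using the real
 * total cost of each plan for those 7 iterations. */
void add_seven(int *dst, const int *src) {
    for (int i = 0; i < 7; ++i)  /* trip count known at compile time */
        dst[i] += src[i];
}
```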

https://github.com/llvm/llvm-project/pull/91253
