[llvm] [LV] optimize VF for low TC, when tail-folding (PR #91253)

Thu May 23 02:58:59 PDT 2024

artagnon wrote:

See the following example from the tree for masked loads:

```llvm
; CHECK-NEXT:  Cost Model: Found an estimated cost of 9 for instruction: %v2f16 = call <2 x half> @llvm.masked.load.v2f16.p0(ptr undef, i32 8, <2 x i1> undef, <2 x half> undef)
; CHECK-NEXT:  Cost Model: Found an estimated cost of 19 for instruction: %v4f16 = call <4 x half> @llvm.masked.load.v4f16.p0(ptr undef, i32 8, <4 x i1> undef, <4 x half> undef)
; CHECK-NEXT:  Cost Model: Found an estimated cost of 39 for instruction: %v8f16 = call <8 x half> @llvm.masked.load.v8f16.p0(ptr undef, i32 8, <8 x i1> undef, <8 x half> undef)
```
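As a rough back-of-the-envelope reading of those numbers, the sketch below divides each reported cost by its lane count and then estimates the total masked-load cost for a tail-folded loop at a given trip count. The costs are copied from the CHECK lines above; the helper names, the `ceil(TC / VF)` iteration model, and the example trip count are assumptions made for illustration, and this is not how the LoopVectorizer itself computes cost.

```python
import math

# Costs copied from the CHECK lines above: lanes (VF) -> estimated cost.
masked_load_cost = {2: 9, 4: 19, 8: 39}


def cost_per_lane(costs):
    """Return {VF: cost / VF}, i.e. the cost attributed to each lane."""
    return {vf: cost / vf for vf, cost in costs.items()}


def folded_load_cost(trip_count, vf, costs):
    """Hypothetical model: a tail-folded loop runs ceil(TC / VF) vector
    iterations, each paying the full masked-load cost for that VF."""
    return math.ceil(trip_count / vf) * costs[vf]


per_lane = cost_per_lane(masked_load_cost)
for vf in sorted(masked_load_cost):
    print(f"VF={vf}: total={masked_load_cost[vf]}, per-lane={per_lane[vf]:.3f}")

# Example with an assumed low trip count of 3.
for vf in sorted(masked_load_cost):
    print(f"TC=3, VF={vf}: {folded_load_cost(3, vf, masked_load_cost)}")
```

Under this simplified model the per-lane cost does not drop as the VF grows, so for a low trip count a wider VF does not make the masked load itself any cheaper.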

VPlan makes its cost-modeling decisions well before we determine whether tail-folding or a scalar epilogue is required, and it has no support for coming up with a vectorization plan that takes either into consideration. Unless I'm very much mistaken, `getMaximizedVFForTarget` is a rule-of-thumb approximation that does not query the CostModel. MVE with low TCs requiring tail-folding or a scalar epilogue is an edge case that I don't think this code is equipped to handle.

https://github.com/llvm/llvm-project/pull/91253