[llvm] [LV] optimize VF for low TC, when tail-folding (PR #91253)

Thu May 23 00:56:08 PDT 2024

davemgreen wrote:

Sorry, I didn't really want to be a blocker; I've been doing that too much lately.

But there shouldn't be a problem in using 3 out of 8 vector lanes in a predicated v8i16 if it still takes a single vector iteration. Thanks to the predication, it is still the same number of operations. Why limit it to v4i16 if the target, like MVE, is better at 128-bit vectorization than 64-bit?
On paper, I'm not sure there should be an advantage to limiting the VF for predicated loops, unless it was previously picking VFs that involved more instructions (more than the natural vector width), in which case it feels like a cost-modelling issue.
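
To make the arithmetic concrete, here is a minimal sketch of the iteration count for a tail-folded loop. The helper names are hypothetical and are not taken from LoopVectorize; the sketch assumes a nonzero VF and ignores per-instruction cost, which is exactly the part a cost model would have to supply.

```cpp
#include <cstdint>

// Hypothetical helper, not an LLVM API: number of vector iterations a
// tail-folded (predicated) loop executes. The partial final iteration is
// masked instead of being peeled into a scalar remainder, so the count is
// ceil(TripCount / VF). Assumes VF != 0.
constexpr std::uint64_t vectorIterations(std::uint64_t TripCount,
                                         std::uint64_t VF) {
  return (TripCount + VF - 1) / VF;
}

// Hypothetical helper: lanes left enabled by the predicate in the final
// vector iteration.
constexpr std::uint64_t activeLanesInLastIteration(std::uint64_t TripCount,
                                                   std::uint64_t VF) {
  return TripCount == 0
             ? 0
             : TripCount - (vectorIterations(TripCount, VF) - 1) * VF;
}

// A trip count of 3 needs one predicated iteration at VF=4 and at VF=8.
static_assert(vectorIterations(3, 4) == 1, "VF=4: one iteration");
static_assert(vectorIterations(3, 8) == 1, "VF=8: one iteration");
// Either way, only 3 lanes are active in that iteration.
static_assert(activeLanesInLastIteration(3, 4) == 3, "3 of 4 lanes");
static_assert(activeLanesInLastIteration(3, 8) == 3, "3 of 8 lanes");
```

By this count alone, VF=4 and VF=8 are indistinguishable for a trip count of 3, so any preference between them would have to come from how the target prices the wider operations.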

https://github.com/llvm/llvm-project/pull/91253