[llvm] [LV] Pure runtime check for minimum profitable trip count. (PR #115833)

Mon Nov 18 05:59:27 PST 2024

david-arm wrote:

> > Just for information I tried this patch out with the x264 benchmark on neoverse-v1 and it causes a ~2% performance regression. It looks like in the hot loop in `mc_chroma` we are not entering the tail-folded vector loop as often, and falling back on the scalar tail. I guess that means the min profitable trip count isn't quite right.
> 
> I got a min profitable TC of 5 in function `mc_chroma`; did you too? So far, we think something may need fixing in how the min profitable TC is calculated. Maybe this:
> 
> ```
>     // Second, compute a minimum iteration count so that the cost of the
>     // runtime checks is only a fraction of the total scalar loop cost. This
>     // adds a loop-dependent bound on the overhead incurred if the runtime
>     // checks fail. In case the runtime checks fail, the cost is RtC + ScalarC
>     // * TC. To bound the runtime check to be a fraction 1/X of the scalar
>     // cost, compute
>     //   RtC < ScalarC * TC * (1 / X)  ==>  RtC * X / ScalarC < TC
>     uint64_t MinTC2 = divideCeil(RtC * 10, ScalarC);
> ```
> 
> Anyway, we will do more experiments and update the min profitable TC if needed.

OK thank you for investigating! So when targeting SVE on neoverse-v1 we get the benefit of tail-folding without any profile information. I think you might be right about the lowest trip count, but I believe the average trip count is higher than that. However, x264 is also a special case because for some of the loops we rewrite the runtime memory checks for the inner loop to cover the entire outer loop. I updated the cost model a while ago in `GeneratedRTChecks::getCost` to reflect this. I'm not sure if this is happening in `mc_chroma` or not.
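
For reference, here is a rough standalone sketch of how the quoted `MinTC2` bound behaves. It is not the LoopVectorize code: `minTripCountForChecks` is a hypothetical helper, `divideCeil` is a local stand-in for the LLVM utility of the same name, and the cost values mentioned afterwards are made up for illustration rather than measured from `mc_chroma`.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-in for llvm::divideCeil so the sketch builds without LLVM
// headers: integer division rounded up.
constexpr uint64_t divideCeil(uint64_t Numerator, uint64_t Denominator) {
  return (Numerator + Denominator - 1) / Denominator;
}

// Hypothetical helper mirroring the quoted bound. RtC is the cost of the
// runtime checks and ScalarC the cost of one scalar iteration. X is the
// factor from the comment (the checks should cost at most 1/X of the total
// scalar loop cost); the quoted snippet hard-codes X = 10.
uint64_t minTripCountForChecks(uint64_t RtC, uint64_t ScalarC,
                               uint64_t X = 10) {
  assert(ScalarC != 0 && "scalar iteration cost must be non-zero");
  return divideCeil(RtC * X, ScalarC);
}
```

With assumed costs `RtC = 4` and `ScalarC = 8` this gives a bound of 5, and with `RtC = 40` it gives 50, so the result scales linearly with the runtime check cost. That would make it sensitive to how `GeneratedRTChecks::getCost` accounts for checks that cover the entire outer loop, if that is indeed what happens here.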

https://github.com/llvm/llvm-project/pull/115833