[llvm] [LV] Pure runtime check for minimum profitable trip count. (PR #115833)
David Sherwood via llvm-commits
llvm-commits at lists.llvm.org
Mon Nov 18 05:59:27 PST 2024
david-arm wrote:
> > Just for information I tried this patch out with the x264 benchmark on neoverse-v1 and it causes a ~2% performance regression. It looks like in the hot loop in `mc_chroma` we are not entering the tail-folded vector loop as often, and falling back on the scalar tail. I guess that means the min profitable trip count isn't quite right.
>
> I got a min prof TC of 5 in function `mc_chroma`; did you too? So far, we think something may need fixing in how the min prof TC is calculated. Maybe this:
>
> ```
> // Second, compute a minimum iteration count so that the cost of the
> // runtime checks is only a fraction of the total scalar loop cost. This
> // adds a loop-dependent bound on the overhead incurred if the runtime
> // checks fail. In case the runtime checks fail, the cost is RtC + ScalarC
> // * TC. To bound the runtime check to be a fraction 1/X of the scalar
> // cost, compute
> // RtC < ScalarC * TC * (1 / X) ==> RtC * X / ScalarC < TC
> uint64_t MinTC2 = divideCeil(RtC * 10, ScalarC);
> ```
>
> Anyway, we will do more experiments and update the min prof TC if needed.
OK, thank you for investigating! So when targeting SVE on neoverse-v1 we get the benefit of tail-folding without any profile information. I think you might be right about the lowest trip count, but I believe the average trip count is higher than that. However, x264 is also a special case, because for some of the loops we rewrite the runtime memory checks for the inner loop so that they cover the entire outer loop. I updated the cost model in `GeneratedRTChecks::getCost` a while ago to reflect this. I'm not sure whether this is happening in `mc_chroma` or not.
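
For reference, the bound in the quoted comment can be worked through in isolation. This is only a minimal sketch of the arithmetic, not the vectorizer's code: `divideCeilSketch` and `minTripCountForChecks` are hypothetical names standing in for the `divideCeil` call above, the fraction `X` defaults to the 10 used in the quoted snippet, and the costs passed in are made-up values rather than anything measured in `mc_chroma`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the divideCeil helper used in the quoted code:
// integer division rounded up. Assumes Denominator > 0.
inline uint64_t divideCeilSketch(uint64_t Numerator, uint64_t Denominator) {
  assert(Denominator > 0 && "scalar cost must be non-zero");
  return (Numerator + Denominator - 1) / Denominator;
}

// Hypothetical helper: smallest trip count for which the runtime check cost
// (RtC) is at most 1/X of the scalar loop cost (ScalarC * TC), i.e.
//   RtC < ScalarC * TC * (1 / X)  ==>  RtC * X / ScalarC < TC
inline uint64_t minTripCountForChecks(uint64_t RtC, uint64_t ScalarC,
                                      uint64_t X = 10) {
  return divideCeilSketch(RtC * X, ScalarC);
}
```

With illustrative costs `RtC = 6` and `ScalarC = 4`, this gives `divideCeil(60, 4) = 15`, so a smaller `RtC` (for example, one amortized over an outer loop) lowers the bound proportionally.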
https://github.com/llvm/llvm-project/pull/115833