[llvm] [LV][VPlan] Add initial support for CSA vectorization (PR #121222)
via llvm-commits
llvm-commits at lists.llvm.org
Tue Feb 11 02:59:05 PST 2025
ayalz wrote:
@michaelmaitland - thanks for the intriguing discussion!
> @ayalz, we are carrying future patches which should improve the performance of the loop by using mask logic instead of reductions inside the loop. I planned on posting it as a follow up since it is really more performant in cases where the target core has mask units that can handle it well. It looks something like this:
>
> ```
> int t = init_val;
> <VF x i1> vmask = 0;
> <VF x ?> va;
> for (int i = 0; i < N; i+=VF) {
>   vmaski = cond[i:i+VF-1];
>   vmask = (vmsbf(vmaski) & vmask) | vmaski
>   vai = a[i:i+VF-1]
>   va = vmerge vmaski, vai, va
> }
> if any(vmask) {
>   i = last(vmask)
>   t = extract (va, i)
> }
> s = t; // use t
> ```
>
> This is not the same as a FindLast inside the loop because there is no reducing on each loop iteration. Since this pattern is not an extension of "FindLast", I'm not sure it is a good idea to develop CSAs as in loop reductions.
It may be helpful to distinguish between:
1. the *input* pattern - as it appears in the IR given to the vectorizer;
2. the *canonical* pattern - underlying essential construct recognized during vectorization, and
3. the *output* pattern - as it appears in the IR generated by the vectorizer.
The canonical pattern for the motivating example, presented with the following input pattern:
```
int t = init_val;
for (int i = 0; i < N; i++) {
  if (cond[i])
    t = a[i];
}
s = t; // use t
```
is arguably that of a FindLast reduction. Indeed, an alternative input pattern, using a sentinel, is:
```
int t = init_val;
int last = -1;
for (int i = 0; i < N; i++) {
  if (cond[i])
    last = i;
}
s = (last == -1) ? init_val : a[last];
```
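To make the claimed equivalence of the two input patterns concrete, here is a minimal scalar sketch in C. The function names (`csa_select`, `find_last_select`, `check_patterns_agree`) and the sample data are hypothetical, chosen only for illustration; nothing here is taken from the patch itself.

```c
#include <assert.h>

/* First input pattern: conditional scalar assignment inside the loop. */
static int csa_select(const int *a, const int *cond, int n, int init_val) {
    int t = init_val;
    for (int i = 0; i < n; i++) {
        if (cond[i])
            t = a[i];
    }
    return t;
}

/* Second input pattern: FindLast on the index, with -1 as the sentinel,
   followed by a single load after the loop. */
static int find_last_select(const int *a, const int *cond, int n, int init_val) {
    int last = -1;
    for (int i = 0; i < n; i++) {
        if (cond[i])
            last = i;
    }
    return (last == -1) ? init_val : a[last];
}

/* Hypothetical self-check on made-up data: both forms must agree,
   including when the condition never holds. */
static void check_patterns_agree(void) {
    const int a[6] = {10, 20, 30, 40, 50, 60};
    const int some[6] = {0, 1, 0, 1, 0, 0};
    const int none[6] = {0, 0, 0, 0, 0, 0};
    assert(csa_select(a, some, 6, -7) == find_last_select(a, some, 6, -7));
    assert(csa_select(a, none, 6, -7) == find_last_select(a, none, 6, -7));
}
```

The only difference between the two is where the load of `a` happens: on every iteration where the condition holds, or once after the loop.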
Hopefully these two input patterns could be recognized and manipulated using common infrastructure including recipes, and benefit from similar output patterns. Now, regarding the latter - multiple options may indeed be considered, with the one having the best cost to be selected. The above example demonstrates one such option, where
`vmask = (vmsbf(vmaski) & vmask) | vmaski`
is an optimized version of
`vmask = any(vmaski) ? vmaski : vmask`
with a slight deviation - if `any(vmaski)` then any bit prior to the last active bit of `vmaski` may be turned on due to `vmask`, but such bits are irrelevant (please correct me if needed).
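A small C sketch can check the claim that the two mask updates agree on the last active lane. It assumes a vector length of 8 with lane 0 in the least significant bit, and emulates `vmsbf` following the RVV set-before-first semantics (all lanes set when the input has no active lane). The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { VF = 8 };        /* assumed vector length, for illustration only */
#define ALL_LANES 0xFFu /* the VF low bits */

/* Emulates set-before-first: every lane before the first active lane is
   set; all lanes are set when no lane is active. */
static uint8_t vmsbf(uint8_t m) {
    unsigned x = m;
    if (x == 0)
        return (uint8_t)ALL_LANES;
    unsigned lowest = x & (~x + 1u); /* isolate the first active lane */
    return (uint8_t)(lowest - 1u);
}

/* Index of the last active lane, or -1 if the mask is empty. */
static int last_lane(uint8_t m) {
    int idx = -1;
    for (int lane = 0; lane < VF; lane++)
        if (m & (1u << lane))
            idx = lane;
    return idx;
}

/* vmask = any(vmaski) ? vmaski : vmask */
static uint8_t update_select(uint8_t vmask, uint8_t vmaski) {
    return vmaski ? vmaski : vmask;
}

/* vmask = (vmsbf(vmaski) & vmask) | vmaski */
static uint8_t update_vmsbf(uint8_t vmask, uint8_t vmaski) {
    return (uint8_t)((vmsbf(vmaski) & vmask) | vmaski);
}

/* Exhaustively checks that both updates agree on the last active lane. */
static void check_all_masks(void) {
    for (unsigned vmask = 0; vmask <= ALL_LANES; vmask++)
        for (unsigned vmaski = 0; vmaski <= ALL_LANES; vmaski++)
            assert(last_lane(update_select((uint8_t)vmask, (uint8_t)vmaski)) ==
                   last_lane(update_vmsbf((uint8_t)vmask, (uint8_t)vmaski)));
}
```

For example, with `vmask = 0x01` and `vmaski = 0x0C` the select form yields `0x0C` and the `vmsbf` form yields `0x0D`; the stale bit in lane 0 lies below the first active lane of `vmaski`, so the last active lane is 3 in both cases.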
As mentioned above, it may indeed be "better to pass vmask as a live-out and sink looking for its last turned-on lane to after the loop, instead of looking for it inside the loop and passing i as live-out" - when considering how to best optimize a FindLast pattern; this includes how to best maintain vmask inside the loop.
Now, whether it is better to sink a single scalar load of `a[last]` out of the loop, or maintain a vector `va` inside the loop eventually holding it, via
```
vai = a[i:i+VF-1]
va = vmerge vmaski, vai, va
```
is also a choice to be made based on cost, unless one option clearly outweighs the other.
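The two output options can be compared with a scalar emulation in C. This is only a sketch under stated assumptions: `VF` is 4, `n` is a multiple of `VF`, masks are modeled as `int` arrays, and the sunk-load variant carries a hypothetical `base` (start of the most recent chunk with an active lane) alongside `vmask`, which the discussion above does not spell out.

```c
#include <assert.h>

enum { VF = 4 }; /* assumed vector length; n is assumed a multiple of VF */

/* Index of the last active lane, or -1 if the mask is empty. */
static int last_lane(const int *mask) {
    int idx = -1;
    for (int lane = 0; lane < VF; lane++)
        if (mask[lane])
            idx = lane;
    return idx;
}

/* Option A: keep only vmask (plus the hypothetical chunk base) in the
   loop and sink a single scalar load of a[last] to after the loop. */
static int select_sunk_load(const int *a, const int *cond, int n, int init_val) {
    int vmask[VF] = {0};
    int base = 0;
    for (int i = 0; i < n; i += VF) {
        int any = 0;
        for (int lane = 0; lane < VF; lane++)
            any |= (cond[i + lane] != 0);
        if (any) { /* vmask = any(vmaski) ? vmaski : vmask */
            for (int lane = 0; lane < VF; lane++)
                vmask[lane] = (cond[i + lane] != 0);
            base = i;
        }
    }
    int lane = last_lane(vmask);
    return lane < 0 ? init_val : a[base + lane];
}

/* Option B: also maintain va in the loop via a per-lane merge, then
   extract the last active lane after the loop. */
static int select_merged_vector(const int *a, const int *cond, int n, int init_val) {
    int vmask[VF] = {0};
    int va[VF] = {0};
    for (int i = 0; i < n; i += VF) {
        int any = 0;
        for (int lane = 0; lane < VF; lane++)
            any |= (cond[i + lane] != 0);
        for (int lane = 0; lane < VF; lane++) {
            int active = (cond[i + lane] != 0);
            if (any)
                vmask[lane] = active;
            if (active) /* va = vmerge vmaski, vai, va */
                va[lane] = a[i + lane];
        }
    }
    int lane = last_lane(vmask);
    return lane < 0 ? init_val : va[lane];
}

/* Hypothetical self-check on made-up data. */
static void check_options_agree(void) {
    const int a[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    const int cond[8] = {0, 1, 0, 0, 1, 0, 1, 0};
    assert(select_sunk_load(a, cond, 8, -1) == 70);
    assert(select_merged_vector(a, cond, 8, -1) == 70);
}
```

Option A trades a vector load and merge per iteration for one extra scalar live-out and one scalar load after the loop; which is cheaper would depend on the target.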
>
> I think we're saying the same thing though, where we both want to see FindLast in the middle block. This is essentially what I'm doing with the ExtractRecipe. I wrote the ExtractRecipe since FindLast only supports monotonics. I still think that we need to keep the in-loop approach though, regardless of whether we use FindLast outside the loop or not?
>
> > I think such patterns are essentially extensions of "FindLast" reduction and should be developed as such, rather than being considered distinct unrelated patterns.
>
> @Mel-Chen can you chime in here? Can FindLast handle non-monotonic cases? I think the reason we took the approach proposed in this patch was because FindLast only works for monotonic cases.
FindLast conceptually computes the last iteration for which some condition holds, and the iteration count is conceptually monotone and increasing (wraparound at the end may be an issue?). Computing FindLast using a sentinel value, however, requires that such a value exists. In its absence, an external "found" indicator can be used, or the condition can be re-checked for the resulting iteration, as @Mel-Chen pointed out in several TODOs of #67812. Note that decreasing IVs are derived from the canonical iteration count IV, and their derivative function could be sunk as well. In any case, if the current limitation of FindLast regarding monotone/sentinel is too restrictive, it would be good to work towards lifting it.
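The two sentinel-free alternatives mentioned above can be sketched in scalar C as follows. The struct and function names are hypothetical and are not taken from #67812; the index type is `size_t`, for which no spare sentinel value is assumed to be available.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result type: the index is meaningful only if found is set. */
typedef struct {
    bool found;
    size_t index;
} find_last_result;

/* Alternative 1: an external "found" indicator instead of a sentinel. */
static find_last_result find_last(const int *cond, size_t n) {
    find_last_result r = {false, 0};
    for (size_t i = 0; i < n; i++) {
        if (cond[i]) {
            r.found = true;
            r.index = i;
        }
    }
    return r;
}

static int select_by_flag(const int *a, const int *cond, size_t n, int init_val) {
    find_last_result r = find_last(cond, n);
    return r.found ? a[r.index] : init_val;
}

/* Alternative 2: no indicator; re-check the condition at the resulting
   iteration. The start value 0 is a valid index, not a sentinel. */
static int select_by_recheck(const int *a, const int *cond, size_t n, int init_val) {
    size_t last = 0;
    for (size_t i = 0; i < n; i++) {
        if (cond[i])
            last = i;
    }
    return (n > 0 && cond[last]) ? a[last] : init_val;
}

/* Hypothetical self-check on made-up data. */
static void check_sentinel_free(void) {
    const int a[4] = {5, 6, 7, 8};
    const int first_only[4] = {1, 0, 0, 0};
    const int none[4] = {0, 0, 0, 0};
    assert(select_by_flag(a, first_only, 4, -1) == 5);
    assert(select_by_recheck(a, first_only, 4, -1) == 5);
    assert(select_by_flag(a, none, 4, -1) == -1);
    assert(select_by_recheck(a, none, 4, -1) == -1);
}
```

The re-check variant works because `last` stays at its start value only when the condition held at iteration 0 or never held at all, and a single test of `cond[last]` distinguishes the two.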
https://github.com/llvm/llvm-project/pull/121222