[PATCH] D150851: [LoopVectorize] Vectorize select-cmp reduction pattern for increasing integer induction variable

Sun Jul 2 15:07:51 PDT 2023

Ayal added a comment.

In D150851#4458987 <https://reviews.llvm.org/D150851#4458987>, @Mel-Chen wrote:

> [snip]
>
> I believe that the select-cmp reduction pattern can be classified into several types based on the selecting variable. Currently, I have categorized them as follows:

It would be good to find more accurate names than `Select*Cmp` combinations. The existing Min/Max reductions also use a compare-select pattern, but have a much better name.

> 1. Select operand is a loop invariant, i.e., `Select[I|F]Cmp`. This has already been implemented in the D108136 <https://reviews.llvm.org/D108136> by @david-arm.

and should be renamed `[I|F]Any` or something else more accurate. The two invariant operands should be sunk out of the loop and selected between after it, depending on whether "any" matching iteration was found.
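
To illustrate the sinking, here is a minimal scalar sketch in C. The function names and the `found_value` parameter are hypothetical and do not come from the patch; the predicate `a[i] > b[i]` is borrowed from the examples in this thread.

```c
#include <stdbool.h>
#include <stddef.h>

/* Original form: the select carries one of two loop-invariant values. */
int select_cmp_original(const int *a, const int *b, size_t n,
                        int start_value, int found_value) {
  int red = start_value;
  for (size_t i = 0; i < n; ++i)
    red = (a[i] > b[i]) ? found_value : red;
  return red;
}

/* "Any" form: only a boolean is reduced inside the loop; the two
   invariant operands are selected once, after the loop. */
int select_cmp_any(const int *a, const int *b, size_t n,
                   int start_value, int found_value) {
  bool any = false;
  for (size_t i = 0; i < n; ++i)
    any |= (a[i] > b[i]);
  return any ? found_value : start_value;
}
```

Both functions should return the same value for any input; the second one keeps only a boolean live across iterations.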

> 2. Select operand is a monotonic increasing/decreasing induction variable, and the start value of the induction variable is not equal to the minimum/maximum value of the data type. This patch handles the case of signed increasing induction variables, while the case of decreasing induction variables is yet to be implemented. The decision to only handle signed variables depends on LLVM's design, the issue including the choice of sentinel values, and the selection of umax|smax reduction intrinsics. If the compiler architecture allows distinguishing between signed and unsigned, the unsigned induction variable case should be easily achievable.
>
> 3. Select operand is a monotonic increasing/decreasing induction variable, and there are no restrictions on the start value of the induction variable.
>
>   unsigned int red = start_value;
>   for (unsigned int i = 0; i < n; ++i)
>     red = (a[i] > b[i]) ? i : red;
>
> 4. Select operand is an any variable.
>
>   int red = start_value;
>   for (int i = 0; i < n; ++i)
>     red = (a[i] > b[i]) ? c[i] : red;
>
> Both 1) and 2) can be handled with a single reduction. On the other hand, 3) and 4) are more complex, and require two reductions to be completed.

Arguably, (2), (3) and (4) are all essentially `FindLast` reductions: they look for the last loop iteration i for which some predicate p(i), such as a[i]-b[i]>0, holds, plus an indicator of whether any such iteration was found, followed by some post-processing of these results. In (2), (3) and (4), if no such iteration was found then some invariant "start_value" is returned, regardless of whether it was originally out-of-bounds or not. In (4), in addition, if a loop iteration i satisfying p(i) was found, some computation f(i) on the last such i should be returned, as in c[i]. This is analogous to sinking the invariants of case (1), and may indeed be more elaborate, but also covers simpler cases such as any (other) AddRec/IV that can be evaluated given i, e.g., `red = (a[i] > b[i]) ? 3*i+8 : red;`. So choose any IV you prefer, e.g., one which has the desired no-wrap property plus an out-of-bound value, even if the original one does not.
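
A minimal scalar sketch of this decomposition in C, under the assumption that the predicate is `a[i] > b[i]`. The struct and function names are hypothetical, chosen only for illustration:

```c
#include <stdbool.h>
#include <stddef.h>

/* What the reduction loop itself needs to maintain: the last matching
   index plus an indicator of whether any match was found. */
typedef struct {
  bool found;
  size_t index;
} find_last_result;

static find_last_result find_last_gt(const int *a, const int *b, size_t n) {
  find_last_result r = {false, 0};
  for (size_t i = 0; i < n; ++i)
    if (a[i] > b[i]) {
      r.found = true;
      r.index = i;
    }
  return r;
}

/* Case (3): red = p(i) ? i : red */
unsigned case3(const int *a, const int *b, size_t n, unsigned start_value) {
  find_last_result r = find_last_gt(a, b, n);
  return r.found ? (unsigned)r.index : start_value;
}

/* Case (4): red = p(i) ? c[i] : red, with f(i) = c[i] applied after the loop. */
int case4(const int *a, const int *b, const int *c, size_t n, int start_value) {
  find_last_result r = find_last_gt(a, b, n);
  return r.found ? c[r.index] : start_value;
}

/* AddRec example: red = p(i) ? 3*i+8 : red, evaluated from the last index. */
int case_addrec(const int *a, const int *b, size_t n, int start_value) {
  find_last_result r = find_last_gt(a, b, n);
  return r.found ? 3 * (int)r.index + 8 : start_value;
}
```

All three cases share the same loop; they differ only in the post-processing applied to the found index.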

In any case, it should be helpful to distill what actually needs to be maintained throughout the reduction loop, even if more appears there originally; be it boolean indicators in `Any` reductions (`cmp_red_part` in your example), indices in `Find` reductions (`select_red_part` in your example, initialized with "unfound" out-of-bound indicators), or compound value+index in e.g. min/max-with-index reductions. The compiler surely can, and does, introduce new phis as needed, hopefully of minimal width, but it could also try to eliminate existing phis and reduce the number of values that are live-out of the loop, possibly at the cost of replicating code, e.g., if c[i] is also used inside the loop.

Here's a sketch minimizing the size of the indices maintained throughout the loop, so they would avoid wrapping, provide out-of-bound values, and possibly use narrower types depending on trip-count and vl:

  return_type FindLast(return_type unfound_value, vec_predicate_func, found_func) {
    vec_unsigned_int select_red_part = splat(0); // Zero indicates unfound.
    vec_unsigned_int step_vec = splat(1); // Count vector iterations starting at 1.

    for (unsigned int i = 0; i < n; i+=vl, step_vec+=splat(1))
      select_red_part = vec_predicate_func(i) ? step_vec : select_red_part;

    unsigned vec_indices_ored = reduce.or(select_red_part);
    if (vec_indices_ored == 0)
      return unfound_value;
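    // Unfound lanes hold 0; mask them so that 0 - 1 does not wrap into a huge index.
    vec_unsigned_int inflated_red_part = (select_red_part != splat(0)) ? (select_red_part - splat(1)) * vl + <0,1,...,vl-1> : splat(0);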
    unsigned last_index = reduce.umax(inflated_red_part);
    return found_func(last_index);
  }
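
For reference, a runnable scalar emulation of the sketch above in C, with lanes modelled as a small array. The fixed `VL` of 4, the function name, the `a[i] > b[i]` predicate and the tail guard `i + lane < n` are assumptions made for illustration; a real vectorized loop would handle the tail with an epilogue or masking.

```c
#include <stdbool.h>
#include <stddef.h>

enum { VL = 4 }; /* assumed vector length */

/* Returns true and stores the last index i with a[i] > b[i] in *last_index;
   returns false (leaving *last_index untouched) if there is no such i. */
bool find_last_gt_lanes(const int *a, const int *b, size_t n,
                        size_t *last_index) {
  unsigned select_red_part[VL] = {0}; /* 0 means "unfound" in that lane */
  unsigned step = 1;                  /* counts vector iterations from 1 */

  for (size_t i = 0; i < n; i += VL, ++step)
    for (size_t lane = 0; lane < VL; ++lane)
      if (i + lane < n && a[i + lane] > b[i + lane])
        select_red_part[lane] = step;

  /* reduce.or: zero means no lane ever matched. */
  unsigned ored = 0;
  for (size_t lane = 0; lane < VL; ++lane)
    ored |= select_red_part[lane];
  if (ored == 0)
    return false;

  /* Inflate each found lane back to a scalar index and take the maximum,
     skipping unfound lanes so that 0 - 1 never wraps. */
  size_t best = 0;
  for (size_t lane = 0; lane < VL; ++lane) {
    if (select_red_part[lane] == 0)
      continue;
    size_t idx = (size_t)(select_red_part[lane] - 1) * VL + lane;
    if (idx > best)
      best = idx;
  }
  *last_index = best;
  return true;
}
```

The per-lane counters only grow to the number of vector iterations plus one, which is what would let them use a narrower type than the original induction variable.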

Regarding `FindFirst`, indeed the natural way of writing it with a break would provide the compiler with an uncountable loop that is harder to vectorize because it requires speculative execution, and so it is natural to start with `FindLast`. But if it is written as a countable loop, the compiler might be able to vectorize and optimize it by introducing a break. Examples are an `[I|F]Any` countable loop that is free of any other side-effects, or a `FindLast` loop that can be reversed into a `FindFirst` loop moving backwards and breaking on the first find.
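
As a scalar illustration of the last point, a countable `FindLast` loop over `a[i] > b[i]` could be rewritten as a backwards scan that breaks on the first match. This is only a sketch with a hypothetical function name; it assumes the loop has no other side-effects and that all of a[0..n-1] and b[0..n-1] are safe to read, as they are in the original forward loop.

```c
#include <stdbool.h>
#include <stddef.h>

/* Scans from the last element towards the first and stops at the first
   match, which is the last match of the original forward loop. */
bool find_last_by_reverse_scan(const int *a, const int *b, size_t n,
                               size_t *last_index) {
  for (size_t i = n; i-- > 0;) {
    if (a[i] > b[i]) {
      *last_index = i;
      return true;
    }
  }
  return false;
}
```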

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150851/new/

https://reviews.llvm.org/D150851