[PATCH] D150851: [LoopVectorize] Vectorize select-cmp reduction pattern for increasing integer induction variable

Fri Jul 7 01:48:58 PDT 2023

Mel-Chen added a comment.

In D150851#4467261 <https://reviews.llvm.org/D150851#4467261>, @Ayal wrote:

> Would be good to try and find more accurate names than using `Select*Cmp` combinations. A Compare-Select pattern is also used in the existing Min/Max reduction, but has a much better name.

Sure, perhaps @artagnon can join the discussion as well. Let me share my experience first: in GCC, I have seen a classification called `ExtractLast`, which has semantics similar to what you described as `FindLast`. Further input and opinions are welcome.

>> 1. Select operand is a loop invariant, i.e., Select[I|F]Cmp. This has already been implemented in the D108136 <https://reviews.llvm.org/D108136> by @david-arm.
>
> and should be renamed [I|F]Any or something else more accurate. The two invariant operands should be sunk and selected after the loop, according to the outcome if "any" were found or not.

How about following the C++ STL and renaming it to `[I|F]AnyOf`? What do you think, @david-arm?

> Here's a sketch minimizing the size of the indices maintained throughout the loop, so they would avoid wrapping, provide out-of-bound values, and possibly use narrower types depending on trip-count and vl:
>
>   return_type FindLast(return_type unfound_value, vec_predicate_func, found_func) {
>     vec_unsigned_int select_red_part = splat(0); // Zero indicates unfound.
>     vec_unsigned_int step_vec = splat(1); // Count vector iterations starting at 1.
>   
>     for (unsigned int i = 0; i < n; i+=vl, step_vec+=splat(1))
>       select_red_part = vec_predicate_func(i) ? step_vec : select_red_part;
>   
>     unsigned vec_indices_ored = reduce.or(select_red_part);
>     if (vec_indices_ored == 0)
>       return unfound_value;
>     unsigned inflated_red_part = (select_red_part - splat(1)) * vl + <0,1,...,vl-1>;
>     unsigned last_index = reduce.umax(inflated_red_part);
>     return found_func(last_index);
>   }
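
To make sure I read the sketch above correctly, here is a minimal scalar emulation of it in C++. The names `kVL` and `find_last_by_iteration`, the fixed predicate `a[i] > b[i]`, and the identity `found_func` are my own assumptions for illustration, not part of the proposal. It also assumes the trip count is a multiple of the vector length. One detail I had to decide on: lanes that never matched are skipped before inflating, because `0 - 1` would wrap in unsigned arithmetic and win the umax.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kVL = 4;  // hypothetical vector length (vl)

// Returns the last i with a[i] > b[i], or unfound_value if there is none.
long find_last_by_iteration(const std::vector<int>& a,
                            const std::vector<int>& b,
                            long unfound_value) {
  // Assumption of this sketch: equal sizes and no scalar tail.
  assert(a.size() == b.size() && a.size() % kVL == 0);

  std::array<unsigned, kVL> select_red_part{};  // zero means "unfound"
  unsigned step = 1;                            // vector iteration count, from 1

  for (std::size_t i = 0; i < a.size(); i += kVL, ++step)
    for (std::size_t lane = 0; lane < kVL; ++lane)
      if (a[i + lane] > b[i + lane])
        select_red_part[lane] = step;

  unsigned ored = 0;  // reduce.or over the lanes
  for (unsigned part : select_red_part)
    ored |= part;
  if (ored == 0)
    return unfound_value;

  // Inflate per-lane iteration counts back to element indices, then take
  // the maximum. Unfound lanes are skipped to avoid unsigned wrap-around.
  std::size_t last_index = 0;
  for (std::size_t lane = 0; lane < kVL; ++lane)
    if (select_red_part[lane] != 0)
      last_index = std::max(last_index,
                            (select_red_part[lane] - 1) * kVL + lane);
  return static_cast<long>(last_index);
}
```

The values kept in the loop are bounded by the number of vector iterations rather than by `n`, which is what would allow a narrower type.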

If we focus on removing the wrapping and bound restrictions, I think we can consider the approach proposed by @artagnon in D152693 <https://reviews.llvm.org/D152693>. This method cleverly extends the technique used by @david-arm in `SelectICmp`. The approach can be summarized as follows: 
Consider the loop:

  unsigned int red = start_value;
  for (unsigned int i = 0; i < n; ++i)
    red = (a[i] > b[i]) ? i : red;

This vectorizes to:

  unsigned int red = start_value;
  vec_unsigned_int red_part = splat(start_value);
  vec_unsigned_int step_vec = {0, 1, 2, ...};
  for (unsigned int i = 0; i < n; i+=vl) {
    red_part = (vec_a[i] > vec_b[i]) ? step_vec : red_part;
    step_vec += {vl, vl, vl, ...};
  }
  vec_bool ne_start_value = red_part != splat(start_value);
  bool may_update = reduce.or(ne_start_value);
  vec_unsigned_int masked_red_part = ne_start_value ? red_part : splat(DataTypeMin);
  red = may_update ? reduce.smax|umax(masked_red_part) : start_value;
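
The pseudocode above can be checked against the scalar loop with a small lane-wise emulation. This is only a sketch under my own assumptions: `VL`, `find_last_scalar` and `find_last_lanes` are hypothetical names, the trip count is a multiple of `VL`, and `start_value` is required to lie outside `[0, n)` so that the comparison against `start_value` cannot mistake a real index for "not updated".

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

constexpr std::size_t VL = 4;  // hypothetical vector length (vl)

// Scalar reference: last i with a[i] > b[i], or start_value if none.
unsigned find_last_scalar(const std::vector<int>& a,
                          const std::vector<int>& b, unsigned start_value) {
  unsigned red = start_value;
  for (unsigned i = 0; i < a.size(); ++i)
    red = (a[i] > b[i]) ? i : red;
  return red;
}

// Lane-wise emulation of the vectorized form.
unsigned find_last_lanes(const std::vector<int>& a,
                         const std::vector<int>& b, unsigned start_value) {
  // Assumptions of this sketch: equal sizes, no scalar tail, and a
  // start_value that is not a valid index.
  assert(a.size() == b.size() && a.size() % VL == 0);
  assert(start_value >= a.size());

  std::array<unsigned, VL> red_part;
  red_part.fill(start_value);  // splat(start_value)
  std::array<unsigned, VL> step_vec;
  for (std::size_t lane = 0; lane < VL; ++lane)
    step_vec[lane] = static_cast<unsigned>(lane);  // {0, 1, 2, ...}

  for (std::size_t i = 0; i < a.size(); i += VL)
    for (std::size_t lane = 0; lane < VL; ++lane) {
      if (a[i + lane] > b[i + lane])
        red_part[lane] = step_vec[lane];
      step_vec[lane] += static_cast<unsigned>(VL);
    }

  bool may_update = false;  // reduce.or(ne_start_value)
  unsigned max_index = std::numeric_limits<unsigned>::min();  // DataTypeMin
  for (std::size_t lane = 0; lane < VL; ++lane) {
    const bool ne_start_value = red_part[lane] != start_value;
    may_update |= ne_start_value;
    const unsigned masked = ne_start_value
                                ? red_part[lane]
                                : std::numeric_limits<unsigned>::min();
    max_index = std::max(max_index, masked);  // reduce.umax
  }
  return may_update ? max_index : start_value;
}
```

Both functions should agree for any input that satisfies the asserted assumptions.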

While the conditions checked in this patch are stricter, I believe both approaches should coexist. In general, the IR generated by this patch should perform better on the cases that both can handle, so it should be prioritized when possible. For the cases that this patch cannot handle, we can apply the approach in D152693 <https://reviews.llvm.org/D152693>.
In addition, there is still room for optimization in this patch. We usually face source code like this:

  j = -1;
  for (int i = 0; i < n; i++) {
      if (a[i] < b[i]) {
          j = i;
      }
  }

When the start value of the reduction is a known constant and is known to be smaller than the start value of the increasing induction variable (here, -1 versus 0), the start value itself can never be confused with a real index, so we may not even need a sentinel value. Simply using the reduce max operation would suffice.
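
As a rough illustration of that special case, here is a lane-wise sketch in C++ for the `j = -1` loop. The names `kLanes` and `find_last_smax` are hypothetical, and the sketch assumes the trip count is a multiple of the vector length and that indices fit in `int`. Every lane starts at -1, and a single signed max over the lanes gives the result with no extra compare or select after the loop.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kLanes = 4;  // hypothetical vector length (vl)

// Returns the last i with a[i] < b[i], or -1 if there is none.
int find_last_smax(const std::vector<int>& a, const std::vector<int>& b) {
  // Assumption of this sketch: equal sizes and no scalar tail.
  assert(a.size() == b.size() && a.size() % kLanes == 0);

  std::array<int, kLanes> red_part;
  red_part.fill(-1);  // splat(start_value); -1 is below every index

  for (std::size_t i = 0; i < a.size(); i += kLanes)
    for (std::size_t lane = 0; lane < kLanes; ++lane)
      if (a[i + lane] < b[i + lane])
        red_part[lane] = static_cast<int>(i + lane);

  // reduce.smax: untouched lanes still hold -1 and lose to any real index.
  return *std::max_element(red_part.begin(), red_part.end());
}
```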

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150851/new/

https://reviews.llvm.org/D150851