[PATCH] D150851: [LoopVectorize] Vectorize select-cmp reduction pattern for increasing integer induction variable

Thu Jun 29 03:08:04 PDT 2023

Mel-Chen added a comment.

In D150851#4440545 <https://reviews.llvm.org/D150851#4440545>, @Ayal wrote:

> A selecting reduction could report the index of the value reduced in addition to the value itself. If the reduced value appears multiple times, the index of the first or last appearance can be reported. Tests for such `MinLast` cases, aka argmin, were introduced in 4f04be564907f <https://reviews.llvm.org/rG4f04be564907fb7ddff8ebc7773b892a93b00f2e>, and are yet to be vectorized by LV - hope this patch helps us get there! These are compound patterns combining three header phi's: an induction and two reductions.

Yes, this patch was split out of D143465 <https://reviews.llvm.org/D143465> based on @fhahn's suggestion. D143465 <https://reviews.llvm.org/D143465> still needs to be refined, so I haven't invited more reviewers to it yet.

> This `FindLast` patch currently meets these requirements by restricting to indices that are increasing, signed, and start from a non-min-signed value. It seems unnatural for such indices to wrap, or if PSCEV guards against AddRec wrapping in general(?), but even if an index may wrap and/or does not provide desired out-of-bounds values, a designated IV counting **vector** iterations could be used from which the original indices can later be reconstructed in the epilog and reduced. Such an IV is immune to wrapping and provides out-of-bound values. This is one of several possible ways to lift these restrictions.

This comment inspired me deeply. Let me share my thoughts and plans regarding the select-cmp reduction pattern (referred to as `[I|F]Any` in your comment).

I believe that the select-cmp reduction pattern can be classified into several types based on the selecting variable. Currently, I have categorized them as follows:

1. Select operand is a loop invariant, i.e., `Select[I|F]Cmp`. This has already been implemented in D108136 <https://reviews.llvm.org/D108136> by @david-arm.

2. Select operand is a monotonically increasing/decreasing induction variable, and the start value of the induction variable is not equal to the minimum/maximum value of the data type. This patch handles signed increasing induction variables; decreasing induction variables are yet to be implemented. The decision to handle only signed variables follows from LLVM's design: signedness affects both the choice of sentinel value and the choice between the umax and smax reduction intrinsics. If the compiler architecture allowed distinguishing between signed and unsigned, the unsigned induction variable case should be easy to support.

3. Select operand is a monotonically increasing/decreasing induction variable, and there are no restrictions on the start value of the induction variable.

  unsigned int red = start_value;
  for (unsigned int i = 0; i < n; ++i)
    red = (a[i] > b[i]) ? i : red;

4. Select operand is an arbitrary variable.

  int red = start_value;
  for (int i = 0; i < n; ++i)
    red = (a[i] > b[i]) ? c[i] : red;

Both 1) and 2) can be handled with a single reduction. On the other hand, 3) and 4) are more complex, and each requires two reductions.
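
To make the single-reduction idea for case 2) concrete, here is a minimal C sketch that emulates the vector lanes with small arrays. It is only an illustration under assumptions, not the code generated by this patch: `VL`, `find_last_scalar` and `find_last_sentinel` are hypothetical names, the trip count is assumed to be a multiple of `VL`, and the induction variable is assumed to start at 0, so `INT_MIN` is free to act as the sentinel.

```c
#include <limits.h>

#define VL 4 /* hypothetical vector length used for the emulation */

/* Scalar reference: last i with a[i] > b[i], or start_value if none. */
int find_last_scalar(const int *a, const int *b, int n, int start_value) {
  int red = start_value;
  for (int i = 0; i < n; ++i)
    red = (a[i] > b[i]) ? i : red;
  return red;
}

/* Lane-wise emulation of one smax reduction with a sentinel.
   Assumption: n is a multiple of VL and the IV starts at 0 (> INT_MIN). */
int find_last_sentinel(const int *a, const int *b, int n, int start_value) {
  int red_part[VL];
  for (int l = 0; l < VL; ++l)
    red_part[l] = INT_MIN; /* sentinel: no match seen in this lane */

  for (int i = 0; i < n; i += VL)
    for (int l = 0; l < VL; ++l)
      red_part[l] = (a[i + l] > b[i + l]) ? i + l : red_part[l];

  int max_idx = INT_MIN; /* plays the role of reduce.smax */
  for (int l = 0; l < VL; ++l)
    if (red_part[l] > max_idx)
      max_idx = red_part[l];

  /* If the sentinel survived, no element matched. */
  return (max_idx == INT_MIN) ? start_value : max_idx;
}
```

The sentinel is what lets one reduction carry both "was anything selected" and "which index was selected last", which is why the start value of the induction variable must differ from the minimum value.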

Although all select-cmp reduction patterns can be vectorized using the vectorization approach in 4), for performance, I believe that the cases in 1) and 2) should be handled with a single reduction first. Therefore, when identifying and classifying the `RecurKind` for select-cmp reduction patterns, it is preferable to first consider whether they can be handled with cases 1) or 2), and then consider whether cases 3) or 4) need to be applied.

Next, let's discuss cases 3) and 4), which have not been implemented yet.

For case 3), I currently have two approaches to solve it. The first approach is to perform reduction not only on the select part, but also on the boolean value of the cmp operation.

  unsigned int red = start_value;
  vec_bool cmp_red_part = splat(false);
  vec_unsigned_int select_red_part = splat(DTypeMin);
  vec_unsigned_int step_vec = {0, 1, 2, ...};
  for (unsigned int i = 0; i < n; i+=vl) {
    cmp_red_part = cmp_red_part | (vec_a[i] > vec_b[i]);
    select_red_part = (vec_a[i] > vec_b[i]) ? step_vec: select_red_part;
    step_vec += {vl, vl, vl, ...};
  }
  bool cmp_red = reduce.or(cmp_red_part);
  red = cmp_red ? reduce.smax|umax(select_red_part) : start_value; 
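
The pseudocode above can be emulated in plain C to check its behaviour, in particular when the only match is at index 0. This is a sketch under assumptions: `VL` and `find_last_or_umax` are hypothetical names, the lanes are modelled as arrays, and `n` is assumed to be a multiple of `VL`.

```c
#include <stdbool.h>

#define VL 4 /* hypothetical vector length used for the emulation */

/* Two reductions: an OR over the compare results and a umax over the
   selected indices. Assumption: n is a multiple of VL. */
unsigned find_last_or_umax(const int *a, const int *b, unsigned n,
                           unsigned start_value) {
  bool cmp_red_part[VL];
  unsigned select_red_part[VL];
  for (unsigned l = 0; l < VL; ++l) {
    cmp_red_part[l] = false;
    select_red_part[l] = 0; /* DTypeMin for an unsigned type */
  }

  for (unsigned i = 0; i < n; i += VL)
    for (unsigned l = 0; l < VL; ++l) {
      bool cmp = a[i + l] > b[i + l];
      cmp_red_part[l] = cmp_red_part[l] | cmp;
      select_red_part[l] = cmp ? i + l : select_red_part[l];
    }

  bool cmp_red = false;  /* plays the role of reduce.or */
  unsigned max_idx = 0;  /* plays the role of reduce.umax */
  for (unsigned l = 0; l < VL; ++l) {
    cmp_red = cmp_red | cmp_red_part[l];
    if (select_red_part[l] > max_idx)
      max_idx = select_red_part[l];
  }

  /* The OR result separates "matched at index 0" from "never matched". */
  return cmp_red ? max_idx : start_value;
}
```

Because the OR reduction records whether any compare was true, index 0 no longer has to be reserved as a sentinel, which is what removes the start-value restriction of case 2).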

The second approach is to directly use the vectorization approach in 4) to vectorize case 3).

  int red = start_value;
  vec_unsigned_int iter_red_part = splat(0);
  vec_int red_part = splat(start_value);
  vec_unsigned_int step_vec = {0, 1, 2, ...};
  for (int i = 0; i < n; i+=vl) {
    iter_red_part = (vec_a[i] > vec_b[i]) ? step_vec : iter_red_part;
    red_part = (vec_a[i] > vec_b[i]) ? vec_c[i] : red_part;
    step_vec += {vl, vl, vl, ...};
  }
  unsigned int iter_red = reduce.umax(iter_red_part);
  mask_bool red_mask = (iter_red_part == splat(iter_red));
  red = reduce.or(red_part, red_mask);  // unsure so far which reduction operation is best for extracting the result at the position red_mask indicates
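
As a rough check of this second approach, here is a lane-emulating C sketch. It makes assumptions that differ from the pseudocode: `VL` and `select_last_two_reductions` are hypothetical names, `n` is assumed to be a multiple of `VL`, and the recorded iteration numbers are 1-based so that 0 can mean "this lane never matched" (with 0-based numbers, a single match at iteration 0 would be indistinguishable from no match). The final extraction is a plain scan over the lanes; it does not settle which reduction operation would be best for that step.

```c
#define VL 4 /* hypothetical vector length used for the emulation */

/* Reduction 1 tracks the last matching iteration per lane (1-based),
   reduction 2 tracks the value selected at that iteration.
   Assumption: n is a multiple of VL and n < UINT_MAX. */
int select_last_two_reductions(const int *a, const int *b, const int *c,
                               unsigned n, int start_value) {
  unsigned iter_red_part[VL];
  int red_part[VL];
  for (unsigned l = 0; l < VL; ++l) {
    iter_red_part[l] = 0; /* 0 means: no match in this lane */
    red_part[l] = start_value;
  }

  for (unsigned i = 0; i < n; i += VL)
    for (unsigned l = 0; l < VL; ++l)
      if (a[i + l] > b[i + l]) {
        iter_red_part[l] = i + l + 1; /* 1-based iteration number */
        red_part[l] = c[i + l];
      }

  unsigned iter_red = 0; /* plays the role of reduce.umax */
  for (unsigned l = 0; l < VL; ++l)
    if (iter_red_part[l] > iter_red)
      iter_red = iter_red_part[l];

  /* Extract the lane whose iteration number equals the maximum. If
     iter_red is 0, every lane still holds start_value. */
  for (unsigned l = 0; l < VL; ++l)
    if (iter_red_part[l] == iter_red)
      return red_part[l];
  return start_value; /* not reached: some lane always equals the max */
}
```

Non-zero iteration numbers are unique across lanes, so at most one lane can hold the maximum once any match has occurred.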

Both approaches require two reductions, and one of the reductions will be a reduction phi that does not appear in the original user code. In other words, the vectorizer needs to have the capability to create a new reduction phi.

These are my thoughts on the select-cmp reduction pattern so far.

> Note that `Any` reductions reporting the first index can terminate once "true" is encountered, but seem more cumbersome to write (w/o a break), e.g.,:
>
>   // FindFirst w/o break.
>   int red = ii;
>   int red_set = false;
>   for (int i = 0; i < n; ++i)
>     if (a[i] > b[i]) {
>       red = red_set ? red : i;
>       red_set = true;
>     }
>
> instead of
>
>   // FindLast.
>   int red = ii;
>   for (int i = 0; i < n; ++i)
>     red = (a[i] > b[i]) ? i : red;
>
> A `FindLast` loop could be optimized into a `FindFirst` one by reversing the loop.

Interesting, I haven't thought about `FindFirst` yet. If it includes a break statement, that is another long story: vectorization of uncountable loops. Although I haven't considered the `FindFirst` case deeply, I still have some rough ideas to share.

Perhaps we can simplify the `FindFirst w/o break` example to:

  // FindFirst w/o break.
  int red = ii;
  int red_set = false;
  for (int i = 0; i < n; ++i) {
    if ((a[i] > b[i]) && !red_set)   // reduction 1
      red = i;
    if (a[i] > b[i]) // reduction 2
      red_set = true;
  }

In this way, we can clearly see that there are two reductions involved, and the result of one reduction will be masked by the result of the other reduction. This is very interesting, and may be similar to the pattern in D143465 <https://reviews.llvm.org/D143465>.
If we can transform the code into:

  // FindFirst w/o break.
  int red = ii;
  int red_set = false;
  for (int i = 0; i < n; ++i){
    if ((a[i] > b[i]) && !red_set)   // reduction 1
      red = i;
    red_set = red_set | (a[i] > b[i]);  // reduction 2
  }

then perhaps it will lead to better optimization results.
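
As a quick sanity check that this rewrite preserves the result, the two scalar forms can be compared directly. This is a minimal sketch; the function names `find_first_nested` and `find_first_split` are hypothetical.

```c
/* FindFirst w/o break, in the nested form from the quoted comment. */
int find_first_nested(const int *a, const int *b, int n, int ii) {
  int red = ii;
  int red_set = 0;
  for (int i = 0; i < n; ++i)
    if (a[i] > b[i]) {
      red = red_set ? red : i;
      red_set = 1;
    }
  return red;
}

/* FindFirst w/o break, split into two explicit reductions. */
int find_first_split(const int *a, const int *b, int n, int ii) {
  int red = ii;
  int red_set = 0;
  for (int i = 0; i < n; ++i) {
    if ((a[i] > b[i]) && !red_set) /* reduction 1, masked by red_set */
      red = i;
    red_set = red_set | (a[i] > b[i]); /* reduction 2: OR of compares */
  }
  return red;
}
```

Both functions return the first index where `a[i] > b[i]`, or `ii` when no element matches.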

================
Comment at: llvm/test/Transforms/LoopVectorize/select-min-index.ll:89
+; CHECK-VF4IC1-NEXT:  entry:
+; CHECK-VF4IC1-NEXT:    br i1 true, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK-VF4IC1:       vector.ph:
----------------
Ayal wrote:
> This test now gets vectorized, being a `FindLast` loop that reports the last index where a[i] < a[i-1]+1, or zero if none are found. (I.e., proving that a sequence is not strictly increasing, rather than computing `MinLast`.)
> But the vector loop is never reached?
Impressive catch!
We have been focusing only on vector.body and ignoring the other blocks. I will prioritize investigating this bug and fix it as soon as possible.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150851/new/

https://reviews.llvm.org/D150851