[PATCH] D88819: [LV] Support for Remainder loop vectorization

Thu Oct 15 08:07:02 PDT 2020

mivnay added a comment.

The C-like pseudocode below summarizes the current changes. After it, I try to answer some of the common questions that were raised.

Before Loop Vectorization:

  #include <stdint.h>

  void func(int8_t *A, int8_t *B, int8_t *C, int N) {
    for (int I = 0; I < N; ++I)
      A[I] = B[I] + C[I];
  }

After Loop Vectorization (with epilog disabled):

  // After Loop Vectorization
  void func1(int8_t *A, int8_t *B, int8_t *C, int N) {
    int I1 = 0; // SCALAR_LOOP reads I1 even when the vector loop is skipped
    int VF1;
    bool alias_check_1;
    bool scev_check_1;

    if (N >= VF1) { // iteration_check_1
      if (!alias_check_1)
        goto SCALAR_LOOP;

      if (!scev_check_1)
        goto SCALAR_LOOP;

      // vector_loop_1
      for (I1 = 0; I1 + VF1 <= N; I1 += VF1) // only full VF1-wide chunks
        A[I1:(I1 + VF1 - 1)] = B[I1:(I1 + VF1 - 1)] + C[I1:(I1 + VF1 - 1)];

      goto SCALAR_LOOP_WITH_CHECK;
    } else
      goto SCALAR_LOOP;

  SCALAR_LOOP_WITH_CHECK:
    if (N - I1 > 0) { // remainder_iteration_check_1
    SCALAR_LOOP:
      for (int I = I1; I < N; ++I)
        A[I] = B[I] + C[I];

      goto EXIT;
    } else
      goto EXIT;

  EXIT:
    return;
  }

After Epilog Loop Vectorization:

  void func2(int8_t *A, int8_t *B, int8_t *C, int N) {
    int I1 = 0, I2 = 0; // SCALAR_LOOP reads I2 even when both vector loops are skipped
    int VF1, VF2;
    bool alias_check_1, alias_check_2;
    bool scev_check_1, scev_check_2;
    bool is_vector_loop_executed = false;

    if (N >= VF1) { // iteration_check_1
      if (!alias_check_1)
        goto SCALAR_LOOP; // optimization_1

      if (!scev_check_1)
        goto SCALAR_LOOP; // optimization_1

      // Vector Loop
      for (I1 = 0; I1 + VF1 <= N; I1 += VF1) // only full VF1-wide chunks
        A[I1:(I1 + VF1 - 1)] = B[I1:(I1 + VF1 - 1)] + C[I1:(I1 + VF1 - 1)];
      is_vector_loop_executed = true;
      goto EPILOG_LOOP_ENTRY_WITH_CHECK;
    } else
      goto EPILOG_LOOP_ENTRY;

  EPILOG_LOOP_ENTRY_WITH_CHECK:
    if (N - I1 == 0) { // remainder_iteration_check_1
      goto EXIT;
    }

  EPILOG_LOOP_ENTRY:     // I1 is mostly 0 here and ignored in the actual code.
    I2 = I1;             // the remaining loops resume where the vector loop stopped
    if (N - I1 >= VF2) { // iteration_check_2

      if (!is_vector_loop_executed) { // optimization_2
        if (!alias_check_2)
          goto SCALAR_LOOP;

        if (!scev_check_2)
          goto SCALAR_LOOP;
      }
      // Epilog Vector Loop
      for (I2 = I1; I2 + VF2 <= N; I2 += VF2) // resume at I1, full VF2-wide chunks only
        A[I2:(I2 + VF2 - 1)] = B[I2:(I2 + VF2 - 1)] + C[I2:(I2 + VF2 - 1)];

      goto SCALAR_LOOP_WITH_CHECK;
    } else
      goto SCALAR_LOOP;

  SCALAR_LOOP_WITH_CHECK:
    if (N - I2 > 0) { // remainder_iteration_check_2
    SCALAR_LOOP:
      for (int I = I2; I < N; ++I)
        A[I] = B[I] + C[I];

      goto EXIT;
    } else
      goto EXIT;

  EXIT:
    return;
  }

NOTE: 

1. Function names are changed for reference purposes only.
2. VF1 is the vectorization factor of the main vector loop; VF2 is the epilog vectorization factor.
3. The SCEV, alias, and iteration checks may not be present for every vectorized loop.
4. `is_vector_loop_executed` is actually implemented as a PHI node.
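
For readers who want something they can compile, here is a minimal plain-C sketch of the three-stage structure of func2(). It is not the code the vectorizer emits. The name `func2_sketch`, the helper `add_block`, and the factors VF1 = 16 and VF2 = 8 are assumptions chosen for illustration. Each vector operation is emulated with a fixed-trip inner loop, and the runtime alias and SCEV checks are left out by assuming that A does not overlap B or C.

```c
#include <stdint.h>

/* Assumed factors, for illustration only. */
enum { VF1 = 16, VF2 = 8 };

/* Emulates one vector add of width VF starting at index Start. */
static void add_block(int8_t *A, const int8_t *B, const int8_t *C,
                      int Start, int VF) {
  for (int L = 0; L < VF; ++L)
    A[Start + L] = (int8_t)(B[Start + L] + C[Start + L]);
}

/* Hypothetical name. Assumes A does not overlap B or C, so the runtime
   checks of func2() are omitted. */
void func2_sketch(int8_t *A, const int8_t *B, const int8_t *C, int N) {
  int I = 0;

  /* Main vector loop: full VF1-wide chunks only. */
  for (; I + VF1 <= N; I += VF1)
    add_block(A, B, C, I, VF1);

  /* Epilog vector loop: full VF2-wide chunks of what is left. */
  for (; I + VF2 <= N; I += VF2)
    add_block(A, B, C, I, VF2);

  /* Scalar remainder loop: fewer than VF2 iterations remain. */
  for (; I < N; ++I)
    A[I] = (int8_t)(B[I] + C[I]);
}
```

The loop conditions play the role of iteration_check_1, iteration_check_2 and remainder_iteration_check_2: a loop whose condition is false on entry is skipped entirely.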

Why epilog loop vectorization?
------------------------------

There are two kinds of cases where it benefits:

1. The number of remainder iterations left after the original vectorization is large enough that vectorizing them is worthwhile.

  Example: For i8 types, with VF = 16 and a trip count of 24, the 8 remaining iterations can be handled by an epilog vector loop with VF = 8.

2. The original trip count itself is small.

  Example: The original vectorization picks VF = 16 for i8 types, but the trip count is only 8.

We are trying to cover both cases in this patch.
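
The arithmetic behind the two examples can be sketched as follows. The helper `split_trip_count` and the struct `trip_split` are hypothetical names, not part of the patch. They only compute how many iterations each loop would handle for a given trip count and pair of factors, assuming every runtime check passes.

```c
#include <assert.h>

/* Hypothetical helper, not part of the patch: how many iterations each
   loop handles for trip count N, main factor VF1 and epilog factor VF2. */
struct trip_split {
  int main_vector;   /* iterations done by the main vector loop */
  int epilog_vector; /* iterations done by the epilog vector loop */
  int scalar;        /* iterations left for the scalar loop */
};

struct trip_split split_trip_count(int N, int VF1, int VF2) {
  assert(N >= 0 && VF1 > 0 && VF2 > 0);
  struct trip_split S;
  S.main_vector = (N / VF1) * VF1;     /* full VF1-wide chunks */
  int Rem = N - S.main_vector;         /* what the main loop leaves behind */
  S.epilog_vector = (Rem / VF2) * VF2; /* full VF2-wide chunks of the rest */
  S.scalar = Rem - S.epilog_vector;
  return S;
}
```

For case 1, split_trip_count(24, 16, 8) gives 16 main, 8 epilog and 0 scalar iterations. For case 2, split_trip_count(8, 16, 8) gives 0 main, 8 epilog and 0 scalar iterations. Without the epilog vector loop, the 8 leftover iterations in both cases would run in the scalar loop.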

On what basis was the order of checks decided?
----------------------------------------------

The order was decided based on profile information from the current candidates we have in SPEC CPU 2017. We did not find any regressions with the current order.
Also, the current order of checks does not disturb the original vectorization flow even when epilog vectorization is done, except for the epilog loop iteration check (iteration_check_2 in func2()).

Why not re-run the vectorizer?
------------------------------

Short answer:

Re-running the vectorizer is not optimal.

Long answer:

We have runtime checks in front of both the vector loop and the epilog vector loop. They are needed because the iteration check for the original VF (VF1 in func2()) might fail, in which case control goes directly to the epilog loops (EPILOG_LOOP_ENTRY). In that case the original SCEV and alias checks are never executed, so the epilog vector loop must run its own checks before it can be entered.

There are two optimizations which avoid re-running the checks:

a. optimization_1 in func2(): If any of the SCEV and alias checks fails for the original vector loop, go directly to SCALAR_LOOP (instead of EPILOG_LOOP_ENTRY, as would happen when re-running the vectorizer).
b. optimization_2 in func2(): If the original vector loop has already executed the checks and passed them, do not run them again for the epilog vector loop.

Re-running the vectorizer would not give us access to all these checks in the CFG. That is why the changes are made inside InnerLoopVectorizer. I do not see any optimization eliminating the redundant blocks after blindly re-running the vectorizer. This was also discussed in the older RFC.
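
To make the cost of such a check concrete, here is a rough sketch of what an alias check like alias_check_1 might compute for the example loop. This is an assumption for illustration, not the check the vectorizer emits: the names `no_overlap` and `alias_check` are hypothetical, N is assumed to be non-negative, and comparing pointers through `uintptr_t` assumes a flat address space.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when the byte ranges [P, P + N) and [Q, Q + N) do not overlap.
   Assumes N >= 0 and a flat address space. */
static bool no_overlap(const int8_t *P, const int8_t *Q, int N) {
  uintptr_t PB = (uintptr_t)P, QB = (uintptr_t)Q;
  return PB + (uintptr_t)N <= QB || QB + (uintptr_t)N <= PB;
}

/* Hypothetical stand-in for alias_check_1 / alias_check_2 in func2():
   the stored-to range A must not overlap either loaded-from range.
   B and C may overlap each other because both are only read. */
bool alias_check(const int8_t *A, const int8_t *B, const int8_t *C, int N) {
  return no_overlap(A, B, N) && no_overlap(A, C, N);
}
```

The result depends only on the pointers and N, not on the vectorization factor, which is why a pass in front of the main vector loop can be reused for the epilog vector loop (optimization_2) and a failure can go straight to the scalar loop (optimization_1).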

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D88819/new/

https://reviews.llvm.org/D88819