[PATCH] D116292: [OMPIRBuilder][MLIR] Support ordered clause specified with parameter

Fri Jan 28 23:15:20 PST 2022

peixin added a comment.

Thanks @Meinersbur for the review and good comments.

In D116292#3280853 <https://reviews.llvm.org/D116292#3280853>, @Meinersbur wrote:

>> With this change, lowering to LLVM IR is supported when dynamic schedule
>> is specified and collapse value is greater than 1. Also add the test
>> case.
>
> Could you explain what goes bad when you do not do this?

For dynamic schedule, it overrides the `afterIP`. But when collapse value is greater than 1, it should use the `afterIP` stored before transforming the collapsed loops. You can check the changes of this patch in `mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp`.

> In D116292#3210335 <https://reviews.llvm.org/D116292#3210335>, @peixin wrote:
>
>> Clang transforms the doacross loop nest into a new one with lower bound of 0 and step of 1. However, this is really not necessary. OpenMP runtime can handle the doacross loop nest regardless of positive or negative step https://github.com/llvm/llvm-project/blob/7c3cf4c2c0689be1a08b8a1326703ec5770de471/openmp/runtime/src/kmp_csupport.cpp#L4050-L4058. The doacross loop nest is independent of worksharing-loop.
>
> While it does contain code for it, it is also wrong in edge case:
>
> 1. If lo is larger than up (I assume there must be check for this somewhere, but I don't who is responsible for checking it; the compiler-emitted code?)

Do you mean lo is greater than up and step is positive? Normalization don't check this case, either. I think that is the problem of the user code.

> 2. If `lo - up` overflows, in particular if the loop counter variable itself is `int64_t`. The trip count itself doesn't even need to be large if the increment value is large as well.

If `lo - up` overflows, computing the trip count also overflows. The worksharing-loop seems not to support it.

> 3. Potentially if the loop counter variable is `uint64_t` and lo/up larger than 2^63.

For `flang`, the type of loop counter variable must be int32_t or int64_t, and there is no signed integer 64-bit in fortran. For `clang`, the `uint64_t` is converted into `int64_t` and there is no `uint64_t` passed to `__kmpc_doacross_init`. For example, the lower bound is -1 if it is declared as `unsigned long long lb = ULLONG_MAX;`. In this case, the computation here is `trace_count = (uint64_t) (-1 - 1) / 1 + 1 = ULLONG_MAX` for the statement `for (unsigned long long i = ULLONG_MAX; i >= 1; i--)` and it is correct.

> An integer loop variable must be introduced anyway for loops over iterators, we might just as well normalize everything to a simplified logical iteration space and not have to bother with overflows later.

For `clang`, the variable can be normalized according the operators in the for loop such as `<, >, <=, >=, !=`. But for `flang`, it is not easy to know if lb is greater than ub or not. Let's look at the following example,

  !$omp do ordered(1)
  do i = lb, up, step
    !$omp ordered depend(i-1)
    func(i-1)
    ...
  enddo

For Ordered Depend directive, how to transform `i-1` for the argument of `__kmpc_doacrosss_wait` is hard to know. Actually, normalization in clang is not correct in all cases. I found one bug as follows:

  #include <iostream>
  using namespace std;

  int main() {
    int i, i_lb = 1, i_ub = 10, i_step = 1;
    int a[10];

    for (i = 0; i < 10; i++)
      a[i] = 1;

    #pragma omp parallel num_threads(9)
    #pragma omp for ordered(1)
    for (i = i_lb; i != i_ub; i = i + i_step) {
      #pragma omp ordered depend(sink: i-1)
      a[i] = a[i-1] + 1;
      #pragma omp ordered depend(source)
    }

    for (i = 0; i < 10; i++)
      cout << a[i] << " ";
    cout << endl;
    return 0;
  }

  $ clang++ case.cpp && ./a.out
  1 2 3 4 5 6 7 8 9 10 
  $ clang++ case.cpp -fopenmp && ./a.out
  1 1 1 1 1 1 1 1 1 1 

The problem is `clang` thinks that i_lb is greater than i_ub in this case and the normalization of `depend(sink: i-1)` is wrong.

For other comments, will fix them later.

================
Comment at: llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h:424
+  ///                     loops.
+  void applyDoacrossLoop(DebugLoc DL, InsertPointTy AllocaIP,
+                         BasicBlock *PreHeaderBB, BasicBlock *ExitBB,
----------------
Meinersbur wrote:
> Did you consider making doacross part of an existing call like applyDynamicWorkshareLoop? What are the reason against it? If is a potential `collapseLoop` that loses information of the dimensionality of the original loop, did you consider adding that information to `CanonicalLoopInfo` such that it can be preserved?
The `doacrossloop` only inserts the init and fini calls into current worksharing-loop. If making it like applyDynamicWorkshareLoop, three are needed, i.e., applyDoacrossDynamicLoop, applyDoacrossStaticLoop, and applyDoacrossStaticChunkLoop, which is too redundant.

Worksharing loop is commonly used, but ordered(n) is not commonly used. In some workloads, there is even no ordered(n) clause. Adding doacorss loop info into CanonicalLoopInfo will have some cost, which is not necessary. What do you think?

================
Comment at: mlir/test/Dialect/OpenMP/invalid.mlir:123
   // expected-error at +1 {{ordered is not a valid clause for the omp.parallel operation}}
-  omp.parallel ordered(2) {}
+  omp.parallel ordered(0) {}
 }
----------------
Meinersbur wrote:
> Why this change?
The doacross loop is not implemented before. If the ordered value is greater than 1, there is one virtual doacross clause attached with this patch. This check only checks if there is ordered clause. 

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D116292/new/

https://reviews.llvm.org/D116292