[PATCH] D81036: [OpenMP] Begin Adding Analysis Remarks for OpenMP Best Practices.

Johannes Doerfert via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Jun 3 12:05:44 PDT 2020


jdoerfert added a comment.

In D81036#2071208 <https://reviews.llvm.org/D81036#2071208>, @jhuber6 wrote:

> Making minor adjustments.
>
> Any recommendations on which best practices to target next? I'm not intimately familiar with OpenMP best practices so I'm not sure which things would qualify.


We can do one for a pattern that, "in theory", we will optimize soon:
Look for pointer arguments of the outlined parallel body function that are marked `readonly` (excluding the first two, which are the thread-ID arguments). For those, tell people that the pointed-to values should be passed by value via `firstprivate`.

You can also look for a `__kmpc_barrier` right before the end of a parallel region; it is redundant there because the region's implicit join barrier follows immediately, e.g., in

  #pragma omp parallel
  {
     some code
     #pragma omp for
     for (...)
        ...
  }

I also think we should improve the `parallel` remark in a follow-up step, that is, look at the body and avoid the remark if the use of a constant thread count seems reasonable. There is also the case where a constant thread count might be reasonable but we can ask for further information, e.g.,

  #pragma omp parallel for num_threads(4)
  for (int i = 0; i < N; ++i)
    ...

We could tell the user that a constant thread count combined with an unknown loop trip count might not be what they want. Either make the thread count variable (or remove the clause), or provide information about the loop bound via operations like `__builtin_assume(N % 4 == 0)` or `__builtin_assume(N == 4 * k)` for a known `k`.

In D81036#2071289 <https://reviews.llvm.org/D81036#2071289>, @lebedev.ri wrote:

> Bikeshedding: do we really want to do this here?
>  If we are not worried about the cases where the thread count was a variable originally,
>  but the variable eventually got folded into a constant,
>  this can be trivially done as a clang-tidy check, or clang diag.


There are obviously pros and cons to either location for doing this. Let me start by saying the thread-count check is just a simple starter; the idea is to allow opt-in analyses that try to provide far more insight into the program. Later I also want to modify the code and provide feedback at runtime or based on profiling data.
That said, there are other reasons to put it here, e.g., Fortran ;)
The variable-vs-constant distinction could be checked in the frontend, but it is questionable whether the following cases are really different, i.e., whether or not we want constant propagation to happen first:

  void foo1() {
    #pragma omp parallel num_threads(8)
    {}
  }
  void foo2() {
    #pragma omp parallel num_threads(NumThreads)
    {}
  }
  void foo3(unsigned NT = 8) {
    #pragma omp parallel num_threads(NT)
    {}
  }



================
Comment at: llvm/lib/Transforms/IPO/OpenMPOpt.cpp:413
+    return false;
+  }
+
----------------
jdoerfert wrote:
> jhuber6 wrote:
> > jdoerfert wrote:
> > > Early exits are usually easier to read, thus:
> > > ```
> > > if (!UV)
> > >  return;
> > > ```
> > > 
> > > ---
> > > 
> > > Maybe we should be more descriptive, telling people that fixed numbers are not future compatible. *And* we should not issue the remark if we find a worksharing loop with a constant trip count (that is a multiple of this constant). I guess the latter part can just be a TODO for now, maybe also a test case:
> > > ```
> > > #pragma omp parallel for num_threads(8)
> > > for (int i = 0; i < 8; ++i)
> > >   body(i);
> > > ```
> > > That reminds me that we should determine the loop trip count for worksharing loops as well. We could even use that value to feed it into "set_num_threads" to avoid waking more threads than needed.
> > Might be helpful, something like "Use of OpenMP parallel region with constant number of threads is not future compatible. Consider a dynamic thread count instead." 
> > 
> > And are there any general ways to extract the loop tripcount in LLVM? Depending on where you are in the optimization the loop could take many different forms as far as I'm aware.
> "Consider using a scalable thread count." + ", or remove the `num_threads` clause."
>  And are there any general ways to extract the loop tripcount in LLVM? Depending on where you are in the optimization the loop could take many different forms as far as I'm aware.

There is, and yes, loops take many different forms depending on where you are in the pipeline. The usual way is to ask ScalarEvolution for the trip count; however, that won't work for worksharing loops.

Though worksharing loops are actually easy to predict. Look for the `__kmpc_for_init` (or similar) call and follow the lower- and upper-bound pointers. They should be written exactly once, and the difference between those two values is the trip count, at least for clang, as we normalize the loops. More generally, we also need to follow the stride argument and do a slightly more complex computation, but it is still doable. You can even use ScalarEvolution to represent the bounds, stride, and difference. That way you can also create an expression materializing the value in the code (if that is needed later), and ScalarEvolution will simplify the expression if possible, e.g., constant fold it.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D81036/new/

https://reviews.llvm.org/D81036




