[Openmp-commits] [openmp] 1616407 - [OpenMP] Add remark documentation to the OpenMP webpage

Joseph Huber via Openmp-commits openmp-commits at lists.llvm.org
Fri Jul 16 11:10:10 PDT 2021


Author: Joseph Huber
Date: 2021-07-16T14:09:43-04:00
New Revision: 16164079213ded81706a9c6d00874805e2dccbdd

URL: https://github.com/llvm/llvm-project/commit/16164079213ded81706a9c6d00874805e2dccbdd
DIFF: https://github.com/llvm/llvm-project/commit/16164079213ded81706a9c6d00874805e2dccbdd.diff

LOG: [OpenMP] Add remark documentation to the OpenMP webpage

This patch begins adding documentation for each remark emitted by
`openmp-opt`. This builds on the IDs introduced in D105939 so that users
can more easily identify each remark in the webpage.

Depends on D105939.

Reviewed By: jdoerfert

Differential Revision: https://reviews.llvm.org/D106018

Added: 
    openmp/docs/remarks/OMP100.rst
    openmp/docs/remarks/OMP101.rst
    openmp/docs/remarks/OMP102.rst
    openmp/docs/remarks/OMP110.rst
    openmp/docs/remarks/OMP111.rst
    openmp/docs/remarks/OMP112.rst
    openmp/docs/remarks/OMP113.rst
    openmp/docs/remarks/OMP120.rst
    openmp/docs/remarks/OMP121.rst
    openmp/docs/remarks/OMP130.rst
    openmp/docs/remarks/OMP131.rst
    openmp/docs/remarks/OMP132.rst
    openmp/docs/remarks/OMP133.rst
    openmp/docs/remarks/OMP140.rst
    openmp/docs/remarks/OMP150.rst
    openmp/docs/remarks/OMP160.rst
    openmp/docs/remarks/OMP170.rst

Modified: 
    openmp/docs/remarks/OptimizationRemarks.rst

Removed: 
    


################################################################################
diff --git a/openmp/docs/remarks/OMP100.rst b/openmp/docs/remarks/OMP100.rst
new file mode 100644
index 0000000000000..72a7fd27d94ba
--- /dev/null
+++ b/openmp/docs/remarks/OMP100.rst
@@ -0,0 +1,26 @@
+Potentially unknown OpenMP target region caller `[OMP100]`
+==========================================================
+
+.. _omp100:
+.. _omp_no_external_caller_in_target_region:
+
+A function remark that indicates the function, when compiled for a GPU, is
+potentially called from outside the translation unit. Note that a remark is
+only issued if we tried to perform an optimization which would require us to
+know all callers on the GPU.
+
+To facilitate OpenMP semantics on GPUs we provide a runtime mechanism through
+which the code that makes up the body of a parallel region is shared with the
+threads in the team. Generally we use the address of the outlined parallel
+region to identify the code that needs to be executed. If we know all target
+regions that reach the parallel region we can avoid this function pointer
+passing scheme and often improve the register usage on the GPU. However, if a
+parallel region on the GPU is in a function with external linkage we may not
+know all callers statically. If there are outside callers within target
+regions, this remark is to be ignored. If there are no such callers, users can
+modify the linkage and thereby help optimization with a `static` or
+`__attribute__((internal))` function annotation. If changing the linkage is
+impossible, e.g., because there are outside callers on the host, one can split
+the function into an externally visible interface which is not compiled for
+the target and an internal implementation which is compiled for the target
+and should be called from within the target region.
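+
+The pattern this remark describes can be sketched with a hypothetical function
+``body``: because ``body`` has external linkage, the compiler cannot rule out
+callers from other translation units when the function is compiled for the
+GPU, so the function pointer passing scheme must be kept.
+
+.. code-block:: c++
+
+  // Externally visible: target regions in other translation units may call
+  // this function on the GPU, so not all callers are known statically.
+  void body() {
+  #pragma omp parallel
+    {
+      // parallel work
+    }
+  }
+
+  // Giving the function internal linkage makes every GPU caller visible:
+  //   static void body() { /* ... */ }
+
+  void foo() {
+  #pragma omp target
+    body();
+  }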

diff --git a/openmp/docs/remarks/OMP101.rst b/openmp/docs/remarks/OMP101.rst
new file mode 100644
index 0000000000000..4483cfcc33a98
--- /dev/null
+++ b/openmp/docs/remarks/OMP101.rst
@@ -0,0 +1,6 @@
+Parallel region is used in unknown / unexpected ways. Will not attempt to rewrite the state machine. [OMP101]
+=============================================================================================================
+
+.. _omp101:
+
+An analysis remark that indicates that a parallel region has calls or uses the
+compiler cannot analyze, so the state machine rewrite is not attempted.

diff --git a/openmp/docs/remarks/OMP102.rst b/openmp/docs/remarks/OMP102.rst
new file mode 100644
index 0000000000000..32ae59ae48239
--- /dev/null
+++ b/openmp/docs/remarks/OMP102.rst
@@ -0,0 +1,8 @@
+Parallel region is not called from a unique kernel. Will not attempt to rewrite the state machine. [OMP102]
+===========================================================================================================
+
+.. _omp102:
+
+This analysis remark indicates that a given parallel region is called by
+multiple target kernels. This prevents the compiler from specializing the
+parallel region for a single kernel and rewriting the state machine.
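+
+A sketch of this situation, using hypothetical function names: the parallel
+region inside ``shared_body`` is reachable from two distinct target kernels,
+so it cannot be specialized for a unique kernel.
+
+.. code-block:: c++
+
+  void shared_body() {
+  #pragma omp parallel
+    {
+      // parallel work
+    }
+  }
+
+  void foo() {
+  // First kernel that reaches the parallel region.
+  #pragma omp target
+    shared_body();
+
+  // Second kernel that reaches the same parallel region.
+  #pragma omp target
+    shared_body();
+  }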

diff --git a/openmp/docs/remarks/OMP110.rst b/openmp/docs/remarks/OMP110.rst
new file mode 100644
index 0000000000000..6d69d6d4bb20f
--- /dev/null
+++ b/openmp/docs/remarks/OMP110.rst
@@ -0,0 +1,83 @@
+Moving globalized variable to the stack. [OMP110]
+=================================================
+
+.. _omp110:
+
+This optimization remark indicates that a globalized variable was moved back to
+thread-local stack memory on the device. This occurs when the optimization pass
+can determine that a globalized variable cannot possibly be shared between
+threads and globalization was unnecessary. Using stack memory is the best-case
+scenario for data globalization as the variable can now be stored in fast
+register files on the device. This optimization requires full visibility of each
+variable.
+
+Globalization typically occurs when a pointer to a thread-local variable escapes
+the current scope. The compiler needs to be pessimistic and assume that the
+pointer could be shared between multiple threads according to the OpenMP
+standard. This is expensive on target offloading devices that do not allow
+threads to share data by default. Instead, this data must be moved to memory
+that can be shared, such as shared or global memory. This optimization moves the
+data back from shared or global memory to thread-local stack memory if the data
+is not actually shared between the threads.
+
+Examples
+--------
+
+The following example shows a trivial case of globalization. The compiler sees
+that a pointer to the thread-local variable ``x`` escapes the current scope and
+must globalize it even though that is not actually necessary. Fortunately, this
+optimization can undo the globalization by looking at the variable's usage.
+
+.. code-block:: c++
+
+  void use(int *x) { }
+  
+  void foo() {
+    int x;
+    use(&x);
+  }
+  
+  int main() {
+  #pragma omp target parallel
+    foo();
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 omp110.cpp -O1 -Rpass=openmp-opt
+  omp110.cpp:6:7: remark: Moving globalized variable to the stack. [OMP110]
+    int x;
+        ^
+
+A less trivial example can be seen using C++'s complex numbers. In this case the
+overloaded arithmetic operators cause pointers to the complex numbers to escape
+the current scope, but they can again be removed once the usage is visible.
+
+.. code-block:: c++
+
+  #include <complex>
+  
+  using complex = std::complex<double>;
+  
+  void zaxpy(complex *X, complex *Y, const complex D, int N) {
+  #pragma omp target teams distribute parallel for firstprivate(D)
+    for (int i = 0; i < N; ++i)
+      Y[i] = D * X[i] + Y[i];
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 omp110.cpp -O1 -Rpass=openmp-opt
+  In file included from omp110.cpp:1:
+  In file included from /usr/bin/clang/lib/clang/13.0.0/include/openmp_wrappers/complex:27:
+  /usr/include/c++/8/complex:328:20: remark: Moving globalized variable to the stack. [OMP110]
+        complex<_Tp> __r = __x;
+                     ^
+  /usr/include/c++/8/complex:388:20: remark: Moving globalized variable to the stack. [OMP110]
+        complex<_Tp> __r = __x;
+                     ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading optimization remark.

diff --git a/openmp/docs/remarks/OMP111.rst b/openmp/docs/remarks/OMP111.rst
new file mode 100644
index 0000000000000..c0f4e3b9adc28
--- /dev/null
+++ b/openmp/docs/remarks/OMP111.rst
@@ -0,0 +1,66 @@
+Replaced globalized variable with X bytes of shared memory. [OMP111]
+====================================================================
+
+.. _omp111:
+
+This optimization occurs when a globalized variable's data is shared between
+multiple threads, but requires a static amount of memory that can be determined
+at compile time. This is the case when only a single thread creates the memory
+and then shares it with every thread. The memory can then be pushed to a
+static buffer of shared memory on the device. This optimization allows users to
+declare shared memory on the device without using OpenMP's allocators.
+
+Globalization normally occurs when a pointer to a thread-local variable escapes
+the current scope. If a single thread is responsible for creating and sharing
+the data it can instead be mapped directly to shared memory on the target
+device. Checking if only a single thread can execute an instruction requires
+that the parent functions have internal linkage. Otherwise, an external caller
+could invalidate this analysis by having multiple threads call that function.
+The optimization pass can automatically make internal copies of each function,
+but it is still recommended to mark them as internal using keywords like
+``static`` whenever possible.
+
+Example
+-------
+
+This optimization should apply to any variable declared in an OpenMP target
+region that is then shared with every thread in a parallel region. This allows
+the user to declare shared memory without using custom allocators. A simple
+stencil calculation shows how this can be used.
+
+.. code-block:: c++
+
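+  // MC and NC are assumed to be compile-time tile-size constants and dX a
+  // grid-spacing constant, all defined elsewhere in the translation unit.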
+  void stencil(int M, int N, double *X, double *Y) {
+  #pragma omp target teams distribute collapse(2) \
+    map(to : X [0:M * N]) map(tofrom : Y [0:M * N])
+    for (int i0 = 0; i0 < M; i0 += MC) {
+      for (int j0 = 0; j0 < N; j0 += NC) {
+        double sX[MC][NC];
+  
+  #pragma omp parallel for collapse(2) default(firstprivate)
+        for (int i1 = 0; i1 < MC; ++i1)
+          for (int j1 = 0; j1 < NC; ++j1)
+            sX[i1][j1] = X[(i0 + i1) * N + (j0 + j1)];
+  
+  #pragma omp parallel for collapse(2) default(firstprivate)
+        for (int i1 = 1; i1 < MC - 1; ++i1)
+          for (int j1 = 1; j1 < NC - 1; ++j1)
+            Y[(i0 + i1) * N + (j0 + j1)] = (sX[i1 + 1][j1] + sX[i1 - 1][j1] +
+                                          sX[i1][j1 + 1] + sX[i1][j1 - 1] +
+                                          -4.0 * sX[i1][j1]) / (dX * dX);
+      }   
+    }
+  }
+
+.. code-block:: console
+
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass=openmp-opt -fopenmp-version=51 omp111.cpp 
+  omp111.cpp:10:14: remark: Replaced globalized variable with 8192 bytes of shared memory. [OMP111]
+      double sX[MC][NC];
+             ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading optimization remark.

diff --git a/openmp/docs/remarks/OMP112.rst b/openmp/docs/remarks/OMP112.rst
new file mode 100644
index 0000000000000..b4b7c7d48f07d
--- /dev/null
+++ b/openmp/docs/remarks/OMP112.rst
@@ -0,0 +1,89 @@
+Found thread data sharing on the GPU. Expect degraded performance due to data globalization. [OMP112]
+=====================================================================================================
+
+.. _omp112:
+
+This missed remark indicates that a globalized value was found on the target
+device that was neither replaced with stack memory by :ref:`OMP110 <omp110>`
+nor with shared memory by :ref:`OMP111 <omp111>`. Globalization that has not
+been removed will need to be handled by the runtime and will significantly
+hurt performance.
+
+The OpenMP standard expects that threads can always share their data with each
+other. However, this is not true by default when offloading to a target device
+such as a GPU. Threads on a GPU cannot share their data unless it is first
+placed in global or shared memory. In order to create standards-compliant code,
+the Clang compiler will globalize any variables that could potentially be
+shared between the threads. In the majority of cases, globalized variables can
+either be returned to a thread-local stack or pushed to shared memory. However,
+in a few cases globalization is truly necessary and will cause a performance
+penalty.
+
+Examples
+--------
+
+This example shows legitimate data sharing on the device. It is a convoluted
+example, but is completely compliant with the OpenMP standard. If globalization
+was not added, this would produce different results on different target
+devices.
+
+.. code-block:: c++
+
+  #include <omp.h>
+  #include <cstdio>
+  
+  #pragma omp declare target
+  static int *p;
+  #pragma omp end declare target
+  
+  void foo() {
+    int x = omp_get_thread_num();
+    if (omp_get_thread_num() == 1)
+      p = &x;
+  
+  #pragma omp barrier
+  
+    printf ("Thread %d: %d\n", omp_get_thread_num(), *p);
+  }
+  
+  int main() {
+  #pragma omp target parallel
+    foo();
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass-missed=openmp-opt omp112.cpp 
+  omp112.cpp:9:7: remark: Found thread data sharing on the GPU. Expect degraded performance
+  due to data globalization. [OMP112] [-Rpass-missed=openmp-opt]
+  int x = omp_get_thread_num();
+      ^
+
+A less convoluted example of globalization that cannot be removed occurs when
+calling functions that aren't visible from the current translation unit.
+
+.. code-block:: c++
+
+  extern void use(int *x);
+  
+  void foo() {
+    int x;
+    use(&x);
+  }
+  
+  int main() {
+  #pragma omp target parallel
+    foo();
+  }
+
+.. code-block:: console
+
+  $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass-missed=openmp-opt omp112.cpp 
+  omp112.cpp:4:7: remark: Found thread data sharing on the GPU. Expect degraded performance
+  due to data globalization. [OMP112] [-Rpass-missed=openmp-opt]
+  int x;
+      ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading missed remark.

diff --git a/openmp/docs/remarks/OMP113.rst b/openmp/docs/remarks/OMP113.rst
new file mode 100644
index 0000000000000..d319f7ab68fb5
--- /dev/null
+++ b/openmp/docs/remarks/OMP113.rst
@@ -0,0 +1,81 @@
+Could not move globalized variable to the stack. Variable is potentially captured in call. Mark parameter as `__attribute__((noescape))` to override. [OMP113]
+==============================================================================================================================================================
+
+.. _omp113:
+
+This missed remark indicates that a globalized value could not be moved to the
+stack because it is potentially captured by a call to a function we cannot
+analyze. In order for a globalized variable to be moved to the stack, no copies
+of its pointer may be stored. Otherwise it is considered captured and could
+potentially be shared between the threads. This can be overridden using a
+parameter-level attribute as suggested in the remark text.
+
+Globalization will occur when a pointer to a thread-local variable escapes
+the current scope. In most cases it can be determined that the variable cannot
+be shared if a copy of its pointer is never made. However, this remark indicates
+that a copy of the pointer either is present, or is possible because the pointer
+is used outside the current translation unit.
+
+Examples
+--------
+
+If a pointer to a thread-local variable is passed to a function not visible in
+the current translation unit we need to assume a copy is made of it that can be
+shared between the threads. This prevents :ref:`OMP110 <omp110>` from
+triggering, which will result in a performance penalty when executing on the
+target device.
+
+.. code-block:: c++
+
+  extern void use(int *x);
+  
+  void foo() {
+    int x;
+    use(&x);
+  }
+  
+  int main() {
+  #pragma omp target parallel
+    foo();
+  }
+
+.. code-block:: console
+
+   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass-missed=openmp-opt omp113.cpp
+   omp113.cpp:4:7: remark: Could not move globalized variable to the stack. Variable is 
+   potentially captured in call. Mark parameter as `__attribute__((noescape))` to 
+   override. [OMP113]
+     int x;
+         ^
+
+As the remark suggests, this behaviour can be overridden using the ``noescape``
+attribute. This tells the compiler that no reference to the object the pointer
+points to that is derived from the parameter value will survive after the
+function returns. The user is responsible for verifying that this assertion is
+correct.
+
+.. code-block:: c++
+
+  extern void use(__attribute__((noescape)) int *x);
+
+  void foo() {
+    int x;
+    use(&x);
+  }
+  
+  int main() {
+  #pragma omp target parallel
+    foo();
+  }
+
+.. code-block:: console
+
+   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass=openmp-opt omp113.cpp
+   omp113.cpp:4:7: remark: Moving globalized variable to the stack. [OMP110]
+   int x;
+       ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading missed remark.

diff --git a/openmp/docs/remarks/OMP120.rst b/openmp/docs/remarks/OMP120.rst
new file mode 100644
index 0000000000000..d3c626db34415
--- /dev/null
+++ b/openmp/docs/remarks/OMP120.rst
@@ -0,0 +1,93 @@
+Transformed generic-mode kernel to SPMD-mode [OMP120]
+=====================================================
+
+.. _omp120:
+
+This optimization remark indicates that the execution strategy for the OpenMP
+target offloading kernel was changed. Generic-mode kernels execute by using a
+single thread to schedule parallel worker threads using a state machine. This
+code transformation can move a kernel that was initially generated in generic
+mode to SPMD-mode where all threads are active at the same time with no state
+machine. This is only possible if the instructions previously executed by a
+single thread have no side-effects or can be guarded. If the instructions have
+no side-effects they are simply recomputed by each thread.
+
+Generic-mode is often considerably slower than SPMD-mode because of the extra
+overhead required to separately schedule worker threads and pass data between
+them. This optimization allows users to use generic-mode semantics while
+achieving the performance of SPMD-mode. This can be helpful when defining shared
+memory between the threads using :ref:`OMP111 <omp111>`.
+
+Examples
+--------
+
+Normally, any kernel whose OpenMP target and parallel directives are split
+rather than combined will be executed in generic-mode. Sometimes it is easier to
+use generic-mode semantics to define shared memory, or to more tightly control
+the distribution of the threads. The following example shows a naive
+matrix-matrix multiplication that contains code that will need to be guarded.
+
+.. code-block:: c++
+
+  void matmul(int M, int N, int K, double *A, double *B, double *C) {
+  #pragma omp target teams distribute collapse(2) \
+    map(to:A[0: M*K]) map(to:B[0: K*N]) map(tofrom:C[0 : M*N])
+    for (int i = 0; i < M; i++) {
+      for (int j = 0; j < N; j++) {
+        double sum = 0.0;
+
+  #pragma omp parallel for reduction(+:sum) default(firstprivate)
+        for (int k = 0; k < K; k++)
+          sum += A[i*K + k] * B[k*N + j];
+
+        C[i*N + j] = sum;
+      }
+    }
+  }
+
+.. code-block:: console
+
+   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -fopenmp-version=51 -O2 -Rpass=openmp-opt omp120.cpp
+   omp120.cpp:6:14: remark: Replaced globalized variable with 8 bytes of shared memory. [OMP111]
+        double sum = 0.0;
+               ^
+   omp120.cpp:2:1: remark: Transformed generic-mode kernel to SPMD-mode. [OMP120]
+   #pragma omp target teams distribute collapse(2) \
+   ^
+
+This requires guarding the store to the shared variable ``sum`` and the store to
+the matrix ``C``. This can be thought of as generating the code below.
+
+.. code-block:: c++
+
+  void matmul(int M, int N, int K, double *A, double *B, double *C) {
+  #pragma omp target teams distribute collapse(2) \
+    map(to:A[0: M*K]) map(to:B[0: K*N]) map(tofrom:C[0 : M*N])
+    for (int i = 0; i < M; i++) {
+      for (int j = 0; j < N; j++) {
+      double sum;
+  #pragma omp parallel default(firstprivate) shared(sum)
+      {
+      #pragma omp barrier
+      if (omp_get_thread_num() == 0)
+        sum = 0.0;
+      #pragma omp barrier
+
+  #pragma omp for reduction(+:sum)
+        for (int k = 0; k < K; k++)
+          sum += A[i*K + k] * B[k*N + j];
+
+      #pragma omp barrier
+      if (omp_get_thread_num() == 0)
+        C[i*N + j] = sum;
+      #pragma omp barrier
+      }
+      }
+    }
+  }
+
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading optimization remark.

diff --git a/openmp/docs/remarks/OMP121.rst b/openmp/docs/remarks/OMP121.rst
new file mode 100644
index 0000000000000..baec1e5b841e5
--- /dev/null
+++ b/openmp/docs/remarks/OMP121.rst
@@ -0,0 +1,80 @@
+Value has potential side effects preventing SPMD-mode execution. Add `__attribute__((assume(\"ompx_spmd_amenable\")))` to the called function to override. [OMP121]
+===================================================================================================================================================================
+
+.. _omp121:
+
+This analysis remark indicates that a potential side-effect that cannot be
+guarded prevents the target region from executing in SPMD-mode. SPMD-mode
+requires that each thread is active inside the region. Any instruction that
+cannot be either recomputed by each thread independently or guarded and executed
+by a single thread prevents the region from executing in SPMD-mode. 
+
+This remark will attempt to print out the instructions preventing the region
+from being executed in SPMD-mode. Calls to functions outside the current
+translation unit will prevent this transformation from occurring as well, but
+can be overridden using an assumption stating that the called function contains
+no calls that prevent SPMD execution.
+
+Examples
+--------
+
+Calls to functions outside the current translation unit may contain instructions
+or operations that cannot be executed in SPMD-mode.
+
+.. code-block:: c++
+
+  extern int work();
+
+  void use(int x);
+  
+  void foo() {
+  #pragma omp target teams
+    {
+      int x = work();
+  #pragma omp parallel
+        use(x);
+  
+    }
+  }
+
+
+.. code-block:: console
+
+   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass-analysis=openmp-opt omp121.cpp
+   omp121.cpp:8:13: remark: Value has potential side effects preventing SPMD-mode 
+   execution.  Add `__attribute__((assume("ompx_spmd_amenable")))` to the called function 
+   to override. [OMP121]
+   int x = work();
+            ^
+
+As the remark suggests, the problem is caused by the unknown call to the
+external function ``work``. This can be overridden by asserting that it does not
+contain any code that prevents SPMD-mode execution.
+
+.. code-block:: c++
+
+  __attribute__((assume("ompx_spmd_amenable"))) extern int work();
+
+  void use(int x);
+  
+  void foo() {
+  #pragma omp target teams
+    {
+      int x = work();
+  #pragma omp parallel
+        use(x);
+  
+    }
+  }
+
+.. code-block:: console
+
+   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass=openmp-opt omp121.cpp
+   omp121.cpp:6:1: remark: Transformed generic-mode kernel to SPMD-mode. [OMP120]
+   #pragma omp target teams
+   ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading analysis remark.

diff --git a/openmp/docs/remarks/OMP130.rst b/openmp/docs/remarks/OMP130.rst
new file mode 100644
index 0000000000000..9ba8d24efe135
--- /dev/null
+++ b/openmp/docs/remarks/OMP130.rst
@@ -0,0 +1,36 @@
+Removing unused state machine from generic-mode kernel. [OMP130]
+================================================================
+
+.. _omp130:
+
+This optimization remark indicates that an unused state machine was removed from
+a target region. This occurs when there are no parallel regions inside of a
+target construct. Normally, a state machine is required to schedule the threads
+inside of a parallel region. If there are no parallel regions, the state machine
+is unnecessary because there is only a single thread active at any time. 
+
+Examples
+--------
+
+This optimization should occur on any target region that does not contain any
+parallel work.
+
+.. code-block:: c++
+
+   void copy(int N, double *X, double *Y) {
+   #pragma omp target teams distribute map(tofrom: X[0:N]) map(tofrom: Y[0:N])
+     for (int i = 0; i < N; ++i)
+       Y[i] = X[i];
+   }
+
+.. code-block:: console
+
+   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass=openmp-opt omp130.cpp
+   omp130.cpp:2:1: remark: Removing unused state machine from generic-mode kernel. [OMP130]
+   #pragma omp target teams distribute map(tofrom: X[0:N]) map(tofrom: Y[0:N])
+   ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading optimization remark.

diff --git a/openmp/docs/remarks/OMP131.rst b/openmp/docs/remarks/OMP131.rst
new file mode 100644
index 0000000000000..0864fc8cd7f1a
--- /dev/null
+++ b/openmp/docs/remarks/OMP131.rst
@@ -0,0 +1,45 @@
+Rewriting generic-mode kernel with a customized state machine. [OMP131]
+=======================================================================
+
+.. _omp131:
+
+This optimization remark indicates that a generic-mode kernel on the device was
+specialized for the given target region. When offloading in generic-mode, a
+state machine is required to schedule the work between the parallel worker
+threads. This optimization specializes the state machine in cases where there is
+a known number of parallel regions inside the kernel.
+
+Examples
+--------
+
+This optimization should occur on any generic-mode kernel that has visibility of
+all its parallel regions but cannot be moved to SPMD-mode.
+
+.. code-block:: c++
+
+   #pragma omp declare target
+   int TID;
+   #pragma omp end declare target
+
+   void foo() {
+   #pragma omp target
+   {
+    TID = omp_get_thread_num();
+    #pragma omp parallel
+    {
+      work();
+    }
+   }
+   }
+
+.. code-block:: console
+
+   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass=openmp-opt omp131.cpp
+   omp131.cpp:8:1: remark: Rewriting generic-mode kernel with a customized state machine. [OMP131]
+   #pragma omp target
+   ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading optimization remark.

diff --git a/openmp/docs/remarks/OMP132.rst b/openmp/docs/remarks/OMP132.rst
new file mode 100644
index 0000000000000..6ba518f0cde76
--- /dev/null
+++ b/openmp/docs/remarks/OMP132.rst
@@ -0,0 +1,45 @@
+Generic-mode kernel is executed with a customized state machine that requires a fallback. [OMP132]
+==================================================================================================
+
+.. _omp132:
+
+This analysis remark indicates that a state machine rewrite occurred, but
+could not be done fully because of unknown calls to functions that may contain
+parallel regions. The state machine handles scheduling work between parallel
+worker threads on the device when operating in generic-mode. The presence of
+unknown parallel regions prevents the optimization from fully rewriting the
+state machine.
+
+Examples
+--------
+
+This will occur for any generic-mode kernel that may contain unknown parallel
+regions. This is typically coupled with the :ref:`OMP133 <omp133>` remark.
+
+.. code-block:: c++
+
+   extern void setup();
+
+   void foo() {
+   #pragma omp target
+   {
+     setup();
+     #pragma omp parallel
+     {
+       work();
+     }
+   }
+   }
+
+.. code-block:: console
+
+   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass-analysis=openmp-opt omp132.cpp
+   omp132.cpp:4:1: remark: Generic-mode kernel is executed with a customized state machine 
+   that requires a fallback. [OMP132]
+   #pragma omp target
+   ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading analysis remark.

diff --git a/openmp/docs/remarks/OMP133.rst b/openmp/docs/remarks/OMP133.rst
new file mode 100644
index 0000000000000..3bc1a33ce8cb9
--- /dev/null
+++ b/openmp/docs/remarks/OMP133.rst
@@ -0,0 +1,70 @@
+Call may contain unknown parallel regions. Use `__attribute__((assume("omp_no_parallelism")))` to override. [OMP133]
+====================================================================================================================
+
+.. _omp133:
+
+This analysis remark identifies calls that prevented :ref:`OMP131 <omp131>` from
+providing the generic-mode kernel with a fully specialized state machine. This
+remark will identify each call that may contain unknown parallel regions that
+caused the kernel to require a fallback.
+
+Examples
+--------
+
+This will occur for any generic-mode kernel that may contain unknown parallel
+regions. This is typically coupled with the :ref:`OMP132 <omp132>` remark.
+
+.. code-block:: c++
+
+   extern void setup();
+
+   void foo() {
+   #pragma omp target
+   {
+     setup();
+     #pragma omp parallel
+     {
+       work();
+     }
+   }
+   }
+
+.. code-block:: console
+
+   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass-analysis=openmp-opt omp133.cpp
+   omp133.cpp:6:5: remark: Call may contain unknown parallel regions. Use 
+   `__attribute__((assume("omp_no_parallelism")))` to override. [OMP133]
+   setup();
+   ^
+
+The remark suggests marking the function with the assumption that it contains no
+parallel regions. If this is done then the kernel will be rewritten with a fully
+specialized state machine.
+
+.. code-block:: c++
+
+   __attribute__((assume("omp_no_parallelism"))) extern void setup();
+
+
+   void foo() {
+   #pragma omp target
+   {
+     setup();
+     #pragma omp parallel
+     {
+       work();
+     }
+   }
+   }
+
+.. code-block:: console
+
+   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass=openmp-opt omp133.cpp
+   omp133.cpp:4:1: remark: Rewriting generic-mode kernel with a customized state machine. [OMP131]
+   #pragma omp target
+   ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading analysis remark.

diff --git a/openmp/docs/remarks/OMP140.rst b/openmp/docs/remarks/OMP140.rst
new file mode 100644
index 0000000000000..57fc4653267e2
--- /dev/null
+++ b/openmp/docs/remarks/OMP140.rst
@@ -0,0 +1,49 @@
+Could not internalize function. Some optimizations may not be possible. [OMP140]
+================================================================================
+
+.. _omp140:
+
+This analysis remark indicates that function internalization failed for the
+given function. Internalization occurs when a call to a function that ordinarily
+has external visibility is replaced with a call to a copy of that function with
+only internal visibility. This allows the compiler to make strong static
+assertions about the context a function is called in. Without internalization
+this analysis would always be invalidated by the possibility of someone calling
+the function in a different context outside of the current translation unit.
+This is necessary for optimizations like :ref:`OMP111 <omp111>` and :ref:`OMP120
+<omp120>`. If a function failed to be internalized it most likely has linkage
+that cannot be copied. Internalization is currently only enabled by default for
+OpenMP target offloading.
+
+Examples
+--------
+
+This will occur for any function declaration that has incompatible linkage.
+
+.. code-block:: c++
+
+   __attribute__((weak)) void setup();
+
+   void foo() {
+   #pragma omp target
+   {
+     setup();
+     #pragma omp parallel
+     {
+       work();
+     }
+   }
+   }
+
+.. code-block:: console
+
+   $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass-analysis=openmp-opt omp140.cpp
+   omp140.cpp:1:1: remark: Could not internalize function. Some optimizations may not 
+   be possible. [OMP140]
+   __attribute__((weak)) void setup();
+   ^
+
+Diagnostic Scope
+----------------
+
+OpenMP analysis remark.

diff --git a/openmp/docs/remarks/OMP150.rst b/openmp/docs/remarks/OMP150.rst
new file mode 100644
index 0000000000000..0016a58955ac1
--- /dev/null
+++ b/openmp/docs/remarks/OMP150.rst
@@ -0,0 +1,42 @@
+Parallel region merged with parallel region at <location>. [OMP150]
+===================================================================
+
+.. _omp150:
+
+This optimization remark indicates that a parallel region was merged with others
+into a single parallel region. Parallel region merging fuses consecutive
+parallel regions to reduce the team activation overhead of forking and to increase
+the scope of possible OpenMP-specific optimizations within merged parallel
+regions. This optimization can also guard sequential code between two parallel
+regions if applicable.
+
+Example
+-------
+
+This optimization should apply to any compatible and consecutive parallel
+regions. In this case the sequential region between the parallel regions will be
+guarded so it is only executed by a single thread in the new merged region.
+
+.. code-block:: c++
+
+  void foo() {
+  #pragma omp parallel
+    parallel_work();
+  
+    sequential_work();
+  
+  #pragma omp parallel
+    parallel_work();
+  }
+
+.. code-block:: console
+
+   $ clang++ -fopenmp -O2 -Rpass=openmp-opt -mllvm -openmp-opt-enable-merging omp150.cpp
+   omp150.cpp:2:1: remark: Parallel region merged with parallel region at omp150.cpp:7:1. [OMP150]
+   #pragma omp parallel
+   ^
+
+Diagnostic Scope
+----------------
+
+OpenMP optimization remark.

diff --git a/openmp/docs/remarks/OMP160.rst b/openmp/docs/remarks/OMP160.rst
new file mode 100644
index 0000000000000..a65315f844161
--- /dev/null
+++ b/openmp/docs/remarks/OMP160.rst
@@ -0,0 +1,44 @@
+Removing parallel region with no side-effects. [OMP160]
+=======================================================
+
+.. _omp160:
+
+This optimization remark indicates that a parallel region was deleted because it
+was not found to have any side-effects. This can occur if the region does not
+write any of its results to memory visible outside the region. This optimization
+is necessary because the barrier between sequential and parallel code typically
+prevents dead code elimination from completely removing the region. Otherwise
+there would still be the overhead of forking and joining the threads.
+
+Example
+-------
+
+This optimization occurs whenever a parallel region is not found to have any
+side-effects. This can occur if the parallel region only reads memory or is
+simply empty.
+
+.. code-block:: c++
+
+  void foo() {
+  #pragma omp parallel
+    { }
+  #pragma omp parallel
+    { int x = 1; }
+  }
+
+.. code-block:: console
+
+   $ clang++ -fopenmp -O2 -Rpass=openmp-opt omp160.cpp
+   omp160.cpp:4:1: remark: Removing parallel region with no side-effects. [OMP160] [-Rpass=openmp-opt]
+   #pragma omp parallel
+   ^
+   omp160.cpp:2:1: remark: Removing parallel region with no side-effects. [OMP160] [-Rpass=openmp-opt]
+   #pragma omp parallel
+   ^
+
+Diagnostic Scope
+----------------
+
+OpenMP optimization remark.

diff --git a/openmp/docs/remarks/OMP170.rst b/openmp/docs/remarks/OMP170.rst
new file mode 100644
index 0000000000000..5fcb540fc6b86
--- /dev/null
+++ b/openmp/docs/remarks/OMP170.rst
@@ -0,0 +1,41 @@
+OpenMP runtime call <call> deduplicated. [OMP170]
+=================================================
+
+.. _omp170:
+
+This optimization remark indicates that a call to an OpenMP runtime function was
+replaced with the result of an existing one. This occurs when the compiler knows
+that the result of a runtime call is immutable. Removing duplicate calls is done
+by replacing all calls to that function with the result of the first call. This
+cannot be done by generic optimizations because the implementations of the
+OpenMP runtime calls live in a separate library the compiler cannot see.
+
+Example
+-------
+
+This optimization will trigger for known OpenMP runtime calls whose return value
+will not change.
+
+.. code-block:: c++
+
+  #include <omp.h>
+  #include <stdlib.h>
+
+  void foo(int N) {
+    double *A = malloc(N * omp_get_thread_limit() * sizeof(double));
+    double *B = malloc(N * omp_get_thread_limit() * sizeof(double));
+  
+  #pragma omp parallel
+    work(&A[omp_get_thread_num() * N]);
+  #pragma omp parallel
+    work(&B[omp_get_thread_num() * N]);
+  }
+
+.. code-block:: console
+
+  $ clang -fopenmp -O2 -Rpass=openmp-opt omp170.c 
+  omp170.c:5:26: remark: OpenMP runtime call omp_get_thread_limit deduplicated. [OMP170]
+  double *A = malloc(N * omp_get_thread_limit() * sizeof(double));
+                         ^
+
+Diagnostic Scope
+----------------
+
+OpenMP optimization remark.

diff --git a/openmp/docs/remarks/OptimizationRemarks.rst b/openmp/docs/remarks/OptimizationRemarks.rst
index 4c256fd996944..1463a61be40e2 100644
--- a/openmp/docs/remarks/OptimizationRemarks.rst
+++ b/openmp/docs/remarks/OptimizationRemarks.rst
@@ -15,41 +15,95 @@ features of the remark system the clang documentation should be consulted:
   <https://clang.llvm.org/docs/ClangCommandLineReference.html#cmdoption-clang1-fsave-optimization-record>`_
 
 
-.. _ompXXX:
-
-Some OpenMP remarks start with a "tag", like `[OMP100]`, which indicates that
-there is further information about them on this page. To directly jump to the
-respective entry, navigate to
-`https://openmp.llvm.org/docs/remarks/OptimizationRemarks.html#ompXXX <https://openmp.llvm.org/docs/remarks/OptimizationRemarks.html#ompXXX>`_ where `XXX` is
-the three digit code shown in the tag.
-
-
-----
-
-
-.. _omp100:
-.. _omp_no_external_caller_in_target_region:
-
-`[OMP100]` Potentially unknown OpenMP target region caller
-----------------------------------------------------------
-
-A function remark that indicates the function, when compiled for a GPU, is
-potentially called from outside the translation unit. Note that a remark is
-only issued if we tried to perform an optimization which would require us to
-know all callers on the GPU.
-
-To facilitate OpenMP semantics on GPUs we provide a runtime mechanism through
-which the code that makes up the body of a parallel region is shared with the
-threads in the team. Generally we use the address of the outlined parallel
-region to identify the code that needs to be executed. If we know all target
-regions that reach the parallel region we can avoid this function pointer
-passing scheme and often improve the register usage on the GPU. However, If a
-parallel region on the GPU is in a function with external linkage we may not
-know all callers statically. If there are outside callers within target
-regions, this remark is to be ignored. If there are no such callers, users can
-modify the linkage and thereby help optimization with a `static` or
-`__attribute__((internal))` function annotation. If changing the linkage is
-impossible, e.g., because there are outside callers on the host, one can split
-the function into an external visible interface which is not compiled for
-the target and an internal implementation which is compiled for the target
-and should be called from within the target region.
+OpenMP Remarks
+--------------
+
+.. toctree::
+   :hidden:
+   :maxdepth: 1
+
+   OMP100
+   OMP101
+   OMP102
+   OMP110
+   OMP111
+   OMP112
+   OMP113
+   OMP120
+   OMP121
+   OMP130
+   OMP131
+   OMP132
+   OMP133
+   OMP140
+   OMP150
+   OMP160
+   OMP170
+
+.. list-table::
+   :widths: 15 15 70
+   :header-rows: 1
+
+   * - Diagnostic Number
+     - Diagnostic Kind
+     - Diagnostic Description
+   * - :ref:`OMP100 <omp100>`
+     - Analysis
+     - Potentially unknown OpenMP target region caller.
+   * - :ref:`OMP101 <omp101>`
+     - Analysis
+     - Parallel region is used in unknown / unexpected ways. Will not attempt to
+       rewrite the state machine.
+   * - :ref:`OMP102 <omp102>`
+     - Analysis
+     - Parallel region is not called from a unique kernel. Will not attempt to
+       rewrite the state machine.
+   * - :ref:`OMP110 <omp110>`
+     - Optimization
+     - Moving globalized variable to the stack.
+   * - :ref:`OMP111 <omp111>`
+     - Optimization
+     - Replaced globalized variable with X bytes of shared memory.
+   * - :ref:`OMP112 <omp112>`
+     - Missed
+     - Found thread data sharing on the GPU. Expect degraded performance due to
+       data globalization.
+   * - :ref:`OMP113 <omp113>`
+     - Missed
+     - Could not move globalized variable to the stack. Variable is potentially
+       captured in call. Mark parameter as `__attribute__((noescape))` to
+       override.
+   * - :ref:`OMP120 <omp120>`
+     - Optimization
+     - Transformed generic-mode kernel to SPMD-mode.
+   * - :ref:`OMP121 <omp121>`
+     - Analysis
+     - Value has potential side effects preventing SPMD-mode execution. Add
+       `__attribute__((assume(\"ompx_spmd_amenable\")))` to the called function
+       to override.
+   * - :ref:`OMP130 <omp130>`
+     - Optimization
+     - Removing unused state machine from generic-mode kernel.
+   * - :ref:`OMP131 <omp131>`
+     - Optimization
+     - Rewriting generic-mode kernel with a customized state machine.
+   * - :ref:`OMP132 <omp132>`
+     - Analysis
+     - Generic-mode kernel is executed with a customized state machine that
+       requires a fallback.
+   * - :ref:`OMP133 <omp133>`
+     - Analysis
+     - Call may contain unknown parallel regions. Use
+       `__attribute__((assume("omp_no_parallelism")))` to override.
+   * - :ref:`OMP140 <omp140>`
+     - Analysis
+     - Could not internalize function. Some optimizations may not be possible.
+   * - :ref:`OMP150 <omp150>`
+     - Optimization
+     - Parallel region merged with parallel region at <location>.
+   * - :ref:`OMP160 <omp160>`
+     - Optimization
+     - Removing parallel region with no side-effects.
+   * - :ref:`OMP170 <omp170>`
+     - Optimization
+     - OpenMP runtime call <call> deduplicated.


        

