[Openmp-commits] [openmp] 1616407 - [OpenMP] Add remark documentation to the OpenMP webpage
Joseph Huber via Openmp-commits
openmp-commits at lists.llvm.org
Fri Jul 16 11:10:10 PDT 2021
Author: Joseph Huber
Date: 2021-07-16T14:09:43-04:00
New Revision: 16164079213ded81706a9c6d00874805e2dccbdd
URL: https://github.com/llvm/llvm-project/commit/16164079213ded81706a9c6d00874805e2dccbdd
DIFF: https://github.com/llvm/llvm-project/commit/16164079213ded81706a9c6d00874805e2dccbdd.diff
LOG: [OpenMP] Add remark documentation to the OpenMP webpage
This patch begins adding documentation for each remark emitted by
`openmp-opt`. This builds on the IDs introduced in D105939 so that users
can more easily identify each remark in the webpage.
Depends on D105939.
Reviewed By: jdoerfert
Differential Revision: https://reviews.llvm.org/D106018
Added:
openmp/docs/remarks/OMP100.rst
openmp/docs/remarks/OMP101.rst
openmp/docs/remarks/OMP102.rst
openmp/docs/remarks/OMP110.rst
openmp/docs/remarks/OMP111.rst
openmp/docs/remarks/OMP112.rst
openmp/docs/remarks/OMP113.rst
openmp/docs/remarks/OMP120.rst
openmp/docs/remarks/OMP121.rst
openmp/docs/remarks/OMP130.rst
openmp/docs/remarks/OMP131.rst
openmp/docs/remarks/OMP132.rst
openmp/docs/remarks/OMP133.rst
openmp/docs/remarks/OMP140.rst
openmp/docs/remarks/OMP150.rst
openmp/docs/remarks/OMP160.rst
openmp/docs/remarks/OMP170.rst
Modified:
openmp/docs/remarks/OptimizationRemarks.rst
Removed:
################################################################################
diff --git a/openmp/docs/remarks/OMP100.rst b/openmp/docs/remarks/OMP100.rst
new file mode 100644
index 0000000000000..72a7fd27d94ba
--- /dev/null
+++ b/openmp/docs/remarks/OMP100.rst
@@ -0,0 +1,26 @@
+Potentially unknown OpenMP target region caller `[OMP100]`
+==========================================================
+
+.. _omp100:
+.. _omp_no_external_caller_in_target_region:
+
+A function remark that indicates the function, when compiled for a GPU, is
+potentially called from outside the translation unit. Note that a remark is
+only issued if we tried to perform an optimization which would require us to
+know all callers on the GPU.
+
+To facilitate OpenMP semantics on GPUs we provide a runtime mechanism through
+which the code that makes up the body of a parallel region is shared with the
+threads in the team. Generally we use the address of the outlined parallel
+region to identify the code that needs to be executed. If we know all target
+regions that reach the parallel region we can avoid this function pointer
+passing scheme and often improve the register usage on the GPU. However, if a
+parallel region on the GPU is in a function with external linkage we may not
+know all callers statically. If there are outside callers within target
+regions, this remark is to be ignored. If there are no such callers, users can
+modify the linkage and thereby help optimization with a `static` or
+`__attribute__((internal))` function annotation. If changing the linkage is
+impossible, e.g., because there are outside callers on the host, one can split
+the function into an externally visible interface which is not compiled for
+the target and an internal implementation which is compiled for the target
+and should be called from within the target region.
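+
+Example
+-------
+
+The following sketch illustrates the suggested split. The names ``body_impl``,
+``body``, and ``foo`` are illustrative only: ``body_impl`` is the internal
+implementation compiled for the target, while ``body`` remains the externally
+visible interface for callers on the host.
+
+.. code-block:: c++
+
+  #include <omp.h>
+
+  // Internal implementation compiled for the target. Its internal linkage
+  // makes all callers on the GPU statically known.
+  static void body_impl(int *data) {
+    #pragma omp parallel
+    data[omp_get_thread_num()] += 1;
+  }
+
+  // Externally visible interface, kept for outside callers on the host.
+  void body(int *data) { body_impl(data); }
+
+  void foo(int *data) {
+    #pragma omp target map(tofrom : data[0:128])
+    body_impl(data); // Call the internal implementation in the target region.
+  }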
diff --git a/openmp/docs/remarks/OMP101.rst b/openmp/docs/remarks/OMP101.rst
new file mode 100644
index 0000000000000..4483cfcc33a98
--- /dev/null
+++ b/openmp/docs/remarks/OMP101.rst
@@ -0,0 +1,6 @@
+Parallel region is used in unknown / unexpected ways. Will not attempt to rewrite the state machine. [OMP101]
+=============================================================================================================
+
+.. _omp101:
+
+An analysis remark that indicates that a parallel region has unknown calls.
diff --git a/openmp/docs/remarks/OMP102.rst b/openmp/docs/remarks/OMP102.rst
new file mode 100644
index 0000000000000..32ae59ae48239
--- /dev/null
+++ b/openmp/docs/remarks/OMP102.rst
@@ -0,0 +1,8 @@
+Parallel region is not called from a unique kernel. Will not attempt to rewrite the state machine. [OMP102]
+===========================================================================================================
+
+.. _omp102:
+
+This analysis remark indicates that a given parallel region is called by
+multiple kernels. This prevents the compiler from optimizing it for a single
+kernel and rewriting the state machine.
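+
+As a sketch (the function names are illustrative), this situation arises when a
+function containing a parallel region is reachable from more than one target
+region, so the parallel region cannot be associated with a unique kernel:
+
+.. code-block:: c++
+
+  void shared_body() {
+    #pragma omp parallel
+    { /* parallel work */ }
+  }
+
+  void foo() {
+    // First kernel that reaches the parallel region.
+    #pragma omp target
+    shared_body();
+
+    // Second kernel that reaches the same parallel region.
+    #pragma omp target
+    shared_body();
+  }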
diff --git a/openmp/docs/remarks/OMP110.rst b/openmp/docs/remarks/OMP110.rst
new file mode 100644
index 0000000000000..6d69d6d4bb20f
--- /dev/null
+++ b/openmp/docs/remarks/OMP110.rst
@@ -0,0 +1,83 @@
+Moving globalized variable to the stack. [OMP110]
+=================================================
+
+.. _omp110:
+
+This optimization remark indicates that a globalized variable was moved back to
+thread-local stack memory on the device. This occurs when the optimization pass
+can determine that a globalized variable cannot possibly be shared between
+threads and globalization was unnecessary. Using stack memory is the best-case
+scenario for data globalization as the variable can now be stored in fast
+register files on the device. This optimization requires full visibility of each
+variable.
+
+Globalization typically occurs when a pointer to a thread-local variable escapes
+the current scope. The compiler needs to be pessimistic and assume that the
+pointer could be shared between multiple threads according to the OpenMP
+standard. This is expensive on target offloading devices that do not allow
+threads to share data by default. Instead, this data must be moved to memory
+that can be shared, such as shared or global memory. This optimization moves the
+data back from shared or global memory to thread-local stack memory if the data
+is not actually shared between the threads.
+
+Examples
+--------
+
+A trivial case of globalization can be seen in the following example. The
+compiler sees that a pointer to the thread-local variable ``x`` escapes the
+current scope and must globalize it even though it is not actually necessary.
+Fortunately, this optimization can undo the globalization by looking at the
+variable's usage.
+
+.. code-block:: c++
+
+ void use(int *x) { }
+
+ void foo() {
+ int x;
+ use(&x);
+ }
+
+ int main() {
+ #pragma omp target parallel
+ foo();
+ }
+
+.. code-block:: console
+
+ $ clang++ -fopenmp -fopenmp-targets=nvptx64 omp110.cpp -O1 -Rpass=openmp-opt
+ omp110.cpp:6:7: remark: Moving globalized variable to the stack. [OMP110]
+ int x;
+ ^
+
+A less trivial example can be seen using C++'s complex numbers. In this case the
+overloaded arithmetic operators cause pointers to the complex numbers to escape
+the current scope, but they can again be removed once the usage is visible.
+
+.. code-block:: c++
+
+ #include <complex>
+
+ using complex = std::complex<double>;
+
+ void zaxpy(complex *X, complex *Y, const complex D, int N) {
+ #pragma omp target teams distribute parallel for firstprivate(D)
+ for (int i = 0; i < N; ++i)
+ Y[i] = D * X[i] + Y[i];
+ }
+
+.. code-block:: console
+
+ $ clang++ -fopenmp -fopenmp-targets=nvptx64 omp110.cpp -O1 -Rpass=openmp-opt
+ In file included from omp110.cpp:1:
+ In file included from /usr/bin/clang/lib/clang/13.0.0/include/openmp_wrappers/complex:27:
+ /usr/include/c++/8/complex:328:20: remark: Moving globalized variable to the stack. [OMP110]
+ complex<_Tp> __r = __x;
+ ^
+ /usr/include/c++/8/complex:388:20: remark: Moving globalized variable to the stack. [OMP110]
+ complex<_Tp> __r = __x;
+ ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading optimization remark.
diff --git a/openmp/docs/remarks/OMP111.rst b/openmp/docs/remarks/OMP111.rst
new file mode 100644
index 0000000000000..c0f4e3b9adc28
--- /dev/null
+++ b/openmp/docs/remarks/OMP111.rst
@@ -0,0 +1,66 @@
+Replaced globalized variable with X bytes of shared memory. [OMP111]
+====================================================================
+
+.. _omp111:
+
+This optimization occurs when a globalized variable's data is shared between
+multiple threads, but requires a static amount of memory that can be determined
+at compile time. This is the case when only a single thread creates the memory
+and then shares it with every thread. The memory can then be pushed to a
+static buffer of shared memory on the device. This optimization allows users to
+declare shared memory on the device without using OpenMP's allocators.
+
+Globalization normally occurs when a pointer to a thread-local variable escapes
+the current scope. If a single thread is responsible for creating and sharing
+the data it can instead be mapped directly to shared memory on the target
+device. Checking if only a single thread can execute an instruction requires
+that the parent functions have internal linkage. Otherwise, an external caller
+could invalidate this analysis by having multiple threads call that function.
+The optimization pass can automatically make internal copies of each function,
+but it is still recommended to mark them as internal using keywords like
+``static`` whenever possible.
+
+Example
+-------
+
+This optimization should apply to any variable declared in an OpenMP target
+region that is then shared with every thread in a parallel region. This allows
+the user to declare shared memory without using custom allocators. A simple
+stencil calculation shows how this can be used.
+
+.. code-block:: c++
+
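+  #define MC 32  // Assumed tile size, not shown in the original snippet;
+  #define NC 32  // 32 x 32 doubles matches the 8192 bytes in the remark.
+  #define dX 1.0 // Assumed grid spacing, also for illustration only.
+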
+ void stencil(int M, int N, double *X, double *Y) {
+ #pragma omp target teams distribute collapse(2) \
+ map(to : X [0:M * N]) map(tofrom : Y [0:M * N])
+ for (int i0 = 0; i0 < M; i0 += MC) {
+ for (int j0 = 0; j0 < N; j0 += NC) {
+ double sX[MC][NC];
+
+ #pragma omp parallel for collapse(2) default(firstprivate)
+ for (int i1 = 0; i1 < MC; ++i1)
+ for (int j1 = 0; j1 < NC; ++j1)
+ sX[i1][j1] = X[(i0 + i1) * N + (j0 + j1)];
+
+ #pragma omp parallel for collapse(2) default(firstprivate)
+ for (int i1 = 1; i1 < MC - 1; ++i1)
+ for (int j1 = 1; j1 < NC - 1; ++j1)
+        Y[(i0 + i1) * N + (j0 + j1)] = (sX[i1 + 1][j1] + sX[i1 - 1][j1] +
+ sX[i1][j1 + 1] + sX[i1][j1 - 1] +
+ -4.0 * sX[i1][j1]) / (dX * dX);
+ }
+ }
+ }
+
+.. code-block:: console
+
+
+ $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass=openmp-opt -fopenmp-version=51 omp111.cpp
+ omp111.cpp:10:14: remark: Replaced globalized variable with 8192 bytes of shared memory. [OMP111]
+ double sX[MC][NC];
+ ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading optimization remark.
diff --git a/openmp/docs/remarks/OMP112.rst b/openmp/docs/remarks/OMP112.rst
new file mode 100644
index 0000000000000..b4b7c7d48f07d
--- /dev/null
+++ b/openmp/docs/remarks/OMP112.rst
@@ -0,0 +1,89 @@
+Found thread data sharing on the GPU. Expect degraded performance due to data globalization. [OMP112]
+=====================================================================================================
+
+.. _omp112:
+
+This missed remark indicates that a globalized value was found on the target
+device that could not be replaced with stack memory by :ref:`OMP110 <omp110>`
+or with shared memory by :ref:`OMP111 <omp111>`. Globalization that has not been
+removed will need to be handled by the runtime and will significantly hurt
+performance.
+
+The OpenMP standard expects that threads can always share their data with
+each other. However, this is not true by default when offloading to a target
+device such as a GPU. Threads on a GPU cannot share their data unless it is
+first placed in global or shared memory. In order to create standards-compliant
+code, the Clang compiler will globalize any variables that could potentially be
+shared between the threads. In the majority of cases, globalized variables can
+either be returned to a thread-local stack or pushed to shared memory. However,
+in a few cases globalization is necessary and will incur a performance penalty.
+
+Examples
+--------
+
+This example shows legitimate data sharing on the device. It is a convoluted
+example, but is completely compliant with the OpenMP standard. If globalization
+was not added, this would result in different results on different target
+devices.
+
+.. code-block:: c++
+
+ #include <omp.h>
+ #include <cstdio>
+
+ #pragma omp declare target
+ static int *p;
+ #pragma omp end declare target
+
+ void foo() {
+ int x = omp_get_thread_num();
+ if (omp_get_thread_num() == 1)
+ p = &x;
+
+ #pragma omp barrier
+
+ printf ("Thread %d: %d\n", omp_get_thread_num(), *p);
+ }
+
+ int main() {
+ #pragma omp target parallel
+ foo();
+ }
+
+.. code-block:: console
+
+ $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass-missed=openmp-opt omp112.cpp
+ omp112.cpp:9:7: remark: Found thread data sharing on the GPU. Expect degraded performance
+ due to data globalization. [OMP112] [-Rpass-missed=openmp-opt]
+ int x = omp_get_thread_num();
+ ^
+
+A less convoluted example of globalization that cannot be removed occurs when
+calling functions that aren't visible from the current translation unit.
+
+.. code-block:: c++
+
+ extern void use(int *x);
+
+ void foo() {
+ int x;
+ use(&x);
+ }
+
+ int main() {
+ #pragma omp target parallel
+ foo();
+ }
+
+.. code-block:: console
+
+ $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass-missed=openmp-opt omp112.cpp
+ omp112.cpp:4:7: remark: Found thread data sharing on the GPU. Expect degraded performance
+ due to data globalization. [OMP112] [-Rpass-missed=openmp-opt]
+ int x;
+ ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading missed remark.
diff --git a/openmp/docs/remarks/OMP113.rst b/openmp/docs/remarks/OMP113.rst
new file mode 100644
index 0000000000000..d319f7ab68fb5
--- /dev/null
+++ b/openmp/docs/remarks/OMP113.rst
@@ -0,0 +1,81 @@
+Could not move globalized variable to the stack. Variable is potentially captured in call. Mark parameter as `__attribute__((noescape))` to override. [OMP113]
+==============================================================================================================================================================
+
+.. _omp113:
+
+This missed remark indicates that a globalized value could not be moved to the
+stack because it is potentially captured by a call to a function we cannot
+analyze. In order for a globalized variable to be moved to the stack, no copies
+of its pointer may be stored. Otherwise it is considered captured and could
+potentially be shared between the threads. This can be overridden using a
+parameter-level attribute as suggested in the remark text.
+
+Globalization will occur when a pointer to a thread-local variable escapes
+the current scope. In most cases it can be determined that the variable cannot
+be shared if a copy of its pointer is never made. However, this remark indicates
+that a copy of the pointer either is present, or may be made because the pointer
+is used outside the current translation unit.
+
+Examples
+--------
+
+If a pointer to a thread-local variable is passed to a function not visible in
+the current translation unit we need to assume a copy is made of it that can be
+shared between the threads. This prevents :ref:`OMP110 <omp110>` from
+triggering, which will result in a performance penalty when executing on the
+target device.
+
+.. code-block:: c++
+
+ extern void use(int *x);
+
+ void foo() {
+ int x;
+ use(&x);
+ }
+
+ int main() {
+ #pragma omp target parallel
+ foo();
+ }
+
+.. code-block:: console
+
+ $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass-missed=openmp-opt omp113.cpp
+  omp113.cpp:4:7: remark: Could not move globalized variable to the stack. Variable is
+ potentially captured in call. Mark parameter as `__attribute__((noescape))` to
+ override. [OMP113]
+ int x;
+ ^
+
+As the remark suggests, this behavior can be overridden using the ``noescape``
+attribute. This tells the compiler that no reference to the object the pointer
+points to that is derived from the parameter value will survive after the
+function returns. The user is responsible for verifying that this assertion is
+correct.
+
+.. code-block:: c++
+
+ extern void use(__attribute__((noescape)) int *x);
+
+ void foo() {
+ int x;
+ use(&x);
+ }
+
+ int main() {
+ #pragma omp target parallel
+ foo();
+ }
+
+.. code-block:: console
+
+ $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass=openmp-opt omp113.cpp
+  omp113.cpp:4:7: remark: Moving globalized variable to the stack. [OMP110]
+ int x;
+ ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading missed remark.
diff --git a/openmp/docs/remarks/OMP120.rst b/openmp/docs/remarks/OMP120.rst
new file mode 100644
index 0000000000000..d3c626db34415
--- /dev/null
+++ b/openmp/docs/remarks/OMP120.rst
@@ -0,0 +1,93 @@
+Transformed generic-mode kernel to SPMD-mode [OMP120]
+=====================================================
+
+.. _omp120:
+
+This optimization remark indicates that the execution strategy for the OpenMP
+target offloading kernel was changed. Generic-mode kernels execute by using a
+single thread to schedule parallel worker threads using a state machine. This
+code transformation can move a kernel that was initially generated in generic
+mode to SPMD-mode where all threads are active at the same time with no state
+machine. This is only possible if the instructions previously executed by a
+single thread have no side-effects or can be guarded. If the instructions have
+no side-effects they are simply recomputed by each thread.
+
+Generic-mode is often considerably slower than SPMD-mode because of the extra
+overhead required to separately schedule worker threads and pass data between
+them. This optimization lets users write code with generic-mode semantics while
+achieving the performance of SPMD-mode. This can be helpful when defining shared
+memory between the threads using :ref:`OMP111 <omp111>`.
+
+Examples
+--------
+
+Normally, any kernel that contains separate OpenMP target and parallel regions
+will be executed in generic-mode. Sometimes it is easier to use generic-mode
+semantics to define shared memory, or to more tightly control the distribution
+of the threads. The example below shows a naive matrix-matrix multiplication
+that contains code that will need to be guarded.
+
+.. code-block:: c++
+
+ void matmul(int M, int N, int K, double *A, double *B, double *C) {
+ #pragma omp target teams distribute collapse(2) \
+ map(to:A[0: M*K]) map(to:B[0: K*N]) map(tofrom:C[0 : M*N])
+ for (int i = 0; i < M; i++) {
+ for (int j = 0; j < N; j++) {
+ double sum = 0.0;
+
+ #pragma omp parallel for reduction(+:sum) default(firstprivate)
+ for (int k = 0; k < K; k++)
+ sum += A[i*K + k] * B[k*N + j];
+
+ C[i*N + j] = sum;
+ }
+ }
+ }
+
+.. code-block:: console
+
+ $ clang++ -fopenmp -fopenmp-targets=nvptx64 -fopenmp-version=51 -O2 -Rpass=openmp-opt omp120.cpp
+ omp120.cpp:6:14: remark: Replaced globalized variable with 8 bytes of shared memory. [OMP111]
+ double sum = 0.0;
+ ^
+ omp120.cpp:2:1: remark: Transformed generic-mode kernel to SPMD-mode. [OMP120]
+ #pragma omp target teams distribute collapse(2) \
+ ^
+
+This requires guarding the store to the shared variable ``sum`` and the store to
+the matrix ``C``. This can be thought of as generating the code below.
+
+.. code-block:: c++
+
+ void matmul(int M, int N, int K, double *A, double *B, double *C) {
+ #pragma omp target teams distribute collapse(2) \
+ map(to:A[0: M*K]) map(to:B[0: K*N]) map(tofrom:C[0 : M*N])
+ for (int i = 0; i < M; i++) {
+ for (int j = 0; j < N; j++) {
+ double sum;
+ #pragma omp parallel default(firstprivate) shared(sum)
+ {
+ #pragma omp barrier
+ if (omp_get_thread_num() == 0)
+ sum = 0.0;
+ #pragma omp barrier
+
+ #pragma omp for reduction(+:sum)
+ for (int k = 0; k < K; k++)
+ sum += A[i*K + k] * B[k*N + j];
+
+ #pragma omp barrier
+ if (omp_get_thread_num() == 0)
+ C[i*N + j] = sum;
+ #pragma omp barrier
+ }
+ }
+ }
+ }
+
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading optimization remark.
diff --git a/openmp/docs/remarks/OMP121.rst b/openmp/docs/remarks/OMP121.rst
new file mode 100644
index 0000000000000..baec1e5b841e5
--- /dev/null
+++ b/openmp/docs/remarks/OMP121.rst
@@ -0,0 +1,80 @@
+Value has potential side effects preventing SPMD-mode execution. Add `__attribute__((assume(\"ompx_spmd_amenable\")))` to the called function to override. [OMP121]
+===================================================================================================================================================================
+
+.. _omp121:
+
+This analysis remark indicates that a potential side-effect that cannot be
+guarded prevents the target region from executing in SPMD-mode. SPMD-mode
+requires that each thread is active inside the region. Any instruction that
+cannot be either recomputed by each thread independently or guarded and executed
+by a single thread prevents the region from executing in SPMD-mode.
+
+This remark will attempt to print out the instructions preventing the region
+from being executed in SPMD-mode. Calls to functions outside the current
+translation unit will prevent this transformation from occurring as well, but
+this can be overridden with an assumption on the called function stating that
+it contains no code that prevents SPMD execution.
+
+Examples
+--------
+
+Calls to functions outside the current translation unit may contain instructions
+or operations that cannot be executed in SPMD-mode.
+
+.. code-block:: c++
+
+ extern int work();
+
+ void use(int x);
+
+ void foo() {
+ #pragma omp target teams
+ {
+ int x = work();
+ #pragma omp parallel
+ use(x);
+
+ }
+ }
+
+
+.. code-block:: console
+
+ $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass-analysis=openmp-opt omp121.cpp
+ omp121.cpp:8:13: remark: Value has potential side effects preventing SPMD-mode
+ execution. Add `__attribute__((assume("ompx_spmd_amenable")))` to the called function
+ to override. [OMP121]
+ int x = work();
+ ^
+
+As the remark suggests, the problem is caused by the unknown call to the
+external function ``work``. This can be overridden by asserting that it does not
+contain any code that prevents SPMD-mode execution.
+
+.. code-block:: c++
+
+ __attribute__((assume("ompx_spmd_amenable"))) extern int work();
+
+ void use(int x);
+
+ void foo() {
+ #pragma omp target teams
+ {
+ int x = work();
+ #pragma omp parallel
+ use(x);
+
+ }
+ }
+
+.. code-block:: console
+
+ $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass=openmp-opt omp121.cpp
+ omp121.cpp:6:1: remark: Transformed generic-mode kernel to SPMD-mode. [OMP120]
+ #pragma omp target teams
+ ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading analysis remark.
diff --git a/openmp/docs/remarks/OMP130.rst b/openmp/docs/remarks/OMP130.rst
new file mode 100644
index 0000000000000..9ba8d24efe135
--- /dev/null
+++ b/openmp/docs/remarks/OMP130.rst
@@ -0,0 +1,36 @@
+Removing unused state machine from generic-mode kernel. [OMP130]
+================================================================
+
+.. _omp130:
+
+This optimization remark indicates that an unused state machine was removed from
+a target region. This occurs when there are no parallel regions inside of a
+target construct. Normally, a state machine is required to schedule the threads
+inside of a parallel region. If there are no parallel regions, the state machine
+is unnecessary because there is only a single thread active at any time.
+
+Examples
+--------
+
+This optimization should occur on any target region that does not contain any
+parallel work.
+
+.. code-block:: c++
+
+ void copy(int N, double *X, double *Y) {
+ #pragma omp target teams distribute map(tofrom: X[0:N]) map(tofrom: Y[0:N])
+ for (int i = 0; i < N; ++i)
+ Y[i] = X[i];
+ }
+
+.. code-block:: console
+
+ $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass=openmp-opt omp130.cpp
+ omp130.cpp:2:1: remark: Removing unused state machine from generic-mode kernel. [OMP130]
+ #pragma omp target teams distribute map(tofrom: X[0:N]) map(tofrom: Y[0:N])
+ ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading optimization remark.
diff --git a/openmp/docs/remarks/OMP131.rst b/openmp/docs/remarks/OMP131.rst
new file mode 100644
index 0000000000000..0864fc8cd7f1a
--- /dev/null
+++ b/openmp/docs/remarks/OMP131.rst
@@ -0,0 +1,45 @@
+Rewriting generic-mode kernel with a customized state machine. [OMP131]
+=======================================================================
+
+.. _omp131:
+
+This optimization remark indicates that a generic-mode kernel on the device was
+specialized for the given target region. When offloading in generic-mode, a
+state machine is required to schedule the work between the parallel worker
+threads. This optimization specializes the state machine in cases where there is
+a known number of parallel regions inside the kernel.
+
+Examples
+--------
+
+This optimization should occur on any generic-mode kernel that has visibility on
+all parallel regions, but cannot be moved to SPMD-mode.
+
+.. code-block:: c++
+
+ #pragma omp declare target
+ int TID;
+ #pragma omp end declare target
+
+ void foo() {
+ #pragma omp target
+ {
+ TID = omp_get_thread_num();
+ #pragma omp parallel
+ {
+ work();
+ }
+ }
+ }
+
+.. code-block:: console
+
+ $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass=openmp-opt omp131.cpp
+ omp131.cpp:8:1: remark: Rewriting generic-mode kernel with a customized state machine. [OMP131]
+ #pragma omp target
+ ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading optimization remark.
diff --git a/openmp/docs/remarks/OMP132.rst b/openmp/docs/remarks/OMP132.rst
new file mode 100644
index 0000000000000..6ba518f0cde76
--- /dev/null
+++ b/openmp/docs/remarks/OMP132.rst
@@ -0,0 +1,45 @@
+Generic-mode kernel is executed with a customized state machine that requires a fallback. [OMP132]
+==================================================================================================
+
+.. _omp132:
+
+This analysis remark indicates that a state machine rewrite occurred, but
+could not be done fully because of unknown calls to functions that may contain
+parallel regions. The state machine handles scheduling work between parallel
+worker threads on the device when operating in generic-mode. If there are
+unknown parallel regions, the optimization cannot fully rewrite the state
+machine.
+
+Examples
+--------
+
+This will occur for any generic-mode kernel that may contain unknown parallel
+regions. This is typically coupled with the :ref:`OMP133 <omp133>` remark.
+
+.. code-block:: c++
+
+ extern void setup();
+
+ void foo() {
+ #pragma omp target
+ {
+ setup();
+ #pragma omp parallel
+ {
+ work();
+ }
+ }
+ }
+
+.. code-block:: console
+
+ $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass-analysis=openmp-opt omp132.cpp
+  omp132.cpp:4:1: remark: Generic-mode kernel is executed with a customized state machine
+ that requires a fallback. [OMP132]
+ #pragma omp target
+ ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading analysis remark.
diff --git a/openmp/docs/remarks/OMP133.rst b/openmp/docs/remarks/OMP133.rst
new file mode 100644
index 0000000000000..3bc1a33ce8cb9
--- /dev/null
+++ b/openmp/docs/remarks/OMP133.rst
@@ -0,0 +1,70 @@
+Call may contain unknown parallel regions. Use `__attribute__((assume("omp_no_parallelism")))` to override. [OMP133]
+====================================================================================================================
+
+.. _omp133:
+
+This analysis remark identifies calls that prevented :ref:`OMP131 <omp131>` from
+providing the generic-mode kernel with a fully specialized state machine. This
+remark will identify each call that may contain unknown parallel regions that
+caused the kernel to require a fallback.
+
+Examples
+--------
+
+This will occur for any generic-mode kernel that may contain unknown parallel
+regions. This is typically coupled with the :ref:`OMP132 <omp132>` remark.
+
+.. code-block:: c++
+
+ extern void setup();
+
+ void foo() {
+ #pragma omp target
+ {
+ setup();
+ #pragma omp parallel
+ {
+ work();
+ }
+ }
+ }
+
+.. code-block:: console
+
+ $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass-analysis=openmp-opt omp133.cpp
+ omp133.cpp:6:5: remark: Call may contain unknown parallel regions. Use
+ `__attribute__((assume("omp_no_parallelism")))` to override. [OMP133]
+ setup();
+ ^
+
+The remark suggests marking the function with the assumption that it contains no
+parallel regions. If this is done then the kernel will be rewritten with a fully
+specialized state machine.
+
+.. code-block:: c++
+
+ __attribute__((assume("omp_no_parallelism"))) extern void setup();
+
+
+ void foo() {
+ #pragma omp target
+ {
+ setup();
+ #pragma omp parallel
+ {
+ work();
+ }
+ }
+ }
+
+.. code-block:: console
+
+ $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O2 -Rpass=openmp-opt omp133.cpp
+ omp133.cpp:4:1: remark: Rewriting generic-mode kernel with a customized state machine. [OMP131]
+ #pragma omp target
+ ^
+
+Diagnostic Scope
+----------------
+
+OpenMP target offloading analysis remark.
diff --git a/openmp/docs/remarks/OMP140.rst b/openmp/docs/remarks/OMP140.rst
new file mode 100644
index 0000000000000..57fc4653267e2
--- /dev/null
+++ b/openmp/docs/remarks/OMP140.rst
@@ -0,0 +1,49 @@
+Could not internalize function. Some optimizations may not be possible. [OMP140]
+================================================================================
+
+.. _omp140:
+
+This analysis remark indicates that function internalization failed for the
+given function. Internalization occurs when a call to a function that ordinarily
+has external visibility is replaced with a call to a copy of that function with
+only internal visibility. This allows the compiler to make strong static
+assertions about the context a function is called in. Without internalization
+this analysis would always be invalidated by the possibility of someone calling
+the function in a different context outside of the current translation unit.
+This is necessary for optimizations like :ref:`OMP111 <omp111>` and :ref:`OMP120
+<omp120>`. If a function failed to be internalized it most likely has linkage
+that cannot be copied. Internalization is currently only enabled by default for
+OpenMP target offloading.
+
+Examples
+--------
+
+This will occur for any function declaration that has incompatible linkage.
+
+.. code-block:: c++
+
+ __attribute__((weak)) void setup();
+
+ void foo() {
+ #pragma omp target
+ {
+ setup();
+ #pragma omp parallel
+ {
+ work();
+ }
+ }
+ }
+
+.. code-block:: console
+
+ $ clang++ -fopenmp -fopenmp-targets=nvptx64 -O1 -Rpass-analysis=openmp-opt omp140.cpp
+ omp140.cpp:1:1: remark: Could not internalize function. Some optimizations may not
+ be possible. [OMP140]
+  __attribute__((weak)) void setup();
+ ^
+
+Diagnostic Scope
+----------------
+
+OpenMP analysis remark.
diff --git a/openmp/docs/remarks/OMP150.rst b/openmp/docs/remarks/OMP150.rst
new file mode 100644
index 0000000000000..0016a58955ac1
--- /dev/null
+++ b/openmp/docs/remarks/OMP150.rst
@@ -0,0 +1,42 @@
+Parallel region merged with parallel region at <location>. [OMP150]
+===================================================================
+
+.. _omp150:
+
+This optimization remark indicates that a parallel region was merged with others
+into a single parallel region. Parallel region merging fuses consecutive
+parallel regions to reduce the overhead of repeatedly forking threads and to
+increase the scope of possible OpenMP-specific optimizations within the merged
+parallel region. This optimization can also guard sequential code between two
+parallel regions if applicable.
+
+Example
+-------
+
+This optimization should apply to any compatible and consecutive parallel
+regions. In this case the sequential region between the parallel regions will be
+guarded so it is only executed by a single thread in the new merged region.
+
+.. code-block:: c++
+
+ void foo() {
+ #pragma omp parallel
+ parallel_work();
+
+ sequential_work();
+
+ #pragma omp parallel
+ parallel_work();
+ }
+
+.. code-block:: console
+
+ $ clang++ -fopenmp -O2 -Rpass=openmp-opt -mllvm -openmp-opt-enable-merging omp150.cpp
+  omp150.cpp:2:1: remark: Parallel region merged with parallel region at omp150.cpp:7:1. [OMP150]
+ #pragma omp parallel
+ ^
+
+Diagnostic Scope
+----------------
+
+OpenMP optimization remark.
diff --git a/openmp/docs/remarks/OMP160.rst b/openmp/docs/remarks/OMP160.rst
new file mode 100644
index 0000000000000..a65315f844161
--- /dev/null
+++ b/openmp/docs/remarks/OMP160.rst
@@ -0,0 +1,44 @@
+Removing parallel region with no side-effects. [OMP160]
+=======================================================
+
+.. _omp160:
+
+This optimization remark indicates that a parallel region was deleted because it
+was not found to have any side-effects. This can occur if the region does not
+write any of its results to memory visible outside the region. This optimization
+is necessary because the barrier between sequential and parallel code typically
+prevents dead code elimination from completely removing the region. Without
+this optimization there would still be overhead to fork and join the threads.
+
+Example
+-------
+
+This optimization occurs whenever a parallel region is not found to have any
+side-effects. This can occur if the parallel region only reads memory or is
+simply empty.
+
+.. code-block:: c++
+
+ void foo() {
+ #pragma omp parallel
+ { }
+ #pragma omp parallel
+ { int x = 1; }
+ }
+
+.. code-block:: console
+
+ $ clang++ -fopenmp -O2 -Rpass=openmp-opt omp160.cpp
+ omp160.cpp:4:1: remark: Removing parallel region with no side-effects. [OMP160] [-Rpass=openmp-opt]
+ #pragma omp parallel
+ ^
+  omp160.cpp:2:1: remark: Removing parallel region with no side-effects. [OMP160] [-Rpass=openmp-opt]
+  #pragma omp parallel
+  ^
+
+Diagnostic Scope
+----------------
+
+OpenMP optimization remark.
diff --git a/openmp/docs/remarks/OMP170.rst b/openmp/docs/remarks/OMP170.rst
new file mode 100644
index 0000000000000..5fcb540fc6b86
--- /dev/null
+++ b/openmp/docs/remarks/OMP170.rst
@@ -0,0 +1,41 @@
+OpenMP runtime call <call> deduplicated. [OMP170]
+=================================================
+
+.. _omp170:
+
+This optimization remark indicates that a duplicated OpenMP runtime call was
+replaced with the result of an existing one. This occurs when the compiler knows
+that the result of a runtime call is immutable. Removing duplicate calls is done
+by replacing all calls to that function with the result of the first call. This
+cannot be done by generic optimizations because the implementations of the
+OpenMP runtime calls live in a separate library the compiler cannot see into.
+
+Example
+-------
+
+This optimization will trigger for known OpenMP runtime calls whose return value
+will not change.
+
+.. code-block:: c++
+
+ void foo(int N) {
+    double *A = malloc(N * omp_get_thread_limit() * sizeof(double));
+    double *B = malloc(N * omp_get_thread_limit() * sizeof(double));
+
+ #pragma omp parallel
+ work(&A[omp_get_thread_num() * N]);
+ #pragma omp parallel
+ work(&B[omp_get_thread_num() * N]);
+ }
+
+.. code-block:: console
+
+ $ clang -fopenmp -O2 -Rpass=openmp-opt omp170.c
+  omp170.c:2:26: remark: OpenMP runtime call omp_get_thread_limit deduplicated. [OMP170]
+ double *A = malloc(N * omp_get_thread_limit());
+ ^
+
+Diagnostic Scope
+----------------
+
+OpenMP optimization remark.
diff --git a/openmp/docs/remarks/OptimizationRemarks.rst b/openmp/docs/remarks/OptimizationRemarks.rst
index 4c256fd996944..1463a61be40e2 100644
--- a/openmp/docs/remarks/OptimizationRemarks.rst
+++ b/openmp/docs/remarks/OptimizationRemarks.rst
@@ -15,41 +15,95 @@ features of the remark system the clang documentation should be consulted:
<https://clang.llvm.org/docs/ClangCommandLineReference.html#cmdoption-clang1-fsave-optimization-record>`_
-.. _ompXXX:
-
-Some OpenMP remarks start with a "tag", like `[OMP100]`, which indicates that
-there is further information about them on this page. To directly jump to the
-respective entry, navigate to
-`https://openmp.llvm.org/docs/remarks/OptimizationRemarks.html#ompXXX <https://openmp.llvm.org/docs/remarks/OptimizationRemarks.html#ompXXX>`_ where `XXX` is
-the three digit code shown in the tag.
-
-
-----
-
-
-.. _omp100:
-.. _omp_no_external_caller_in_target_region:
-
-`[OMP100]` Potentially unknown OpenMP target region caller
-----------------------------------------------------------
-
-A function remark that indicates the function, when compiled for a GPU, is
-potentially called from outside the translation unit. Note that a remark is
-only issued if we tried to perform an optimization which would require us to
-know all callers on the GPU.
-
-To facilitate OpenMP semantics on GPUs we provide a runtime mechanism through
-which the code that makes up the body of a parallel region is shared with the
-threads in the team. Generally we use the address of the outlined parallel
-region to identify the code that needs to be executed. If we know all target
-regions that reach the parallel region we can avoid this function pointer
-passing scheme and often improve the register usage on the GPU. However, If a
-parallel region on the GPU is in a function with external linkage we may not
-know all callers statically. If there are outside callers within target
-regions, this remark is to be ignored. If there are no such callers, users can
-modify the linkage and thereby help optimization with a `static` or
-`__attribute__((internal))` function annotation. If changing the linkage is
-impossible, e.g., because there are outside callers on the host, one can split
-the function into an external visible interface which is not compiled for
-the target and an internal implementation which is compiled for the target
-and should be called from within the target region.
+OpenMP Remarks
+--------------
+
+.. toctree::
+ :hidden:
+ :maxdepth: 1
+
+ OMP100
+ OMP101
+ OMP102
+ OMP110
+ OMP111
+ OMP112
+ OMP113
+ OMP120
+ OMP121
+ OMP130
+ OMP131
+ OMP132
+ OMP133
+ OMP140
+ OMP150
+ OMP160
+ OMP170
+
+.. list-table::
+ :widths: 15 15 70
+ :header-rows: 1
+
+ * - Diagnostics Number
+ - Diagnostics Kind
+ - Diagnostics Description
+ * - :ref:`OMP100 <omp100>`
+ - Analysis
+ - Potentially unknown OpenMP target region caller.
+ * - :ref:`OMP101 <omp101>`
+ - Analysis
+ - Parallel region is used in unknown / unexpected ways. Will not attempt to
+ rewrite the state machine.
+ * - :ref:`OMP102 <omp102>`
+ - Analysis
+ - Parallel region is not called from a unique kernel. Will not attempt to
+ rewrite the state machine.
+ * - :ref:`OMP110 <omp110>`
+ - Optimization
+ - Moving globalized variable to the stack.
+ * - :ref:`OMP111 <omp111>`
+ - Optimization
+ - Replaced globalized variable with X bytes of shared memory.
+ * - :ref:`OMP112 <omp112>`
+ - Missed
+ - Found thread data sharing on the GPU. Expect degraded performance due to
+ data globalization.
+ * - :ref:`OMP113 <omp113>`
+ - Missed
+ - Could not move globalized variable to the stack. Variable is potentially
+ captured in call. Mark parameter as `__attribute__((noescape))` to
+ override.
+ * - :ref:`OMP120 <omp120>`
+ - Optimization
+ - Transformed generic-mode kernel to SPMD-mode.
+ * - :ref:`OMP121 <omp121>`
+ - Analysis
+ - Value has potential side effects preventing SPMD-mode execution. Add
+ `__attribute__((assume(\"ompx_spmd_amenable\")))` to the called function
+ to override.
+ * - :ref:`OMP130 <omp130>`
+ - Optimization
+ - Removing unused state machine from generic-mode kernel.
+ * - :ref:`OMP131 <omp131>`
+ - Optimization
+ - Rewriting generic-mode kernel with a customized state machine.
+ * - :ref:`OMP132 <omp132>`
+ - Analysis
+ - Generic-mode kernel is executed with a customized state machine that
+ requires a fallback.
+ * - :ref:`OMP133 <omp133>`
+ - Analysis
+ - Call may contain unknown parallel regions. Use
+ `__attribute__((assume("omp_no_parallelism")))` to override.
+ * - :ref:`OMP140 <omp140>`
+ - Analysis
+ - Could not internalize function. Some optimizations may not be possible.
+ * - :ref:`OMP150 <omp150>`
+ - Optimization
+ - Parallel region merged with parallel region at <location>.
+ * - :ref:`OMP160 <omp160>`
+ - Optimization
+ - Removing parallel region with no side-effects.
+ * - :ref:`OMP170 <omp170>`
+ - Optimization
+ - OpenMP runtime call <call> deduplicated.