[llvm] Adding Separate OpenMP Offloading Backend to `libcxx/include/__algorithm/pstl_backends` (PR #66968)

Louis Dionne via llvm-commits llvm-commits at lists.llvm.org
Mon Oct 30 13:27:43 PDT 2023


================
@@ -0,0 +1,81 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef _LIBCPP___ALGORITHM_PSTL_BACKENDS_OPENMP_BACKEND_H
+#define _LIBCPP___ALGORITHM_PSTL_BACKENDS_OPENMP_BACKEND_H
+
+#include <__config>
+
+/*
+Combined OpenMP CPU and GPU Backend
+===================================
+Unlike the CPU backends found in ./cpu_backends/, the OpenMP backend can target
+both CPUs and GPUs. The OpenMP standard requires that, when code is offloaded to
+an accelerator, the compiler also generates fallback code for execution on the
+host. Hence, the backend behaves like a CPU backend if no targeted accelerator
+is available at execution time. The target regions can also be compiled directly
+for a CPU architecture, for instance by passing the command-line option
+`-fopenmp-targets=x86_64-pc-linux-gnu` to Clang.
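+
+For illustration only (the file name and target triples are examples and depend
+on the installed toolchain), the same translation unit could be built as a
+host-only binary, with GPU offloading, or with CPU "offloading":
+
+    clang++ -stdlib=libc++ -fopenmp my_algorithm.cpp
+    clang++ -stdlib=libc++ -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda my_algorithm.cpp
+    clang++ -stdlib=libc++ -fopenmp -fopenmp-targets=x86_64-pc-linux-gnu my_algorithm.cpp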
+
+When is an Algorithm Offloaded?
+-------------------------------
+Only parallel algorithms with the parallel unsequenced execution policy are
+offloaded to the device. We cannot offload parallel algorithms with the parallel
+execution policy to GPUs because invocations executing in the same thread "are
+indeterminately sequenced with respect to each other", which we cannot guarantee
+on a GPU.
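+
+As a usage sketch (the container and lambdas are hypothetical user code), only
+the first call below is a candidate for offloading; the second uses the
+parallel policy and is always handled by a CPU backend:
+
+    #include <algorithm>
+    #include <execution>
+    #include <vector>
+
+    int main() {
+      std::vector<double> v(1024, 1.0);
+      // May be offloaded: parallel unsequenced policy.
+      std::for_each(std::execution::par_unseq, v.begin(), v.end(),
+                    [](double& x) { x *= 2.0; });
+      // Not offloaded: parallel policy stays on the host.
+      std::for_each(std::execution::par, v.begin(), v.end(),
+                    [](double& x) { x *= 2.0; });
+    }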
+
+The standard draft states that "the semantics [...] allow the implementation to
+fall back to sequential execution if the system cannot parallelize an algorithm
+invocation". If it is not deemed safe to offload the parallel algorithm to the
+device, we first fall back to a parallel unsequenced implementation from
+./cpu_backends. The CPU implementation may then fall back to sequential
+execution. In that way we strive to achieve the best possible performance.
+
+Further, "it is the caller's responsibility to ensure that the invocation does
+not introduce data races or deadlocks."
+
+Implicit Assumptions
+--------------------
+If the user provides a function pointer as an argument to a parallel algorithm,
+it is assumed to be a device pointer, as there is currently no way to check
+whether a host or a device pointer was passed.
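+
+As a sketch of what this implies for user code (the function name and the
+declare target annotation are illustrative, and the details depend on the
+OpenMP implementation), a plain function would typically have to be made
+available on the device before its address is passed to an algorithm:
+
+    #pragma omp declare target
+    void scale(double& x) { x *= 2.0; }
+    #pragma omp end declare target
+
+    // The address of `scale` is assumed to be usable on the device.
+    std::for_each(std::execution::par_unseq, v.begin(), v.end(), &scale);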
+
+Mapping Clauses
+---------------
+In some of the parallel algorithms, the user is allowed to provide the same
+iterator as input and output, so the order of the maps matters:
+`pragma omp target data map(to:...)` must be used before
+`pragma omp target data map(alloc:...)` when transferring data to the device.
+Conversely, the maps with map type `release` must be placed before the maps
+with map type `from` when transferring the result from the device to the host.
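+
+As an illustrative sketch (identifiers are hypothetical, and the directives are
+written as unstructured enter/exit data mappings), an algorithm whose output
+iterator may alias its input would map its data roughly as follows. The likely
+reason the order matters is OpenMP's mapping reference count: data is copied
+only when a buffer is mapped for the first time or unmapped for the last time.
+
+    // `to` before `alloc`: if `in` and `out` alias, the input is copied to the
+    // device by the first directive; the second then finds the buffer already
+    // present and only increments its reference count.
+    #pragma omp target enter data map(to : in[0 : n])
+    #pragma omp target enter data map(alloc : out[0 : n])
+    // ... run the offloaded kernel ...
+    // `release` before `from`: the first directive only decrements the
+    // reference count, so the `from` map performs the final unmapping and
+    // copies the result back to the host.
+    #pragma omp target exit data map(release : in[0 : n])
+    #pragma omp target exit data map(from : out[0 : n])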
----------------
ldionne wrote:

Can you explain why this matters? I don't remember the exact reason but I remember it was insightful.

https://github.com/llvm/llvm-project/pull/66968


More information about the llvm-commits mailing list