[clang] [NFC][docs][HIP] Update HIP docs around `hipstdpar` and SPIR-V (PR #124803)

Alex Voicu via cfe-commits cfe-commits at lists.llvm.org
Tue Jan 28 11:22:50 PST 2025


https://github.com/AlexVlx updated https://github.com/llvm/llvm-project/pull/124803

>From ed7ee56acc5434233191cfb1c556165f193c4005 Mon Sep 17 00:00:00 2001
From: Alex Voicu <alexandru.voicu at amd.com>
Date: Tue, 28 Jan 2025 17:45:07 +0000
Subject: [PATCH 1/3] Add missing `hipstdpar` documentation; remove reference
 to temporary SPIR-V limitation which got lifted.

---
 clang/docs/HIPSupport.rst | 402 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 394 insertions(+), 8 deletions(-)

diff --git a/clang/docs/HIPSupport.rst b/clang/docs/HIPSupport.rst
index e830acd8dd85c0..34abd389b36fe4 100644
--- a/clang/docs/HIPSupport.rst
+++ b/clang/docs/HIPSupport.rst
@@ -27,7 +27,8 @@ AMD GPU Support
 Clang provides HIP support on AMD GPUs via the ROCm platform `<https://rocm.docs.amd.com/en/latest/#>`_.
 The ROCm runtime forms the base for HIP host APIs, while HIP device APIs are realized through HIP header
 files and the ROCm device library. The Clang driver uses the HIPAMD toolchain to compile HIP device code
-to AMDGPU ISA via the AMDGPU backend. The compiled code is then bundled and embedded in the host executables.
+to AMDGPU ISA via the AMDGPU backend, or SPIR-V via the workflow outlined below.
+The compiled code is then bundled and embedded in the host executables.
 
 Intel GPU Support
 =================
@@ -285,6 +286,398 @@ Example Usage
       basePtr->virtualFunction(); // Allowed since obj is constructed in device code
    }
 
+C++ Standard Parallelism Offload Support: Compiler And Runtime
+==============================================================
+
+Introduction
+============
+
+This section describes the implementation of support for offloading the
+execution of standard C++ algorithms to accelerators that can be targeted via
+HIP. Furthermore, it enumerates restrictions on user defined code, as well as
+the interactions with runtimes.
+
+Algorithm Offload: What, Why, Where
+===================================
+
+C++17 introduced overloads
+`for most algorithms in the standard library <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0024r2.html>`_
+which allow the user to specify a desired
+`execution policy <https://en.cppreference.com/w/cpp/algorithm#Execution_policies>`_.
+The `parallel_unsequenced_policy <https://en.cppreference.com/w/cpp/algorithm/execution_policy_tag_t>`_
+maps relatively well to the execution model of AMD GPUs. This, coupled with
+the availability and maturity of GPU accelerated algorithm libraries that
+implement most / all corresponding algorithms in the standard library
+(e.g. `rocThrust <https://github.com/ROCmSoftwarePlatform/rocThrust>`_), makes
+it feasible to provide seamless accelerator offload for supported algorithms,
+when an accelerated version exists. Thus, it becomes possible to easily access
+the computational resources of an AMD accelerator, via a well specified,
+familiar, algorithmic interface, without having to delve into low-level hardware
+specific details. Putting it all together:
+
+- **What**: standard library algorithms, when invoked with the
+  ``parallel_unsequenced_policy``
+- **Why**: democratise AMDGPU accelerator programming, without loss of user
+  familiarity
+- **Where**: only AMDGPU accelerators targeted by Clang/LLVM via HIP
+
+Small Example
+=============
+
+Given the following C++ code:
+
+.. code-block:: C++
+
+   bool has_the_answer(const std::vector<int>& v) {
+     return std::find(std::execution::par_unseq, std::cbegin(v), std::cend(v), 42) != std::cend(v);
+   }
+
+if Clang is invoked with the ``--hipstdpar --offload-arch=foo`` flags, the call
+to ``find`` will be offloaded to an accelerator that is part of the ``foo``
+target family. If either ``foo`` or its runtime environment does not support
+transparent on-demand paging (such as e.g. that provided in Linux via
+`HMM <https://docs.kernel.org/mm/hmm.html>`_), it is necessary to also include
+the ``--hipstdpar-interpose-alloc`` flag. If the accelerator specific algorithm
+library used by ``foo`` does not have an implementation of a particular algorithm,
+execution seamlessly falls back to the host CPU. It is legal to specify multiple
+``--offload-arch``s. All the flags we introduce, as well as a thorough view of
+various restrictions and their implications will be provided below.
+
+Implementation - General View
+=============================
+
+We built Algorithm Offload support atop the pre-existing HIP
+infrastructure. More specifically, when one requests offload via ``--hipstdpar``,
+compilation is switched to HIP compilation, as if ``-x hip`` was specified.
+Similarly, linking is also switched to HIP linking, as if ``--hip-link`` was
+specified. Note that these are implicit, and one should not assume that any
+interop with HIP specific language constructs is available e.g. ``__device__``
+annotations are neither necessary nor guaranteed to work.
+
+Since there are no language restriction mechanisms in place, it is necessary to
+relax HIP language specific semantic checks performed by the FE; they would
+identify otherwise valid, offloadable code as invalid HIP code. Given that we
+know that the user intended only for certain algorithms to be offloaded, and
+encoded this by specifying the ``parallel_unsequenced_policy``, we rely on a
+pass over IR to clean up any and all code that was not "meant" for offload. If
+requested, allocation interposition is also handled via a separate pass over IR.
+
+To interface with the client HIP runtime, and to forward offloaded algorithm
+invocations to the corresponding accelerator specific library implementation, an
+implementation detail forwarding header is implicitly included by the driver,
+when compiling with ``--hipstdpar``. In what follows, we will delve into each
+component that contributes to implementing Algorithm Offload support.
+
+Implementation - Driver
+=======================
+
+We augment the ``clang`` driver with the following flags:
+
+- ``--hipstdpar`` enables algorithm offload, which, depending on phase, has the
+  following effects:
+
+  - when compiling:
+
+    - ``-x hip`` gets prepended to enable HIP support;
+    - the ``ROCmToolchain`` component checks for the ``hipstdpar_lib.hpp``
+      forwarding header,
+      `rocThrust <https://rocm.docs.amd.com/projects/rocThrust/en/latest/>`_ and
+      `rocPrim <https://rocm.docs.amd.com/projects/rocPRIM/en/latest/>`_ in
+      their canonical locations, which can be overridden via flags found below;
+      if all are found, the forwarding header gets implicitly included,
+      otherwise an error listing the missing component is generated;
+    - the ``LangOpts.HIPStdPar`` member is set.
+
+  - when linking:
+
+    - ``--hip-link`` and ``-frtlib-add-rpath`` get appended to enable HIP
+      support.
+
+- ``--hipstdpar-interpose-alloc`` enables the interposition of standard
+  allocation / deallocation functions with accelerator aware equivalents; the
+  ``LangOpts.HIPStdParInterposeAlloc`` member is set;
+- ``--hipstdpar-path=`` specifies a non-canonical path for the forwarding
+  header; it must point to the folder where the header is located and not to the
+  header itself;
+- ``--hipstdpar-thrust-path=`` specifies a non-canonical path for
+  `rocThrust <https://rocm.docs.amd.com/projects/rocThrust/en/latest/>`_; it
+  must point to the folder where the library is installed / built under a
+  ``/thrust`` subfolder;
+- ``--hipstdpar-prim-path=`` specifies a non-canonical path for
+  `rocPrim <https://rocm.docs.amd.com/projects/rocPRIM/en/latest/>`_; it must
+  point to the folder where the library is installed / built under a
+  ``/rocprim`` subfolder.
+
+The `--offload-arch <https://llvm.org/docs/AMDGPUUsage.html#amdgpu-processors>`_
+flag can be used to specify the accelerator for which offload code is to be
+generated.
+
+Implementation - Front-End
+==========================
+
+When ``LangOpts.HIPStdPar`` is set, we relax some of the HIP language specific
+``Sema`` checks to account for the fact that we want to consume pure unannotated
+C++ code:
+
+1. ``__device__`` / ``__host__ __device__`` functions (which would originate in
+   the accelerator specific algorithm library) are allowed to call implicitly
+   ``__host__`` functions;
+2. ``__global__`` functions (which would originate in the accelerator specific
+   algorithm library) are allowed to call implicitly ``__host__`` functions;
+3. resolving ``__builtin`` availability is deferred, because it is possible that
+   a ``__builtin`` that is unavailable on the target accelerator is not
+   reachable from any offloaded algorithm, and thus will be safely removed in
+   the middle-end;
+4. ASM parsing / checking is deferred, because it is possible that an ASM block
+   that e.g. uses some constraints that are incompatible with the target
+   accelerator is not reachable from any offloaded algorithm, and thus will be
+   safely removed in the middle-end.
+
+``CodeGen`` is similarly relaxed, with implicitly ``__host__`` functions being
+emitted as well.
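+
+As an illustration, consider the following sketch (purely illustrative; the
+function names are hypothetical): a callable passed to an offloaded algorithm
+calls a plain, unannotated function. Under regular HIP semantic checking the
+implicit ``__host__`` / ``__device__`` attributes would make the call
+ill-formed; with the relaxations above it is accepted, and the implicitly
+``__host__`` callee is emitted for the accelerator as well:
+
+.. code-block:: c++
+
+   #include <algorithm>
+   #include <cmath>
+   #include <execution>
+   #include <vector>
+
+   // A plain C++ function; under HIP rules it is implicitly __host__.
+   float scale(float x) { return std::sqrt(x) * 0.5f; }
+
+   void halve_roots(std::vector<float>& v) {
+     // When this invocation is offloaded the lambda ends up in device code;
+     // its call to the implicitly __host__ scale is accepted under
+     // --hipstdpar, and scale is emitted for the accelerator as well.
+     std::transform(std::execution::par_unseq, v.cbegin(), v.cend(), v.begin(),
+                    [](float x) { return scale(x); });
+   }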
+
+Implementation - Middle-End
+===========================
+
+We add two ``opt`` passes:
+
+1. ``HipStdParAcceleratorCodeSelectionPass``
+
+   - For all kernels in a ``Module``, compute reachability, where a function
+     ``F`` is reachable from a kernel ``K`` if and only if there exists a direct
+     call-chain rooted in ``F`` that includes ``K``;
+   - Remove all functions that are not reachable from kernels;
+   - This pass is only run when compiling for the accelerator.
+
+The first pass assumes that the only code that the user intended to offload was
+that which was directly or transitively invocable as part of an algorithm
+execution. It also assumes that an accelerator aware algorithm implementation
+would rely on accelerator specific special functions (kernels), and that these
+effectively constitute the only roots for accelerator execution graphs. Both of
+these assumptions are based on observing how widespread accelerators,
+such as GPUs, work.
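+
+As a simplified illustration of the first pass (the function names below are
+purely hypothetical), consider a translation unit that mixes host-only code
+with an offloaded algorithm invocation; when compiling for the accelerator,
+only the code reachable as part of the offloaded algorithm's execution is
+retained:
+
+.. code-block:: c++
+
+   #include <algorithm>
+   #include <execution>
+   #include <fstream>
+   #include <string>
+   #include <vector>
+
+   // Host-only I/O; not reachable from any kernel, so the accelerator code
+   // selection pass removes it from the accelerator-side Module.
+   std::vector<int> read_values(const std::string& path) {
+     std::ifstream in{path};
+     std::vector<int> v;
+     for (int x; in >> x;) v.push_back(x);
+     return v;
+   }
+
+   // The comparator below is reachable from the kernel generated for the
+   // offloaded sort, so it is retained in the accelerator-side Module.
+   void sort_descending(const std::string& path) {
+     auto v = read_values(path);
+     std::sort(std::execution::par_unseq, v.begin(), v.end(),
+               [](int a, int b) { return a > b; });
+   }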
+
+2. ``HipStdParAllocationInterpositionPass``
+
+   - Iterate through all functions in a ``Module``, and replace standard
+     allocation / deallocation functions with accelerator-aware equivalents,
+     based on a pre-established table; the list of functions that can be
+     interposed is available
+     `here <https://github.com/ROCmSoftwarePlatform/roc-stdpar#allocation--deallocation-interposition-status>`_;
+   - This is only run when compiling for the host.
+
+The second pass is optional.
+
+Implementation - Forwarding Header
+==================================
+
+The forwarding header implements two pieces of functionality:
+
+1. It forwards algorithms to a target accelerator, which is done by relying on
+   C++ language rules around overloading:
+
+   - overloads taking an explicit argument of type
+     ``parallel_unsequenced_policy`` are introduced into the ``std`` namespace;
+   - these will get preferentially selected versus the master template;
+   - the body forwards to the equivalent algorithm from the accelerator specific
+     library
+
+2. It provides allocation / deallocation functions that are equivalent to the
+   standard ones, but obtain memory by invoking
+   `hipMallocManaged <https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___memory_m.html#gab8cfa0e292193fa37e0cc2e4911fa90a>`_
+   and release it via `hipFree <https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___memory.html#ga740d08da65cae1441ba32f8fedb863d1>`_.
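+
+A heavily simplified sketch of the first mechanism is shown below. This is not
+the actual content of the forwarding header (which, amongst other things, also
+handles host fallback); it is provided for illustration only, assuming
+rocThrust as the accelerator specific library:
+
+.. code-block:: c++
+
+   #include <execution>
+   #include <utility>
+
+   #include <thrust/execution_policy.h>
+   #include <thrust/for_each.h>
+
+   namespace std {
+   template <typename I, typename F>
+   void for_each(execution::parallel_unsequenced_policy, I first, I last, F f) {
+     // Forward to the equivalent rocThrust algorithm, executed on the device.
+     ::thrust::for_each(::thrust::device, first, last, std::move(f));
+   }
+   } // namespace std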
+
+Predefined Macros
+=================
+
+.. list-table::
+   :header-rows: 1
+
+   * - Macro
+     - Description
+   * - ``__HIPSTDPAR__``
+     - Defined when Clang is compiling code in algorithm offload mode, enabled
+       with the ``--hipstdpar`` compiler option.
+   * - ``__HIPSTDPAR_INTERPOSE_ALLOC__``
+     - Defined only when compiling in algorithm offload mode, when the user
+       enables interposition mode with the ``--hipstdpar-interpose-alloc``
+       compiler option, indicating that all dynamic memory allocation /
+       deallocation functions should be replaced with accelerator aware
+       variants.
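+
+For example, these macros can be used to guard code that should only
+participate in (or be excluded from) algorithm offload compilation; a minimal
+sketch:
+
+.. code-block:: c++
+
+   #if defined(__HIPSTDPAR__)
+   // Compiled only when --hipstdpar is in effect.
+   inline constexpr bool offload_enabled = true;
+   #else
+   inline constexpr bool offload_enabled = false;
+   #endif
+
+   #if defined(__HIPSTDPAR_INTERPOSE_ALLOC__)
+   // Compiled only when --hipstdpar-interpose-alloc is also in effect, i.e.
+   // when allocation / deallocation functions are being interposed.
+   inline constexpr bool interposition_enabled = true;
+   #else
+   inline constexpr bool interposition_enabled = false;
+   #endif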
+
+Restrictions
+============
+
+We define two modes in which runtime execution can occur:
+
+1. **HMM Mode** - this assumes that the
+   `HMM <https://docs.kernel.org/mm/hmm.html>`_ subsystem of the Linux kernel
+   is used to provide transparent on-demand paging i.e. memory obtained from a
+   system / OS allocator such as via a call to ``malloc`` or ``operator new`` is
+   directly accessible to the accelerator and it follows the C++ memory model;
+2. **Interposition Mode** - this is a fallback mode for cases where transparent
+   on-demand paging is unavailable (e.g. in the Windows OS), which means that
+   memory must be allocated via an accelerator aware mechanism, and system
+   allocated memory is inaccessible to the accelerator.
+
+The following restrictions imposed on user code apply to both modes:
+
+1. Pointers to function, and all associated features, such as e.g. dynamic
+   polymorphism, cannot be used (directly or transitively) by the user provided
+   callable passed to an algorithm invocation;
+2. Global / namespace scope / ``static`` / ``thread`` storage duration variables
+   cannot be used (directly or transitively) in name by the user provided
+   callable;
+
+   - When executing in **HMM Mode** they can be used in address e.g.:
+
+     .. code-block:: C++
+
+        namespace { int foo = 42; }
+
+        bool never(const std::vector<int>& v) {
+          return std::any_of(std::execution::par_unseq, std::cbegin(v), std::cend(v), [](auto&& x) {
+            return x == foo;
+          });
+        }
+
+        bool only_in_hmm_mode(const std::vector<int>& v) {
+          return std::any_of(std::execution::par_unseq, std::cbegin(v), std::cend(v),
+                             [p = &foo](auto&& x) { return x == *p; });
+        }
+
+3. Only algorithms that are invoked with the ``parallel_unsequenced_policy`` are
+   candidates for offload;
+4. Only algorithms that are invoked with iterator arguments that model
+   `random_access_iterator <https://en.cppreference.com/w/cpp/iterator/random_access_iterator>`_
+   are candidates for offload;
+5. `Exceptions <https://en.cppreference.com/w/cpp/language/exceptions>`_ cannot
+   be used by the user provided callable;
+6. Dynamic memory allocation (e.g. ``operator new``) cannot be used by the user
+   provided callable;
+7. Selective offload is not possible i.e. it is not possible to indicate that
+   only some algorithms invoked with the ``parallel_unsequenced_policy`` are to
+   be executed on the accelerator.
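+
+As an illustration of restrictions 1 and 6 above, the following sketch (purely
+illustrative) shows a callable that is not valid for offload because it uses
+dynamic memory allocation and dynamic polymorphism:
+
+.. code-block:: c++
+
+   #include <algorithm>
+   #include <execution>
+   #include <iterator>
+   #include <memory>
+   #include <vector>
+
+   struct Base { virtual int value() const = 0; virtual ~Base() = default; };
+   struct Derived final : Base { int value() const override { return 42; } };
+
+   bool invalid_for_offload(const std::vector<int>& v) {
+     return std::any_of(std::execution::par_unseq, std::cbegin(v), std::cend(v),
+                        [](auto&& x) {
+                          // Restriction 6: dynamic memory allocation.
+                          std::unique_ptr<Base> p = std::make_unique<Derived>();
+                          // Restriction 1: virtual dispatch (dynamic polymorphism).
+                          return x == p->value();
+                        });
+   }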
+
+In addition to the above, using **Interposition Mode** imposes the following
+additional restrictions:
+
+1. All code that is expected to interoperate has to be recompiled with the
+   ``--hipstdpar-interpose-alloc`` flag i.e. it is not safe to compose libraries
+   that have been independently compiled;
+2. Automatic storage duration (i.e. stack allocated) variables cannot be used
+   (directly or transitively) by the user provided callable e.g.
+
+   .. code-block:: c++
+
+      bool never(const std::vector<int>& v, int n) {
+        return std::any_of(std::execution::par_unseq, std::cbegin(v), std::cend(v),
+                           [p = &n](auto&& x) { return x == *p; });
+      }
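+
+Capturing ``n`` by value, in contrast, is portable across both modes, since the
+copy is stored in the callable itself rather than referenced through a pointer
+to host stack memory; a minimal equivalent sketch:
+
+.. code-block:: c++
+
+   #include <algorithm>
+   #include <execution>
+   #include <iterator>
+   #include <vector>
+
+   bool works_in_both_modes(const std::vector<int>& v, int n) {
+     // n is copied into the closure, so no automatic storage duration
+     // variable is accessed from device code.
+     return std::any_of(std::execution::par_unseq, std::cbegin(v), std::cend(v),
+                        [n](auto&& x) { return x == n; });
+   }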
+
+Current Support
+===============
+
+At the moment, C++ Standard Parallelism Offload is only available for AMD GPUs,
+when the `ROCm <https://rocm.docs.amd.com/en/latest/>`_ stack is used, on the
+Linux operating system. Support is synthesised in the following table:
+
+.. list-table::
+   :header-rows: 1
+
+   * - `Processor <https://llvm.org/docs/AMDGPUUsage.html#amdgpu-processors>`_
+     - HMM Mode
+     - Interposition Mode
+   * - GCN GFX9 (Vega)
+     - YES
+     - YES
+   * - GCN GFX10.1 (RDNA 1)
+     - *NO*
+     - YES
+   * - GCN GFX10.3 (RDNA 2)
+     - *NO*
+     - YES
+   * - GCN GFX11 (RDNA 3)
+     - *NO*
+     - YES
+
+The minimum Linux kernel version for running in HMM mode is 6.4.
+
+The forwarding header can be obtained from
+`its GitHub repository <https://github.com/ROCmSoftwarePlatform/roc-stdpar>`_.
+It will be packaged with a future `ROCm <https://rocm.docs.amd.com/en/latest/>`_
+release. Because accelerated algorithms are provided via
+`rocThrust <https://rocm.docs.amd.com/projects/rocThrust/en/latest/>`_, a
+transitive dependency on
+`rocPrim <https://rocm.docs.amd.com/projects/rocPRIM/en/latest/>`_ exists. Both
+can be obtained either by installing their associated components of the
+`ROCm <https://rocm.docs.amd.com/en/latest/>`_ stack, or from their respective
+repositories. The list of algorithms that can be offloaded is available
+`here <https://github.com/ROCmSoftwarePlatform/roc-stdpar#algorithm-support-status>`_.
+
+HIP Specific Elements
+---------------------
+
+1. There is no defined interop with the
+   `HIP kernel language <https://rocm.docs.amd.com/projects/HIP/en/latest/reference/kernel_language.html>`_;
+   whilst things like using ``__device__`` annotations might accidentally "work",
+   they are not guaranteed to, and thus cannot be relied upon by user code;
+
+   - A consequence of the above is that both bitcode linking and linking
+     relocatable object files will "work", but this is neither guaranteed to
+     keep working nor actively tested at the moment; this restriction might be
+     relaxed in the future.
+
+2. Combining explicit HIP, CUDA or OpenMP Offload compilation with
+   ``--hipstdpar`` based offloading is not allowed or supported in any way.
+3. There is no way to target different accelerators via a standard algorithm
+   invocation (`this might be addressed in future C++ standards <https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2023/p2500r1.html>`_);
+   an unsafe (per the point above) way of achieving this is to spawn new threads
+   and invoke the `hipSetDevice <https://rocm.docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___device.html#ga43c1e7f15925eeb762195ccb5e063eae>`_
+   interface e.g.:
+
+   .. code-block:: c++
+
+      int accelerator_0 = ...;
+      int accelerator_1 = ...;
+
+      bool multiple_accelerators(const std::vector<int>& u, const std::vector<int>& v) {
+        std::atomic<unsigned int> r{0u};
+
+        std::thread t0{[&]() {
+          hipSetDevice(accelerator_0);
+
+          r += std::count(std::execution::par_unseq, std::cbegin(u), std::cend(u), 42);
+        }};
+        std::thread t1{[&]() {
+          hipSetDevice(accelerator_1);
+
+          r += std::count(std::execution::par_unseq, std::cbegin(v), std::cend(v), 314152);
+        }};
+
+        t0.join();
+        t1.join();
+
+        return r;
+      }
+
+   Note that this is a temporary, unsafe workaround for a deficiency in the C++
+   Standard.
+
+Open Questions / Future Developments
+====================================
+
+1. The restriction on the use of global / namespace scope / ``static`` /
+   ``thread`` storage duration variables in offloaded algorithms will be lifted
+   in the future, when running in **HMM Mode**;
+2. The restriction on the use of dynamic memory allocation in offloaded
+   algorithms will be lifted in the future.
+3. The restriction on the use of pointers to function, and associated features
+   such as dynamic polymorphism might be lifted in the future, when running in
+   **HMM Mode**;
+4. Offload support might be extended to cases where the ``parallel_policy`` is
+   used for some or all targets.
+
 SPIR-V Support on HIPAMD ToolChain
 ==================================
 
@@ -316,13 +709,6 @@ Using ``--offload-arch=amdgcnspirv``
 - **Clang Offload Bundler**: The resulting SPIR-V is embedded in the Clang
   offload bundler with the bundle ID ``hip-spirv64-amd-amdhsa--amdgcnspirv``.
 
-Mixed with Normal ``--offload-arch``
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-**Mixing ``amdgcnspirv`` and concrete ``gfx###`` targets via ``--offload-arch``
-is not currently supported; this limitation is temporary and will be removed in
-a future release**
-
 Architecture Specific Macros
 ----------------------------
 

>From 0628acb08ddf08d7df480a1913c4b84edb48e4c0 Mon Sep 17 00:00:00 2001
From: Alex Voicu <alexandru.voicu at amd.com>
Date: Tue, 28 Jan 2025 19:12:05 +0000
Subject: [PATCH 2/3] Fix inline literal, add RDNA4 to the list of
 architectures needing interposition mode.

---
 clang/docs/HIPSupport.rst | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/clang/docs/HIPSupport.rst b/clang/docs/HIPSupport.rst
index 34abd389b36fe4..a0ce669ba86aa9 100644
--- a/clang/docs/HIPSupport.rst
+++ b/clang/docs/HIPSupport.rst
@@ -340,8 +340,8 @@ transparent on-demand paging (such as e.g. that provided in Linux via
 the ``--hipstdpar-interpose-alloc`` flag. If the accelerator specific algorithm
 library used by ``foo`` does not have an implementation of a particular algorithm,
 execution seamlessly falls back to the host CPU. It is legal to specify multiple
-``--offload-arch``s. All the flags we introduce, as well as a thorough view of
-various restrictions and their implications will be provided below.
+``--offload-arch``\s. All the flags we introduce, as well as a thorough view of
+various restrictions and their implications, will be provided below.
 
 Implementation - General View
 =============================
@@ -600,6 +600,9 @@ Linux operating system. Support is synthesised in the following table:
    * - GCN GFX11 (RDNA 3)
      - *NO*
      - YES
+   * - GCN GFX12 (RDNA 4)
+     - *NO*
+     - YES
 
 The minimum Linux kernel version for running in HMM mode is 6.4.
 

>From c102d61b176bf1d68fa75cfaf5614e54d2f84ee4 Mon Sep 17 00:00:00 2001
From: Alex Voicu <alexandru.voicu at amd.com>
Date: Tue, 28 Jan 2025 19:22:36 +0000
Subject: [PATCH 3/3] Hopefully final fix for RST issues.

---
 clang/docs/HIPSupport.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/clang/docs/HIPSupport.rst b/clang/docs/HIPSupport.rst
index a0ce669ba86aa9..481ed392308130 100644
--- a/clang/docs/HIPSupport.rst
+++ b/clang/docs/HIPSupport.rst
@@ -308,7 +308,7 @@ The `parallel_unsequenced_policy <https://en.cppreference.com/w/cpp/algorithm/ex
 maps relatively well to the execution model of AMD GPUs. This, coupled with
 the availability and maturity of GPU accelerated algorithm libraries that
 implement most / all corresponding algorithms in the standard library
-(e.g. `rocThrust <https://github.com/ROCmSoftwarePlatform/rocThrust>`_), makes
+(e.g. `rocThrust <https://github.com/ROCmSoftwarePlatform/rocThrust>`__), makes
 it feasible to provide seamless accelerator offload for supported algorithms,
 when an accelerated version exists. Thus, it becomes possible to easily access
 the computational resources of an AMD accelerator, via a well specified,
@@ -463,7 +463,7 @@ such as GPUs, work.
      allocation / deallocation functions with accelerator-aware equivalents,
      based on a pre-established table; the list of functions that can be
      interposed is available
-     `here <https://github.com/ROCmSoftwarePlatform/roc-stdpar#allocation--deallocation-interposition-status>`_;
+     `here <https://github.com/ROCmSoftwarePlatform/roc-stdpar#allocation--deallocation-interposition-status>`__;
    - This is only run when compiling for the host.
 
 The second pass is optional.


