[libc-commits] [libc] [libc] Update GPU documentation pages (PR #84076)

Joseph Huber via libc-commits libc-commits at lists.llvm.org
Tue Mar 5 14:01:55 PST 2024


https://github.com/jhuber6 created https://github.com/llvm/llvm-project/pull/84076

Summary:
After the overhaul of the GPU build the documentation pages were a
little stale. This updates them with more in-depth information on
building the GPU runtimes and using them. Specifically using them goes
through the differences between the offloading and direct compilation
modes.


From 87fabb8697ad1695a63aeda9618e8e12a320fa41 Mon Sep 17 00:00:00 2001
From: Joseph Huber <huberjn at outlook.com>
Date: Tue, 5 Mar 2024 08:29:32 -0600
Subject: [PATCH] [libc] Update GPU documentation pages

Summary:
After the overhaul of the GPU build the documentation pages were a
little stale. This updates them with more in-depth information on
building the GPU runtimes and using them. Specifically using them goes
through the differences between the offloading and direct compilation
modes.
---
 libc/docs/full_cross_build.rst |  11 ++
 libc/docs/gpu/building.rst     | 243 ++++++++++++++++++++++++++++++++
 libc/docs/gpu/index.rst        |   3 +-
 libc/docs/gpu/rpc.rst          |   2 +
 libc/docs/gpu/using.rst        | 248 ++++++++++++++++++++++++++-------
 5 files changed, 458 insertions(+), 49 deletions(-)
 create mode 100644 libc/docs/gpu/building.rst

diff --git a/libc/docs/full_cross_build.rst b/libc/docs/full_cross_build.rst
index f06464534f152d..100e17a977e764 100644
--- a/libc/docs/full_cross_build.rst
+++ b/libc/docs/full_cross_build.rst
@@ -94,6 +94,8 @@ The above ``ninja`` command will build the libc static archives ``libc.a`` and
 ``libm.a`` for the target specified with ``-DLIBC_TARGET_TRIPLE`` in the CMake
 configure step.
 
+.. _runtimes_cross_build:
+
 Runtimes cross build
 ====================
 
@@ -230,3 +232,12 @@ component of the target triple as ``none``. For example, to build for a
 32-bit arm target on bare metal, one can use a target triple like
 ``arm-none-eabi``. Other than that, the libc for a bare metal target can be
 built using any of the three recipes described above.
+
+Building for the GPU
+====================
+
+To build for a GPU architecture, it should only be necessary to specify the 
+target triple as one of the supported GPU targets. Currently, this is either 
+``nvptx64-nvidia-cuda`` for NVIDIA GPUs or ``amdgcn-amd-amdhsa`` for AMD GPUs. 
+More detailed information is provided in the :ref:`GPU 
+documentation<libc_gpu_building>`.
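+
+For example, a standalone runtimes configuration for an AMD GPU might look like
+the following. This is only a sketch (it assumes an up-to-date ``clang`` and
+``clang++`` as the host compilers); see the GPU documentation linked above for
+the complete recipes.
+
+.. code-block:: sh
+
+  $> cmake ../runtimes -G Ninja               \
+     -DLLVM_ENABLE_RUNTIMES=libc              \
+     -DCMAKE_C_COMPILER=clang                 \
+     -DCMAKE_CXX_COMPILER=clang++             \
+     -DLLVM_LIBC_FULL_BUILD=ON                \
+     -DLIBC_TARGET_TRIPLE=amdgcn-amd-amdhsa   \
+     -DCMAKE_BUILD_TYPE=Release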
diff --git a/libc/docs/gpu/building.rst b/libc/docs/gpu/building.rst
new file mode 100644
index 00000000000000..96ea678d7a6b64
--- /dev/null
+++ b/libc/docs/gpu/building.rst
@@ -0,0 +1,243 @@
+.. _libc_gpu_building:
+
+======================
+Building libs for GPUs
+======================
+
+.. contents:: Table of Contents
+  :depth: 4
+  :local:
+
+Building the GPU C library
+==========================
+
+This document will present recipes to build the LLVM C library targeting a GPU
+architecture. The GPU build uses the same :ref:`cross build<full_cross_build>`
+support as the other targets. However, the GPU target has the restriction that
+it *must* be built with an up-to-date ``clang`` compiler. This is because the
+GPU target uses several compiler extensions to target GPU architectures.
+
+The LLVM C library currently supports two GPU targets. This is either
+``nvptx64-nvidia-cuda`` for NVIDIA GPUs or ``amdgcn-amd-amdhsa`` for AMD GPUs.
+Targeting these architectures is done through ``clang``'s cross-compiling
+support using the ``--target=<triple>`` flag. The following sections will
+describe how to build the GPU support specifically.
+
+Once you have finished building, refer to :ref:`libc_gpu_usage` to get started
+with the newly built C library.
+
+Standard runtimes build
+-----------------------
+
+The simplest way to build the GPU libc is to use the existing LLVM runtimes
+support. This will automatically handle bootstrapping an up-to-date ``clang``
+compiler and using it to build the C library. The following CMake invocation
+will instruct it to build the ``libc`` runtime targeting both AMD and NVIDIA
+GPUs.
+
+.. code-block:: sh
+
+  $> cd llvm-project  # The llvm-project checkout
+  $> mkdir build
+  $> cd build
+  $> cmake ../llvm -G Ninja                                                 \
+     -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt"                         \
+     -DLLVM_ENABLE_RUNTIMES="openmp"                                        \
+     -DCMAKE_BUILD_TYPE=<Debug|Release>   \ # Select build type
+     -DCMAKE_INSTALL_PREFIX=<PATH>        \ # Where the libraries will live
+     -DRUNTIMES_nvptx64-nvidia-cuda_LLVM_ENABLE_RUNTIMES=libc               \
+     -DRUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES=libc                 \
+     -DLLVM_RUNTIME_TARGETS="default;amdgcn-amd-amdhsa;nvptx64-nvidia-cuda"
+  $> ninja install
+
+
+Since we want to include ``clang``, ``lld`` and ``compiler-rt`` in our
+toolchain, we list them in ``LLVM_ENABLE_PROJECTS``. To ensure ``libc`` is built
+using a compatible compiler and to support ``openmp`` offloading, we list them
+in ``LLVM_ENABLE_RUNTIMES`` to build them after the enabled projects using the
+newly built compiler. We enable the ``libc`` project only for the GPU
+architectures. The ``lld`` linker is required to produce AMDGPU executables
+while ``openmp`` has first-class support for the GPU libc.
+
+Runtimes cross build
+--------------------
+
+For users wanting more direct control over the build process, the build steps
+can be done manually instead. This build closely follows the instructions in the
+:ref:`main documentation<runtimes_cross_build>` but is specialized for the GPU
+build. We follow the same steps to first build the libc tools and a suitable
+compiler. These tools must all be up-to-date with the libc source.
+
+.. code-block:: sh
+
+  $> cd llvm-project  # The llvm-project checkout
+  $> mkdir build-libc-tools # A different build directory for the build tools
+  $> cd build-libc-tools
+  $> HOST_C_COMPILER=<C compiler for the host> # For example "clang"
+  $> HOST_CXX_COMPILER=<C++ compiler for the host> # For example "clang++"
+  $> cmake ../llvm                            \
+     -G Ninja                                 \
+     -DLLVM_ENABLE_PROJECTS="clang;lld;libc"  \
+     -DCMAKE_C_COMPILER=$HOST_C_COMPILER      \
+     -DCMAKE_CXX_COMPILER=$HOST_CXX_COMPILER  \
+     -DLLVM_LIBC_FULL_BUILD=ON                \
+     -DLIBC_HDRGEN_ONLY=ON    \ # Only build the 'libc-hdrgen' tool
+     -DCMAKE_BUILD_TYPE=Release # Release suggested to make "clang" fast
+  $> ninja # Build the 'clang' compiler
+  $> ninja libc-hdrgen # Build the 'libc-hdrgen' tool
+
+Once this has finished, the build directory should contain the ``clang``
+compiler and the ``libc-hdrgen`` executable. We will use the ``clang`` compiler
+to build the GPU code and the ``libc-hdrgen`` tool to create the necessary
+headers. We then use these tools to bootstrap the build out of the runtimes
+directory, targeting the chosen GPU architecture.
+
+.. code-block:: sh
+
+  $> cd llvm-project  # The llvm-project checkout
+  $> mkdir build # A different build directory for the GPU build
+  $> cd build
+  $> TARGET_TRIPLE=<amdgcn-amd-amdhsa or nvptx64-nvidia-cuda>
+  $> TARGET_C_COMPILER=</path/to/clang>
+  $> TARGET_CXX_COMPILER=</path/to/clang++>
+  $> HDRGEN=</path/to/libc-hdrgen>
+  $> cmake ../runtimes \ # Point to the runtimes build
+     -G Ninja                                  \
+     -DLLVM_ENABLE_RUNTIMES=libc               \
+     -DCMAKE_C_COMPILER=$TARGET_C_COMPILER     \
+     -DCMAKE_CXX_COMPILER=$TARGET_CXX_COMPILER \
+     -DLLVM_LIBC_FULL_BUILD=ON                 \
+     -DLLVM_RUNTIMES_TARGET=$TARGET_TRIPLE     \
+     -DLIBC_HDRGEN_EXE=$HDRGEN                 \
+     -DCMAKE_BUILD_TYPE=Release
+  $> ninja install
+
+The above steps will result in a build targeting one of the supported GPU
+architectures. Building for multiple targets requires separate CMake
+invocations.
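+
+For instance, both GPU targets can be configured from two separate build
+directories, reusing the flags shown above (an illustrative sketch; only the
+target-specific flag changes):
+
+.. code-block:: sh
+
+  $> cmake -S ../runtimes -B build-amdgpu -G Ninja <other flags as above> \
+     -DLLVM_RUNTIMES_TARGET=amdgcn-amd-amdhsa
+  $> cmake -S ../runtimes -B build-nvptx -G Ninja <other flags as above>  \
+     -DLLVM_RUNTIMES_TARGET=nvptx64-nvidia-cuda
+  $> ninja -C build-amdgpu install
+  $> ninja -C build-nvptx install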
+
+Standalone cross build
+----------------------
+
+The GPU build can also be targeted directly as long as the compiler used is a
+supported ``clang`` compiler. This method is generally not recommended as it can
+only target a single GPU architecture.
+
+.. code-block:: sh
+
+  $> cd llvm-project  # The llvm-project checkout
+  $> mkdir build # A different build directory for the GPU build
+  $> cd build
+  $> CLANG_C_COMPILER=</path/to/clang> # Must be a trunk build
+  $> CLANG_CXX_COMPILER=</path/to/clang++> # Must be a trunk build
+  $> TARGET_TRIPLE=<amdgcn-amd-amdhsa or nvptx64-nvidia-cuda>
+  $> cmake ../runtimes \ # Point to the runtimes build
+     -G Ninja                                 \
+     -DLLVM_ENABLE_RUNTIMES=libc              \
+     -DCMAKE_C_COMPILER=$CLANG_C_COMPILER     \
+     -DCMAKE_CXX_COMPILER=$CLANG_CXX_COMPILER \
+     -DLLVM_LIBC_FULL_BUILD=ON                \
+     -DLIBC_TARGET_TRIPLE=$TARGET_TRIPLE      \
+     -DCMAKE_BUILD_TYPE=Release
+  $> ninja install
+
+This will build and install the GPU C library along with all the other LLVM
+libraries.
+
+Build overview
+==============
+
+Once installed, the GPU build will create several files used for different
+targets. This section will briefly describe their purpose.
+
+**lib/<host-triple>/libcgpu-amdgpu.a or lib/libcgpu-amdgpu.a**
+  A static library containing fat binaries supporting AMD GPUs. These are built
+  using the support described in the `clang documentation
+  <https://clang.llvm.org/docs/OffloadingDesign.html>`_. These are intended to
+  be static libraries included natively for offloading languages like CUDA, HIP,
+  or OpenMP. This implements the standard C library.
+
+**lib/<host-triple>/libmgpu-amdgpu.a or lib/libmgpu-amdgpu.a**
+  A static library containing fat binaries that implement the standard math
+  library for AMD GPUs.
+
+**lib/<host-triple>/libcgpu-nvptx.a or lib/libcgpu-nvptx.a**
+  A static library containing fat binaries that implement the standard C library
+  for NVIDIA GPUs.
+
+**lib/<host-triple>/libmgpu-nvptx.a or lib/libmgpu-nvptx.a**
+  A static library containing fat binaries that implement the standard math
+  library for NVIDIA GPUs.
+
+**include/<target-triple>**
+  The include directory where all of the generated headers for the target will
+  go. These definitions are strictly for the GPU when being targeted directly.
+
+**lib/clang/<llvm-major-version>/include/llvm-libc-wrappers/llvm-libc-decls**
+  These are wrapper headers created for offloading languages like CUDA, HIP, or
+  OpenMP. They contain functions supported in the GPU libc along with attributes
+  and metadata that declare them on the target device and make them compatible
+  with the host headers.
+
+**lib/<target-triple>/libc.a**
+  The main C library static archive containing LLVM-IR targeting the given GPU.
+  It can be linked directly or inspected depending on the target support.
+
+**lib/<target-triple>/libm.a**
+  The C library static archive providing implementations of the standard math
+  functions.
+
+**lib/<target-triple>/libc.bc**
+  An alternate form of the library provided as a single LLVM-IR bitcode blob.
+  This can be used similarly to NVIDIA's or AMD's device libraries.
+
+**lib/<target-triple>/libm.bc**
+  An alternate form of the library provided as a single LLVM-IR bitcode blob
+  containing the standard math functions.
+
+**lib/<target-triple>/crt1.o**
+  An LLVM-IR file containing startup code to call the ``main`` function on the
+  GPU. This is used similarly to the standard C library startup object.
+
+**bin/amdhsa-loader**
+  A binary utility used to launch executables compiled targeting the AMD GPU.
+  This will be included if the build system found the ``hsa-runtime64`` library
+  either in ``/opt/rocm`` or the current CMake installation directory. See
+  the :ref:`libc GPU usage<libc_gpu_usage>` for more information.
+
+**bin/nvptx-loader**
+  A binary utility used to launch executables compiled targeting the NVIDIA GPU.
+  This will be included if the build system found the CUDA driver API.
+
+**include/llvm-libc-rpc-server.h**
+  A header file containing definitions that can be used to interface with the
+  :ref:`RPC server<libc_gpu_rpc>`.
+
+**lib/libllvmlibc_rpc_server.a**
+  The static library containing the implementation of the RPC server. This can
+  be used to enable host services for anyone looking to interface with the
+  :ref:`RPC client<libc_gpu_rpc>`.
+
+CMake options
+=============
+
+This section briefly lists a few of the CMake variables that specifically
+control the GPU build of the C library.
+
+**LLVM_LIBC_FULL_BUILD**:BOOL
+  This flag controls whether or not the libc build will generate its own
+  headers. This must always be on when targeting the GPU.
+
+**LIBC_GPU_TEST_ARCHITECTURE**:STRING
+  Sets the architecture to build the GPU tests for, such as ``gfx90a`` or
+  ``sm_80`` for AMD and NVIDIA GPUs respectively. The default behavior is to
+  detect the system's GPU architecture using the ``native`` option. If this
+  option is not set and a GPU was not detected, the tests will not be built.
+
+**LIBC_GPU_TEST_JOBS**:STRING
+  Sets the number of threads used to run GPU tests. The GPU test suite will
+  commonly run out of resources if this is not constrained, so it is recommended
+  to keep it low. The default value is a single thread.
+
+**LIBC_GPU_LOADER_EXECUTABLE**:STRING
+  Overrides the default loader used for running GPU tests. If this is not
+  provided the standard one will be built.
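+
+As an illustration, the testing options above can be combined with the usual
+GPU configuration (values here are examples only, added on top of the flags
+shown in the build recipes):
+
+.. code-block:: sh
+
+  $> cmake ../runtimes -G Ninja             \
+     <other flags as above>                 \
+     -DLIBC_GPU_TEST_ARCHITECTURE=gfx90a    \
+     -DLIBC_GPU_TEST_JOBS=1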
diff --git a/libc/docs/gpu/index.rst b/libc/docs/gpu/index.rst
index 2d765486665040..1fca67205acb4d 100644
--- a/libc/docs/gpu/index.rst
+++ b/libc/docs/gpu/index.rst
@@ -12,8 +12,9 @@ learn more about this project.
 
 .. toctree::
 
+   building
    using
    support
-   testing
    rpc
+   testing
    motivation
diff --git a/libc/docs/gpu/rpc.rst b/libc/docs/gpu/rpc.rst
index 7b0b35af4da88c..9d6d8099db951c 100644
--- a/libc/docs/gpu/rpc.rst
+++ b/libc/docs/gpu/rpc.rst
@@ -188,6 +188,8 @@ in the GPU executable as an indicator for whether or not the server can be
 checked. These details should ideally be handled by the GPU language runtime,
 but the following example shows how it can be used by a standard user.
 
+.. _libc_gpu_cuda_server:
+
 .. code-block:: cuda
 
   #include <cstdio>
diff --git a/libc/docs/gpu/using.rst b/libc/docs/gpu/using.rst
index 1a48c8a3bcba31..a24daac09a7536 100644
--- a/libc/docs/gpu/using.rst
+++ b/libc/docs/gpu/using.rst
@@ -1,6 +1,5 @@
 .. _libc_gpu_usage:
 
-
 ===================
 Using libc for GPUs
 ===================
@@ -9,54 +8,98 @@ Using libc for GPUs
   :depth: 4
   :local:
 
-Building the GPU library
-========================
+Using the GPU C library
+=======================
 
-LLVM's libc GPU support *must* be built with an up-to-date ``clang`` compiler
-due to heavy reliance on ``clang``'s GPU support. This can be done automatically
-using the LLVM runtimes support. The GPU build is done using cross-compilation
-to the GPU architecture. This project currently supports AMD and NVIDIA GPUs
-which can be targeted using the appropriate target name. The following
-invocation will enable a cross-compiling build for the GPU architecture and
-enable the ``libc`` project only for them.
+Once you have finished :ref:`building<libc_gpu_building>` the GPU C library, it
+can be used to run libc or libm functions directly on the GPU. Currently, not
+all C standard functions are supported on the GPU. Consult the :ref:`list of
+supported functions<libc_gpu_support>` for a comprehensive list.
 
-.. code-block:: sh
+The GPU C library supports two main usage modes. The first is as a supplementary
+library for offloading languages such as OpenMP, CUDA, or HIP. These aim to
+provide standard system utilities similarly to existing vendor libraries. The
+second method treats the GPU as a hosted target by compiling C or C++ for it
+directly. This is more similar to targeting OpenCL and is primarily used for
+testing.
+
+Offloading usage
+----------------
 
-  $> cd llvm-project  # The llvm-project checkout
-  $> mkdir build
-  $> cd build
-  $> cmake ../llvm -G Ninja                                               \
-     -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt"                       \
-     -DLLVM_ENABLE_RUNTIMES="openmp"                                      \
-     -DCMAKE_BUILD_TYPE=<Debug|Release>   \ # Select build type
-     -DCMAKE_INSTALL_PREFIX=<PATH>        \ # Where 'libcgpu.a' will live
-     -DRUNTIMES_nvptx64-nvidia-cuda_LLVM_ENABLE_RUNTIMES=libc             \
-     -DRUNTIMES_amdgcn-amd-amdhsa_LLVM_ENABLE_RUNTIMES=libc               \
-     -DLLVM_RUNTIME_TARGETS=default;amdgcn-amd-amdhsa;nvptx64-nvidia-cuda
-  $> ninja install
-
-Since we want to include ``clang``, ``lld`` and ``compiler-rt`` in our
-toolchain, we list them in ``LLVM_ENABLE_PROJECTS``. To ensure ``libc`` is built
-using a compatible compiler and to support ``openmp`` offloading, we list them
-in ``LLVM_ENABLE_RUNTIMES`` to build them after the enabled projects using the
-newly built compiler. ``CMAKE_INSTALL_PREFIX`` specifies the installation
-directory in which to install the ``libcgpu-nvptx.a`` and ``libcgpu-amdgpu.a``
-libraries and headers along with LLVM. The generated headers will be placed in
-``include/<gpu-triple>``.
-
-Usage
-=====
-
-Once the static archive has been built it can be linked directly
-with offloading applications as a standard library. This process is described in
+Offloading languages like CUDA, HIP, or OpenMP work by compiling a single source
+file for both the host target and a list of offloading devices. In order to
+support standard compilation flows, the ``clang`` driver uses fat binaries, described in
 the `clang documentation <https://clang.llvm.org/docs/OffloadingDesign.html>`_.
-This linking mode is used by the OpenMP toolchain, but is currently opt-in for
-the CUDA and HIP toolchains through the ``--offload-new-driver``` and
-``-fgpu-rdc`` flags. A typical usage will look this this:
+This linking mode is used by the OpenMP toolchain, but is currently opt-in
+for the CUDA and HIP toolchains through the ``--offload-new-driver`` and
+``-fgpu-rdc`` flags.
+
+The installation should contain a static library called ``libcgpu-amdgpu.a`` or
+``libcgpu-nvptx.a`` depending on which GPU architectures your build targeted.
+These contain fat binaries compatible with the offloading toolchain such that
+they can be used directly.
 
 .. code-block:: sh
 
-  $> clang foo.c -fopenmp --offload-arch=gfx90a -lcgpu
+  $> clang openmp.c -fopenmp --offload-arch=gfx90a -lcgpu-amdgpu
+  $> clang cuda.cu --offload-arch=sm_80 --offload-new-driver -fgpu-rdc -lcgpu-nvptx
+  $> clang hip.hip --offload-arch=gfx940 --offload-new-driver -fgpu-rdc -lcgpu-amdgpu
+
+This will automatically link in the needed function definitions if they were
+required by the user's application. Normally using the ``-fgpu-rdc`` option
+results in sub-par performance due to ABA linking. However, the offloading
+toolchain supports the ``-foffload-lto`` option to enable LTO on the target
+device.
+
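+For example, device LTO can be enabled on top of the CUDA invocation shown
+above (an illustrative sketch):
+
+.. code-block:: sh
+
+  $> clang cuda.cu --offload-arch=sm_80 --offload-new-driver -fgpu-rdc \
+       -foffload-lto -lcgpu-nvptx
+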
+Offloading languages require that functions present on the device be declared as
+such. This is done with the ``__device__`` keyword in CUDA and HIP or the
+``declare target`` pragma in OpenMP. This requires that the LLVM C library
+exposes its implemented functions to the compiler when it is used to build. We
+support this by providing wrapper headers in the compiler's resource directory.
+These are located in ``<clang-resource-dir>/include/llvm-libc-wrappers`` in your
+installation.
+
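+The location of these headers can be checked quickly using ``clang``'s resource
+directory query (an optional sanity check):
+
+.. code-block:: sh
+
+  $> ls $(clang -print-resource-dir)/include/llvm-libc-wrappers
+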
+The support for HIP and CUDA is more experimental, requiring manual intervention
+to link and use the facilities. An example of this is shown in the :ref:`CUDA
+server example<libc_gpu_cuda_server>`. The OpenMP offloading toolchain, however,
+is completely integrated with the LLVM C library. It will automatically include
+the necessary libraries, define the device-side interfaces, and run the RPC
+server.
+
+OpenMP Offloading example
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+This section provides a simple example of compiling an OpenMP program with the
+GPU C library.
+
+.. code-block:: c++
+
+  #include <stdio.h>
+
+  int main() {
+    FILE *file = stderr;
+  #pragma omp target teams num_teams(2) thread_limit(2)
+  #pragma omp parallel num_threads(2)
+    { fputs("Hello from OpenMP!\n", file); }
+  }
+
+This can simply be compiled like any other OpenMP application to print from two
+threads and two blocks.
+
+.. code-block:: sh
+
+  $> clang openmp.c -fopenmp --offload-arch=gfx90a
+  $> ./a.out
+  Hello from OpenMP!
+  Hello from OpenMP!
+  Hello from OpenMP!
+  Hello from OpenMP!
+
+Including the wrapper headers, linking the C library, and running the :ref:`RPC
+server<libc_gpu_rpc>` are all handled automatically by the compiler and runtime.
+
+Binary format
+^^^^^^^^^^^^^
 
 The ``libcgpu.a`` static archive is a fat-binary containing LLVM-IR for each
 supported target device. The supported architectures can be seen using LLVM's
@@ -64,8 +107,8 @@ supported target device. The supported architectures can be seen using LLVM's
 
 .. code-block:: sh
 
-  $> llvm-objdump --offloading libcgpu.a
-  libcgpu.a(strcmp.cpp.o):    file format elf64-x86-64
+  $> llvm-objdump --offloading libcgpu-amdgpu.a
+  libcgpu-amdgpu.a(strcmp.cpp.o):    file format elf64-x86-64
 
   OFFLOADING IMAGE [0]:
   kind            llvm ir
@@ -78,11 +121,120 @@ inspect the resulting code. This can be done using the following utilities:
 
 .. code-block:: sh
 
-   $> llvm-ar x libcgpu.a strcmp.cpp.o
-   $> clang-offload-packager strcmp.cpp.o --image=arch=gfx90a,file=gfx90a.bc
-   $> opt -S out.bc
-   ...
+  $> llvm-ar x libcgpu-amdgpu.a strcmp.cpp.o
+  $> clang-offload-packager strcmp.cpp.o --image=arch=generic,file=out.bc
+  $> opt -S out.bc
+  ...
 
 Please note that this fat binary format is provided for compatibility with
 existing offloading toolchains. The implementation in ``libc`` does not depend
 on any existing offloading languages and is completely freestanding.
+
+Direct compilation
+------------------
+
+Instead of using standard offloading languages, we can also target the GPU
+directly using C and C++ to create a GPU executable similarly to OpenCL. This is
+done by targeting the GPU architecture directly using `clang's cross compilation
+support <https://clang.llvm.org/docs/CrossCompilation.html>`_. This is the
+method that the GPU C library uses both to provide its definitions and to run
+tests.
+
+This allows us to easily define GPU specific libraries and programs that fit
+well into existing tools. In order to target the GPU effectively we rely heavily
+on the compiler's intrinsic and built-in functions. For example, the following
+function gets the thread identifier in the 'x' dimension on both GPUs.
+
+.. code-block:: c++
+
+  #include <stdint.h>
+
+  uint32_t get_thread_id_x() {
+  #if defined(__AMDGPU__)
+    return __builtin_amdgcn_workitem_id_x();
+  #elif defined(__NVPTX__)
+    return __nvvm_read_ptx_sreg_tid_x();
+  #else
+  #error "Unsupported platform"
+  #endif
+  }
+
+We can then compile this for both NVPTX and AMDGPU into LLVM-IR using the
+following commands.
+
+.. code-block:: sh
+
+  $> clang id.c --target=amdgcn-amd-amdhsa -mcpu=native -nogpulib -flto -c
+  $> clang id.c --target=nvptx64-nvidia-cuda -march=native -nogpulib -flto -c
+
+This support allows us to treat the GPU as a hosted environment by providing a C
+library and startup object just like a standard C library running on the host
+machine. Then, in order to execute these images we provide a loader utility to
+launch the executable on the GPU similar to a cross-compiling emulator.
+
+Building for AMDGPU targets
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The AMDGPU target supports several features natively by virtue of using ``lld``
+as its linker. The installation will include the ``include/amdgcn-amd-amdhsa``
+and ``lib/amdgcn-amd-amdhsa`` directories that contain the necessary code to use
+the library. We can directly link against ``libc.a`` and use LTO to generate the
+final executable. The ``crt1.o`` object contains the kernels necessary to launch
+the ``main`` function.
+
+.. code-block:: c++
+
+  #include <stdio.h>
+
+  int main() { fputs("Hello from AMDGPU!\n", stdout); }
+
+This program can be compiled with the C library support. Note that ``-flto``
+and ``-mcpu=`` should be passed. This is because the GPU subarchitectures do
+not have strict backwards compatibility. Use ``-mcpu=help`` for accepted
+arguments or ``-mcpu=native`` to target the system's installed GPUs if present.
+Once built, we use the ``amdhsa-loader`` utility to launch execution on the GPU.
+This will be built if the ``hsa-runtime64`` library was found during build time.
+
+.. code-block:: sh
+
+  $> clang hello.c --target=amdgcn-amd-amdhsa -mcpu=native -flto -lc <install>/lib/amdgcn-amd-amdhsa/crt1.o
+  $> amdhsa-loader --threads 2 --blocks 2 a.out
+  Hello from AMDGPU!
+  Hello from AMDGPU!
+  Hello from AMDGPU!
+  Hello from AMDGPU!
+
+The search paths for the include directories and libraries are automatically
+handled by the compiler. We use this support internally to run unit tests on the
+GPU directly. See :ref:`libc_gpu_testing` for more information. The installation
+also provides ``libc.bc`` which is a single LLVM-IR bitcode blob that can be
+used instead of the static library.
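+
+As with the NVPTX example below, the bitcode version can be passed to the
+compiler directly instead of linking ``libc.a`` (a sketch, assuming the same
+installation layout):
+
+.. code-block:: sh
+
+  $> clang hello.c --target=amdgcn-amd-amdhsa -mcpu=native \
+       -x ir <install>/lib/amdgcn-amd-amdhsa/libc.bc \
+       -x ir <install>/lib/amdgcn-amd-amdhsa/crt1.o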
+
+Building for NVPTX targets
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The infrastructure is the same as the AMDGPU example. However, the NVPTX binary
+utilities are very limited and must be targeted directly. There is no linker
+support for static libraries so we need to link in the ``libc.bc`` bitcode and
+inform the compiler driver of the file's contents.
+
+.. code-block:: c++
+
+  #include <stdio.h>
+
+  int main(int argc, char **argv, char **envp) {
+    fputs("Hello from NVPTX!\n", stdout);
+  }
+
+Additionally, the NVPTX ABI requires that every function signature matches. This
+requires us to pass the full prototype of ``main``. The installation will
+contain the ``nvptx-loader`` utility if the CUDA driver was found during
+compilation.
+
+.. code-block:: sh
+
+  $> clang hello.c --target=nvptx64-nvidia-cuda -march=native \
+       -x ir <install>/lib/nvptx64-nvidia-cuda/libc.bc \
+       -x ir <install>/lib/nvptx64-nvidia-cuda/crt1.o
+  $> nvptx-loader --threads 2 --blocks 2 a.out
+  Hello from NVPTX!
+  Hello from NVPTX!
+  Hello from NVPTX!
+  Hello from NVPTX!



More information about the libc-commits mailing list