[llvm-dev] RFC: Proposing an LLVM subproject for parallelism runtime and support libraries

Jason Henline via llvm-dev llvm-dev at lists.llvm.org
Wed Mar 9 14:20:15 PST 2016

At Google we're doing a lot of work on parallel programming models for
CPUs, GPUs and other platforms. One place where we're investing a lot are
parallel libraries, especially those closely tied to compiler technology
like runtime and math libraries. We would like to develop these in the
open, and the natural place seems to be as a subproject in LLVM if others
in the community are interested.

Initially, we'd like to open source our StreamExecutor runtime library,
which is used for simplifying the management of data-parallel workflows on
accelerator devices and can also be extended to support other hardware
platforms. We'd like to teach Clang to use StreamExecutor when targeting
CUDA and work on other integrations, but that makes much more sense if it
is part of the LLVM project.

However, we think the LLVM subproject should be organized as a set of
several libraries with StreamExecutor as just the first instance. As just
one example of how creating a unified parallelism subproject could help
with code sharing, the StreamExecutor library contains some nice wrappers
around the CUDA driver API and OpenCL API that create a unified API for
managing all kinds of GPU devices. This unified GPU wrapper would be
broadly applicable for libraries that need to communicate with GPU devices.

Of course, there is already an LLVM subproject for a parallel runtime
library: OpenMP! So there is a question of how it would fit into this
picture.  Eventually, it might make sense to pull in the OpenMP project as
a library in this proposed new subproject. In particular, there is a good
chance that OpenMP and StreamExecutor could share code for offloading to
GPUs and managing workloads on those devices. This is discussed at the end
of the StreamExecutor documentation below. However, if it turns out that
the needs of OpenMP are too specialized to fit well in a generic
parallelism project, then it may make sense to leave OpenMP as a separate
LLVM subproject so it can focus on serving the particular needs of OpenMP.

Documentation for the StreamExecutor library that is being proposed for
open-sourcing is included below to give a sense of what it is, in order to
give context for how it might fit into a general parallelism LLVM

What do folks think? Is there general interest in something like this? If
so, we can start working on getting a project in place and sketching out a
skeleton for how it would be organized, as well as contributing
StreamExecutor to it. We're happy to iterate on the particulars to figure
out what works for the community.

StreamExecutor Runtime Library Documentation

What is StreamExecutor?

**StreamExecutor** is a unified wrapper around the **CUDA** and **OpenCL**
host-side programming models (runtimes). It lets host code target either
CUDA or OpenCL devices with identically-functioning data-parallel kernels.
StreamExecutor manages the execution of concurrent work targeting the
accelerator similarly to how an Executor_ from the Google APIs client
library manages the execution of concurrent work on the host.

.. _Executor:

StreamExecutor is currently used as the runtime for the vast majority of
Google's internal GPGPU applications, and a snapshot of it is included in
the open-source TensorFlow_ project, where it serves as the GPGPU runtime.

.. _TensorFlow: https://www.tensorflow.org

It is currently proposed that StreamExecutor itself be independently
open-sourced. As part of that proposal, this document describes the basics
of its design and explains why it would fit in well as an LLVM subproject.

Key points


* abstracts the underlying accelerator platform (avoids locking you into a
single vendor, and lets you write code without thinking about which
platform you'll be running on).
* provides an open-source alternative to the CUDA runtime library.
* gives users a stream management model whose terminology matches that of
the CUDA programming model.
* makes use of modern C++ to create a safe, efficient, easy-to-use
programming interface.

StreamExecutor makes it easy to:

* move data between host and accelerator (and also between peer
* execute data-parallel kernels written in the OpenCL or CUDA kernel
* inspect the capabilities of a GPU-like device at runtime.
* manage multiple devices.

Example code snippet

The StreamExecutor API uses abstractions that will be familiar to those who
have worked with other GPU APIs: **Streams**, **Timers**, and **Kernels**.
Its API is *fluent*, meaning that it allows the user to chain together a
sequence of related operations on a stream, as in the following code

.. code-block:: c++

  se::Stream stream(executor);
  se::Timer timer(executor);
      .ThenLaunch(se::ThreadDim(dim_block_x, dim_block_y),
                  se::BlockDim(dim_grid_x, dim_grid_y),
                  arg0, arg1, arg2)

The name of the kernel being launched in the snippet above is `my_kernel`
and the arguments being passed to the kernel are `arg0`, `arg1`, and
`arg2`. Kernels with any number of arguments of any types are supported,
and the number and types of the arguments is checked at compile time.

How does it work?

Detailed example

The following example shows how we can use StreamExecutor to create a
`TypedKernel` instance, associate device code with that instance, and then
use that instance to schedule work on an accelerator device.

.. code-block:: c++

    #include <cassert>

    #include "stream_executor.h"

    namespace se = streamexecutor;

    // A PTX string defining a CUDA kernel.
    // This PTX string represents a kernel that takes two arguments: an
input value
    // and an output pointer. The input value is a floating point number.
The output
    // value is a pointer to a floating point value in device memory. The
    // pointer is where the output from the kernel will be written.
    // The kernel adds a fixed floating point value to the input and writes
    // result to the output location.
    static constexpr const char *KERNEL_PTX = R"(
        .version 3.1
        .target sm_20
        .address_size 64
        .visible .entry add_mystery_value(
            .param .f32 float_literal,
            .param .u64 result_loc
            ) {
          .reg .u64 %rl<2>;
          .reg .f32 %f<2>;
          ld.param.f32 %f1, [float_literal];
          ld.param.u64 %rl1, [result_loc];
          add.f32 %f1, %f1, 123.0;
          st.f32 [%rl1], %f1;

    // The number of arguments expected by the kernel described in
    static constexpr int KERNEL_ARITY = 2;

    // The name of the kernel described in KERNEL_PTX.
    static constexpr const char *KERNEL_NAME = "add_mystery_value";

    // The value added to the input in the kernel described in KERNEL_PTX.
    static constexpr float MYSTERY_VALUE = 123.0f;

    int main(int argc, char *argv[]) {
      // Get a CUDA Platform object. (Other platforms such as OpenCL are
      // supported.)
      se::Platform *platform =

      // Get a StreamExecutor for the chosen Platform. Multiple devices are
      // supported, we indicate here that we want to run on device 0.
      const int device_ordinal = 0;
      se::StreamExecutor *executor =

      // Create a MultiKernelLoaderSpec, which knows where to find the code
for our
      // kernel. In this case, the code is stored in memory as a PTX string.
      // Note that the "arity" and name specified here must match  "arity"
and name
      // of the kernel defined in the PTX string.
      se::MultiKernelLoaderSpec kernel_loader_spec(KERNEL_ARITY);
      kernel_loader_spec.AddCudaPtxInMemory(KERNEL_PTX, KERNEL_NAME);

      // Next create a kernel handle, which we will associate with our
kernel code
      // (i.e., the PTX string).  The type of this handle is a bit verbose,
so we
      // create an alias for it.
      // This specific type represents a kernel that takes two arguments: a
      // point value and a pointer to a floating point value in device
      // A type like this is nice to have because it enables static type
checking of
      // kernel arguments when we enqueue work on a stream.
      using KernelType = se::TypedKernel<float, se::DeviceMemory<float> *>;

      // Now instantiate an object of the specific kernel type we declared
      // The kernel object is not yet connected with the device code that
we want it
      // to run (that happens with the call to GetKernel below), so it
cannot be
      // used to execute work on the device yet.
      // However, the kernel object is not completely empty when it is
created. From
      // the StreamExecutor passed into its constructor it knows which
platform it
      // is targeted for, and it also knows which device it will run on.
      KernelType kernel(executor);

      // Use the MultiKernelLoaderSpec defined above to load the kernel
code onto
      // the device pointed to by the kernel object and to make that kernel
object a
      // handle to the kernel code loaded on that device.
      // The MultiKernelLoaderSpec may contain code for several different
      // but the kernel object has an associated platform, so there is no
      // about which code should be loaded.
      // After this call the kernel object can be used to launch its kernel
on its
      // device.
      executor->GetKernel(kernel_loader_spec, &kernel);

      // Allocate memory in the device memory space to hold the result of
the kernel
      // call. This memory will be freed when this object goes out of scope.
      se::ScopedDeviceMemory<float> result =

      // Create a stream on which to schedule device operations.
      se::Stream stream(executor);

      // Schedule a kernel launch on the new stream and block until the
      // completes. The kernel call executes asynchronously on the device,
so we
      // could do more work on the host before calling BlockHostUntilDone.
      const float kernel_input_argument = 42.5f;
          .ThenLaunch(se::ThreadDim(), se::BlockDim(), kernel,
                      kernel_input_argument, result.ptr())

      // Copy the result of the kernel call from device back to the host.
      float host_result = 0.0f;
      executor->SynchronousMemcpyD2H(result.cref(), sizeof(host_result),

      // Verify that the correct result was computed.
      assert((kernel_input_argument + MYSTERY_VALUE) == host_result);

Kernel Loader Specs

An instance of the class `MultiKernelLoaderSpec` is used to encapsulate
knowledge of where the device code for a kernel is stored and what format
it is in.  Given a `MultiKernelLoaderSpec` and an uninitialized
`TypedKernel`, calling the `StreamExecutor::GetKernel` method will load the
code onto the device and associate the `TypedKernel` instance with that
loaded code. So, in order to initialize a `TypedKernel` instance, it is
first necessary to create a `MultiKernelLoaderSpec`.

A `MultiKernelLoaderSpec` supports a different method for adding device code
for each combination of platform, format, and storage location. The
table shows some examples:

===========     =======         ===========     =========================
Platform        Format          Location        Setter
===========     =======         ===========     =========================
CUDA            PTX             disk            `AddCudaPtxOnDisk`
CUDA            PTX             memory          `AddCudaPtxInMemory`
CUDA            cubin           disk            `AddCudaCubinOnDisk`
CUDA            cubin           memory          `AddCudaCubinInMemory`
OpenCL          text            disk            `AddOpenCLTextOnDisk`
OpenCL          text            memory          `AddOpenCLTextInMemory`
OpenCL          binary          disk            `AddOpenCLBinaryOnDisk`
OpenCL          binary          memory          `AddOpenCLBinaryInMemory`
===========     =======         ===========     =========================

The specific method used in the example is `AddCudaPtxInMemory`, but all
other methods are used similarly.

Compiler Support for StreamExecutor

General strategies

For illustrative purposes, the PTX code in the example is written by hand
and appears as a string literal in the source code file, but it is far more
typical for the kernel code to be expressed in a high level language like
CUDA C++ or OpenCL C and for the device machine code to be generated by a

There are several ways we can load compiled device code using

One possibility is that the build system could write the compiled device
code to a file on disk. This can then be added to a `MultiKernelLoaderSpec`
by using one of the `OnDisk` setters.

Another option is to add a feature to the compiler which embeds the
compiled device code into the host executable and provides some symbol
(probably with a name based on the name of the kernel) that allows the host
code to refer to the embedded code data.

In fact, as discussed below, in the current use of StreamExecutor inside
Google, the compiler goes even further and generates an instance of
`MultiKernelLoaderSpec` for each kernel. This means the application author
doesn't have to know anything about how or where the compiler decided to
store the compiled device code, but instead gets a pre-made loader object
that handles all those details.

Compiler-generated code makes things safe

Two of the steps in the example above are dangerous because they lack
static safety checks: instantiating the `MultiKernelLoaderSpec` and
specializing the `TypedKernel` class template. This section discusses how
compiler support for StreamExecutor can make these steps safe.

Instantiating a `MultiKernelLoaderSpec` requires specifying a three things:

1. the kernel *arity* (number of parameters),
2. the kernel name,
3. a string containing the device machine code for the kernel (either as
assembly, or some sort of object file).

The problem with this is that the kernel name and the number of parameters
is already fully determined by the kernel's machine code. In the best case
scenario the *arity* and name arguments passed to the
`MultiKernelLoaderSpec` methods match the information in the machine code
and are simply redundant, but in the worst case these arguments contradict
the information in the machine code and we get a runtime error when we try
to load the kernel..

The second unsafe operation is specifying the kernel parameter types as
type arguments to the `TypedKernel` class template. The specified types
must match the types defined in the kernel machine code, but again there is
no compile-time checking that these types match. Failure to match these
types will result in a runtime error when the kernel is launched.

We would like the compiler to perform these checks for the application
author, so as to eliminate this source of runtime errors. In particular, we
want the compiler to create an appropriate `MultiKernelLoaderSpec` instance
and `TypedKernel` specialization for each kernel definition.

One of the main goals of open-sourcing StreamExecutor is to let us add this
code generation capability to Clang, when the user has chosen to use
StreamExecutor as their runtime for accelerator operations.

Google has been using an internally developed CUDA compiler based on Clang
called **gpucc** that generates code for StreamExecutor in this way.  The
code below shows how the example above would be written using gpucc to
generate the unsafe parts of the code.

The kernel is defined in a high-level language (CUDA C++ in this example)
in its own file:

.. code-block:: c++

    // File: add_mystery_value.cu

    __global__ void add_mystery_value(float input, float *output) {
      *output = input + 42.0f;

    The host code is defined in another file:

    .. code-block:: c++

    // File: example_host_code.cc

    #include <cassert>

    #include "stream_executor.h"

    // This header is generated by the gpucc compiler and it contains the
    // definitions of gpucc::kernel::AddMysteryValue and
    // gpucc::spec::add_mystery_value().
    // The name of this header file is derived from the name of the file
    // the kernel code. The trailing ".cu" is replaced with ".gpu.h".
    #include "add_mystery_value.gpu.h"

    namespace se = streamexecutor;

    int main(int argc, char *argv[]) {
      se::Platform *platform =

      const int device_ordinal = 0;
      se::StreamExecutor *executor =

      // AddMysteryValue is an instance of TypedKernel generated by gpucc.
      // template arguments are chosen by the compiler to match the
parameters of
      // the add_mystery_value kernel.
      gpucc::kernel::AddMysteryValue kernel(executor);

      // gpucc::spec::add_mystery_value() is generated by gpucc. It returns
      // MultiKernelLoaderSpec that knows how to find  the compiled code
for the
      // add_mystery_value kernel.
      executor->GetKernel(gpucc::spec::add_mystery_value(), &kernel);

      se::ScopedDeviceMemory<float> result =
      se::Stream stream(executor);

      const float kernel_input_argument = 42.5f;

          .ThenLaunch(se::ThreadDim(), se::BlockDim(), kernel,
                      kernel_input_argument, result.ptr())

      float host_result = 0.0f;
      executor->SynchronousMemcpyD2H(result.cref(), sizeof(host_result),

      assert((kernel_input_argument + 42.0f) == host_result);

This support from the compiler makes the use of StreamExecutor safe and

Compiler support for triple angle bracket kernel launches

For even greater ease of use, Google's gpucc CUDA compiler also supports an
integrated mode that looks like NVIDIA's `CUDA programming model`_,which
uses triple angle brackets (`<<<>>>`) to launch kernels.

.. _CUDA programming model:

.. code-block:: c++
    :emphasize-lines: 22

    #include <cassert>

    #include "stream_executor.h"

    namespace se = streamexecutor;

    __global__ void add_mystery_value(float input, float *output) {
      *output = input + 42.0f;

    int main(int argc, char *argv[]) {
      se::Platform *platform =

      const int device_ordinal = 0;
      se::StreamExecutor *executor =

      se::ScopedDeviceMemory<float> result =

      const float kernel_input_argument = 42.5f;
      add_mystery_value<<<1, 1>>>(kernel_input_argument, *result.ptr());

      float host_result = 0.0f;
      executor->SynchronousMemcpyD2H(result.cref(), sizeof(host_result),

      assert((kernel_input_argument + 42.0f) == host_result);

Under the hood, gpucc converts the triple angle bracket kernel call into a
series of calls to the StreamExecutor library similar to the calls seen in
the previous examples.

Clang currently supports the triple angle bracket kernel call syntax for
CUDA compilation by replacing a triple angle bracket call with calls to the
NVIDIA CUDA runtime library, but it would be easy to add a compiler flag to
tell Clang to emit calls to the StreamExecutor library instead. There are
several benefits to supporting this mode of compilation in Clang:

.. _benefits-of-streamexecutor:

* StreamExecutor is a high-level, modern C++ API, so is easier to use and
less prone to error than the NVIDIA CUDA runtime and the OpenCL runtime.
* StreamExecutor will be open-source software, so GPU code will not have to
depend on opaque binary blobs like the NVIDIA CUDA runtime library.
* Using StreamExecutor as the runtime would allow for easy extension of the
triple angle bracket kernel launch syntax to support different accelerator
programming models.

Supporting other platforms

StreamExecutor currently supports CUDA and OpenCL platforms out-of-the-box,
but it uses a platform plugin architecture that makes it easy to add new
platforms at any time. The CUDA and OpenCL platforms are both implemented
as platform plugins in this way, so they serve as good examples for future
platform developers of how to write these kinds of plugins.

Canned operations

StreamExecutor provides several predefined kernels for common data-parallel
operations. The supported classes of operations are:

* BLAS: basic linear algebra subprograms,
* DNN: deep neural networks,
* FFT: fast Fourier transforms, and
* RNG: random number generation.

Here is an example of using a canned operation to perform random number

.. code-block:: c++
    :emphasize-lines: 12-13,17,34-35

    #include <array>

    #include "cuda/cuda_rng.h"
    #include "stream_executor.h"

    namespace se = streamexecutor;

    int main(int argc, char *argv[]) {
      se::Platform *platform =

      se::PluginConfig plugin_config;

      const int device_ordinal = 0;
      se::StreamExecutor *executor =

      const uint8 seed[] = {0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7,
                            0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf};
      constexpr uint64 random_element_count = 1024;

      using HostArray = std::array<float, random_element_count>;

      HostArray host_memory;
      const size_t data_size = host_memory.size() *

      se::ScopedDeviceMemory<float> device_memory =

      se::Stream stream(executor);
          .ThenSetRngSeed(seed, sizeof(seed))

      executor->SynchronousMemcpyD2H(*device_memory.ptr(), data_size,

Each platform plugin can define its own canned operation plugins for these
operations or choose to leave any of them unimplemented.

Contrast with OpenMP

Recent versions of OpenMP also provide a high-level, easy-to-use interface
for running data-parallel workloads on an accelerator device. One big
difference between OpenMP's approach and that of StreamExecutor is that
OpenMP generates both the kernel code that runs on the device and the
host-side code needed to launch the kernel, whereas StreamExecutor only
generates the host-side code. While the OpenMP model provides the
convenience of allowing the author to write their kernel code in standard
C/C++, the StreamExecutor model allows for the use of any kernel language
(e.g. CUDA C++ or OpenCL C). This lets authors use  platform-specific
features that are only present in platform-specific kernel definition

The philosophy of StreamExecutor is that performance is critical on the
device, but less so on the host.  As a result, no attempt is made to use a
high-level device abstraction during device code generation. Instead, the
high-level abstraction provided by StreamExecutor is used only for the
host-side code that moves data and launches kernels.  This host-side work
is tedious and is not performance critical, so it benefits from being
wrapped in a high-level library that can support a wide range of platforms
in an easily extensible manner.

Cooperation with OpenMP

The Clang OpenMP community is currently in the process of `designing their
implementation`_ of offloading support. They will want the compiler to
convert the various standardized target-oriented OpenMP pragmas into device
code to execute on an accelerator and host code to load and run that device
code. StreamExecutor may provide a convenient API for OpenMP to use to
generate their host-side code.

.. _designing their implementation:

In addition to the :ref:`benefits<benefits-of-streamexecutor>` that all
users of StreamExecutor enjoy over the alternative host-side runtime
libraries, OpenMP and StreamExecutor may mutually benefit by sharing work
to support new platforms. If OpenMP makes use of StreamExecutor, then it
should be simple for OpenMP to add support for any new platforms that
StreamExecutor supports in the future. Similarly, for any platforms OpenMP
would like to target, they may add that support in StreamExecutor and take
advantage of the knowledge of platform support in the StreamExecutor
community. The resulting new platform support would then be available not
just within OpenMP, but also to any user of StreamExecutor.

Although OpenMP and StreamExecutor support different programming models,
some of the work they perform under the hood will likely be very similar.
By sharing code and domain expertise, both projects will be improved and
strengthened as their capabilities are expanded. The StreamExecutor
community looks forward to much collaboration and discussion with OpenMP
about the best places and ways to cooperate.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160309/6d7edf69/attachment-0001.html>

More information about the llvm-dev mailing list