[cfe-dev] RFC: Proposing an LLVM subproject for parallelism runtime and support libraries

Chandler Carruth via cfe-dev cfe-dev at lists.llvm.org
Wed Mar 9 14:22:56 PST 2016


It's all good. It's also good for the Clang folks to be aware of the discussion
and context so they know to follow along on the llvm-dev thread. It might also
be good to drop a line to openmp-dev that this discussion is taking place.

On Wed, Mar 9, 2016 at 11:21 PM Jason Henline <jhen at google.com> wrote:

> Thanks for the heads-up Chandler. I've moved this thread to llvm-dev.
> Let's consider this thread closed and move the discussion there.
>
> On Wed, Mar 9, 2016 at 2:12 PM Chandler Carruth <chandlerc at google.com>
> wrote:
>
>> FWIW, LLVM sub-project stuff probably is best discussed on llvm-dev.
>> Maybe re-send there?
>>
>> On Wed, Mar 9, 2016 at 10:30 PM Jason Henline via cfe-dev <
>> cfe-dev at lists.llvm.org> wrote:
>>
>>> At Google we're doing a lot of work on parallel programming models for
>>> CPUs, GPUs, and other platforms. One place where we're investing a lot is
>>> parallel libraries, especially those closely tied to compiler technology
>>> like runtime and math libraries. We would like to develop these in the
>>> open, and the natural place seems to be as a subproject in LLVM if others
>>> in the community are interested.
>>>
>>> Initially, we'd like to open source our StreamExecutor runtime library,
>>> which is used for simplifying the management of data-parallel workflows on
>>> accelerator devices and can also be extended to support other hardware
>>> platforms. We'd like to teach Clang to use StreamExecutor when targeting
>>> CUDA and work on other integrations, but that makes much more sense if it
>>> is part of the LLVM project.
>>>
>>> However, we think the LLVM subproject should be organized as a set of
>>> several libraries with StreamExecutor as the first instance. As just
>>> one example of how creating a unified parallelism subproject could help
>>> with code sharing, the StreamExecutor library contains some nice wrappers
>>> around the CUDA driver API and OpenCL API that create a unified API for
>>> managing all kinds of GPU devices. This unified GPU wrapper would be
>>> broadly applicable for libraries that need to communicate with GPU devices.
>>>
>>> Of course, there is already an LLVM subproject for a parallel runtime
>>> library: OpenMP! So there is a question of how it would fit into this
>>> picture.  Eventually, it might make sense to pull in the OpenMP project as
>>> a library in this proposed new subproject. In particular, there is a good
>>> chance that OpenMP and StreamExecutor could share code for offloading to
>>> GPUs and managing workloads on those devices. This is discussed at the end
>>> of the StreamExecutor documentation below. However, if it turns out that
>>> the needs of OpenMP are too specialized to fit well in a generic
>>> parallelism project, then it may make sense to leave OpenMP as a separate
>>> LLVM subproject so it can focus on serving the particular needs of OpenMP.
>>>
>>> Documentation for the StreamExecutor library that is being proposed for
>>> open-sourcing is included below to give a sense of what it is, in order to
>>> give context for how it might fit into a general parallelism LLVM
>>> subproject.
>>>
>>> What do folks think? Is there general interest in something like this?
>>> If so, we can start working on getting a project in place and sketching out
>>> a skeleton for how it would be organized, as well as contributing
>>> StreamExecutor to it. We're happy to iterate on the particulars to figure
>>> out what works for the community.
>>>
>>>
>>> =============================================
>>> StreamExecutor Runtime Library Documentation
>>> =============================================
>>>
>>>
>>> What is StreamExecutor?
>>> ========================
>>>
>>> **StreamExecutor** is a unified wrapper around the **CUDA** and
>>> **OpenCL** host-side programming models (runtimes). It lets host code
>>> target either CUDA or OpenCL devices with identically-functioning
>>> data-parallel kernels. StreamExecutor manages the execution of concurrent
>>> work targeting the accelerator similarly to how an Executor_ from the
>>> Google APIs client library manages the execution of concurrent work on the
>>> host.
>>>
>>> .. _Executor:
>>> http://google.github.io/google-api-cpp-client/latest/doxygen/classgoogleapis_1_1thread_1_1Executor.html
>>>
>>> StreamExecutor is currently used as the runtime for the vast majority of
>>> Google's internal GPGPU applications, and a snapshot of it is included in
>>> the open-source TensorFlow_ project, where it serves as the GPGPU runtime.
>>>
>>> .. _TensorFlow: https://www.tensorflow.org
>>>
>>> It is currently proposed that StreamExecutor itself be independently
>>> open-sourced. As part of that proposal, this document describes the basics
>>> of its design and explains why it would fit in well as an LLVM subproject.
>>>
>>>
>>> -------------------
>>> Key points
>>> -------------------
>>>
>>> StreamExecutor:
>>>
>>> * abstracts the underlying accelerator platform (avoids locking you into
>>> a single vendor, and lets you write code without thinking about which
>>> platform you'll be running on).
>>> * provides an open-source alternative to the CUDA runtime library.
>>> * gives users a stream management model whose terminology matches that
>>> of the CUDA programming model.
>>> * makes use of modern C++ to create a safe, efficient, easy-to-use
>>> programming interface.
>>>
>>> StreamExecutor makes it easy to:
>>>
>>> * move data between host and accelerator (and also between peer
>>> accelerators); see the sketch after this list.
>>> * execute data-parallel kernels written in the OpenCL or CUDA kernel
>>> languages.
>>> * inspect the capabilities of a GPU-like device at runtime.
>>> * manage multiple devices.
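>>>
>>> As a small illustration of the first point, here is a sketch of a
>>> host-to-device round trip. It uses the `AllocateOwnedArray` and
>>> `SynchronousMemcpyD2H` calls that appear in the examples below, and it
>>> assumes a host-to-device counterpart (written here as
>>> `SynchronousMemcpyH2D`), so treat it as a sketch rather than a precise API
>>> reference.
>>>
>>> .. code-block:: c++
>>>
>>>   #include <array>
>>>
>>>   #include "stream_executor.h"
>>>
>>>   namespace se = streamexecutor;
>>>
>>>   void RoundTrip(se::StreamExecutor *executor) {
>>>     std::array<float, 16> host_input;
>>>     host_input.fill(1.0f);
>>>
>>>     // Device memory that is freed automatically when the owning object
>>>     // goes out of scope.
>>>     se::ScopedDeviceMemory<float> device_data =
>>>         executor->AllocateOwnedArray<float>(host_input.size());
>>>
>>>     // Copy host -> device (assumed counterpart of SynchronousMemcpyD2H).
>>>     executor->SynchronousMemcpyH2D(host_input.data(),
>>>                                    sizeof(float) * host_input.size(),
>>>                                    device_data.ptr());
>>>
>>>     // ... launch kernels that read or write device_data here ...
>>>
>>>     // Copy device -> host, as in the detailed example below.
>>>     std::array<float, 16> host_output;
>>>     executor->SynchronousMemcpyD2H(device_data.cref(),
>>>                                    sizeof(float) * host_output.size(),
>>>                                    host_output.data());
>>>   }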
>>>
>>>
>>> --------------------------------
>>> Example code snippet
>>> --------------------------------
>>>
>>> The StreamExecutor API uses abstractions that will be familiar to those
>>> who have worked with other GPU APIs: **Streams**, **Timers**, and
>>> **Kernels**. Its API is *fluent*, meaning that it allows the user to chain
>>> together a sequence of related operations on a stream, as in the following
>>> code snippet:
>>>
>>> .. code-block:: c++
>>>
>>>   se::Stream stream(executor);
>>>   se::Timer timer(executor);
>>>   stream.InitWithTimer(&timer)
>>>       .ThenStartTimer(&timer)
>>>       .ThenLaunch(se::ThreadDim(dim_block_x, dim_block_y),
>>>                   se::BlockDim(dim_grid_x, dim_grid_y),
>>>                   my_kernel,
>>>                   arg0, arg1, arg2)
>>>       .ThenStopTimer(&timer)
>>>       .BlockHostUntilDone();
>>>
>>> The name of the kernel being launched in the snippet above is
>>> `my_kernel` and the arguments being passed to the kernel are `arg0`,
>>> `arg1`, and `arg2`. Kernels with any number of arguments of any types are
>>> supported, and the number and types of the arguments are checked at compile
>>> time.
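>>>
>>> For concreteness, `my_kernel` in the snippet above would be a `TypedKernel`
>>> object (a class template described in the detailed example below); the
>>> parameter types used here are purely illustrative:
>>>
>>> .. code-block:: c++
>>>
>>>   // A hypothetical declaration of my_kernel. The template arguments
>>>   // describe the kernel's parameter types, so the arg0, arg1, and arg2
>>>   // passed to ThenLaunch above are checked at compile time.
>>>   using MyKernelType =
>>>       se::TypedKernel<float, int, se::DeviceMemory<float> *>;
>>>   MyKernelType my_kernel(executor);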
>>>
>>> How does it work?
>>> =======================
>>>
>>>
>>> --------------------------------
>>> Detailed example
>>> --------------------------------
>>>
>>> The following example shows how we can use StreamExecutor to create a
>>> `TypedKernel` instance, associate device code with that instance, and then
>>> use that instance to schedule work on an accelerator device.
>>>
>>> .. code-block:: c++
>>>
>>>     #include <cassert>
>>>
>>>     #include "stream_executor.h"
>>>
>>>     namespace se = streamexecutor;
>>>
>>>     // A PTX string defining a CUDA kernel.
>>>     //
>>>     // This PTX string represents a kernel that takes two arguments: an
>>>     // input value and an output pointer. The input value is a floating
>>>     // point number. The output value is a pointer to a floating point
>>>     // value in device memory. The output pointer is where the output from
>>>     // the kernel will be written.
>>>     //
>>>     // The kernel adds a fixed floating point value to the input and writes
>>>     // the result to the output location.
>>>     static constexpr const char *KERNEL_PTX = R"(
>>>         .version 3.1
>>>         .target sm_20
>>>         .address_size 64
>>>         .visible .entry add_mystery_value(
>>>             .param .f32 float_literal,
>>>             .param .u64 result_loc
>>>             ) {
>>>           .reg .u64 %rl<2>;
>>>           .reg .f32 %f<2>;
>>>           ld.param.f32 %f1, [float_literal];
>>>           ld.param.u64 %rl1, [result_loc];
>>>           add.f32 %f1, %f1, 123.0;
>>>           st.f32 [%rl1], %f1;
>>>           ret;
>>>         }
>>>         )";
>>>
>>>     // The number of arguments expected by the kernel described in
>>>     // KERNEL_PTX.
>>>     static constexpr int KERNEL_ARITY = 2;
>>>
>>>     // The name of the kernel described in KERNEL_PTX.
>>>     static constexpr const char *KERNEL_NAME = "add_mystery_value";
>>>
>>>     // The value added to the input in the kernel described in KERNEL_PTX.
>>>     static constexpr float MYSTERY_VALUE = 123.0f;
>>>
>>>     int main(int argc, char *argv[]) {
>>>       // Get a CUDA Platform object. (Other platforms such as OpenCL are
>>>       // also supported.)
>>>       se::Platform *platform =
>>>           se::MultiPlatformManager::PlatformWithName("cuda").ValueOrDie();
>>>
>>>       // Get a StreamExecutor for the chosen Platform. Multiple devices are
>>>       // supported; here we indicate that we want to run on device 0.
>>>       const int device_ordinal = 0;
>>>       se::StreamExecutor *executor =
>>>           platform->ExecutorForDevice(device_ordinal).ValueOrDie();
>>>
>>>       // Create a MultiKernelLoaderSpec, which knows where to find the code
>>>       // for our kernel. In this case, the code is stored in memory as a
>>>       // PTX string.
>>>       //
>>>       // Note that the arity and name specified here must match the arity
>>>       // and name of the kernel defined in the PTX string.
>>>       se::MultiKernelLoaderSpec kernel_loader_spec(KERNEL_ARITY);
>>>       kernel_loader_spec.AddCudaPtxInMemory(KERNEL_PTX, KERNEL_NAME);
>>>
>>>       // Next create a kernel handle, which we will associate with our
>>>       // kernel code (i.e., the PTX string). The type of this handle is a
>>>       // bit verbose, so we create an alias for it.
>>>       //
>>>       // This specific type represents a kernel that takes two arguments: a
>>>       // floating point value and a pointer to a floating point value in
>>>       // device memory.
>>>       //
>>>       // A type like this is nice to have because it enables static type
>>>       // checking of kernel arguments when we enqueue work on a stream.
>>>       using KernelType = se::TypedKernel<float, se::DeviceMemory<float> *>;
>>>
>>>       // Now instantiate an object of the specific kernel type we declared
>>>       // above. The kernel object is not yet connected with the device code
>>>       // that we want it to run (that happens with the call to GetKernel
>>>       // below), so it cannot be used to execute work on the device yet.
>>>       //
>>>       // However, the kernel object is not completely empty when it is
>>>       // created. From the StreamExecutor passed into its constructor it
>>>       // knows which platform it is targeted for, and it also knows which
>>>       // device it will run on.
>>>       KernelType kernel(executor);
>>>
>>>       // Use the MultiKernelLoaderSpec defined above to load the kernel
>>>       // code onto the device pointed to by the kernel object and to make
>>>       // that kernel object a handle to the kernel code loaded on that
>>>       // device.
>>>       //
>>>       // The MultiKernelLoaderSpec may contain code for several different
>>>       // platforms, but the kernel object has an associated platform, so
>>>       // there is no confusion about which code should be loaded.
>>>       //
>>>       // After this call the kernel object can be used to launch its kernel
>>>       // on its device.
>>>       executor->GetKernel(kernel_loader_spec, &kernel);
>>>
>>>       // Allocate memory in the device memory space to hold the result of
>>>       // the kernel call. This memory will be freed when this object goes
>>>       // out of scope.
>>>       se::ScopedDeviceMemory<float> result =
>>>           executor->AllocateOwnedScalar<float>();
>>>
>>>       // Create a stream on which to schedule device operations.
>>>       se::Stream stream(executor);
>>>
>>>       // Schedule a kernel launch on the new stream and block until the
>>>       // kernel completes. The kernel call executes asynchronously on the
>>>       // device, so we could do more work on the host before calling
>>>       // BlockHostUntilDone.
>>>       const float kernel_input_argument = 42.5f;
>>>       stream.Init()
>>>           .ThenLaunch(se::ThreadDim(), se::BlockDim(), kernel,
>>>                       kernel_input_argument, result.ptr())
>>>           .BlockHostUntilDone();
>>>
>>>       // Copy the result of the kernel call from device back to the host.
>>>       float host_result = 0.0f;
>>>       executor->SynchronousMemcpyD2H(result.cref(), sizeof(host_result),
>>>                                      &host_result);
>>>
>>>       // Verify that the correct result was computed.
>>>       assert((kernel_input_argument + MYSTERY_VALUE) == host_result);
>>>     }
>>>
>>>
>>> --------------------------------
>>> Kernel Loader Specs
>>> --------------------------------
>>>
>>> An instance of the class `MultiKernelLoaderSpec` is used to encapsulate
>>> knowledge of where the device code for a kernel is stored and what format
>>> it is in.  Given a `MultiKernelLoaderSpec` and an uninitialized
>>> `TypedKernel`, calling the `StreamExecutor::GetKernel` method will load the
>>> code onto the device and associate the `TypedKernel` instance with that
>>> loaded code. So, in order to initialize a `TypedKernel` instance, it is
>>> first necessary to create a `MultiKernelLoaderSpec`.
>>>
>>> A `MultiKernelLoaderSpec` supports a different method for adding device
>>> code for each combination of platform, format, and storage location. The
>>> following table shows some examples:
>>>
>>> ===========     =======         ===========     =========================
>>> Platform        Format          Location        Setter
>>> ===========     =======         ===========     =========================
>>> CUDA            PTX             disk            `AddCudaPtxOnDisk`
>>> CUDA            PTX             memory          `AddCudaPtxInMemory`
>>> CUDA            cubin           disk            `AddCudaCubinOnDisk`
>>> CUDA            cubin           memory          `AddCudaCubinInMemory`
>>> OpenCL          text            disk            `AddOpenCLTextOnDisk`
>>> OpenCL          text            memory          `AddOpenCLTextInMemory`
>>> OpenCL          binary          disk            `AddOpenCLBinaryOnDisk`
>>> OpenCL          binary          memory          `AddOpenCLBinaryInMemory`
>>> ===========     =======         ===========     =========================
>>>
>>> The specific method used in the example is `AddCudaPtxInMemory`, but all
>>> other methods are used similarly.
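>>>
>>> For instance, a spec that carries the same kernel as a CUDA cubin on disk
>>> and as OpenCL source text on disk might be built roughly as follows. This
>>> is only a sketch: the exact parameters of the `OnDisk` setters are not
>>> shown in this document, so the file-path arguments below are an assumption
>>> modeled on `AddCudaPtxInMemory`.
>>>
>>> .. code-block:: c++
>>>
>>>     // Build a loader spec holding device code for two platforms and use it
>>>     // to initialize a kernel handle. The kernel has two parameters, so the
>>>     // arity is 2.
>>>     void LoadAddMysteryValue(
>>>         se::StreamExecutor *executor,
>>>         se::TypedKernel<float, se::DeviceMemory<float> *> *kernel) {
>>>       se::MultiKernelLoaderSpec spec(/*arity=*/2);
>>>       spec.AddCudaCubinOnDisk("add_mystery_value.cubin", "add_mystery_value");
>>>       spec.AddOpenCLTextOnDisk("add_mystery_value.cl", "add_mystery_value");
>>>
>>>       // GetKernel loads whichever entry matches the executor's platform.
>>>       executor->GetKernel(spec, kernel);
>>>     }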
>>>
>>>
>>> ------------------------------------
>>> Compiler Support for StreamExecutor
>>> ------------------------------------
>>>
>>>
>>> General strategies
>>> -------------------
>>>
>>> For illustrative purposes, the PTX code in the example is written by
>>> hand and appears as a string literal in the source code file, but it is far
>>> more typical for the kernel code to be expressed in a high level language
>>> like CUDA C++ or OpenCL C and for the device machine code to be generated
>>> by a compiler.
>>>
>>> There are several ways we can load compiled device code using
>>> StreamExecutor.
>>>
>>> One possibility is that the build system could write the compiled device
>>> code to a file on disk. This can then be added to a `MultiKernelLoaderSpec`
>>> by using one of the `OnDisk` setters.
>>>
>>> Another option is to add a feature to the compiler which embeds the
>>> compiled device code into the host executable and provides some symbol
>>> (probably with a name based on the name of the kernel) that allows the host
>>> code to refer to the embedded code data.
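>>>
>>> As a rough sketch of that second approach (the symbol name below is purely
>>> hypothetical; the real naming scheme would be defined by the compiler or
>>> build integration), the host code might look like:
>>>
>>> .. code-block:: c++
>>>
>>>     // Hypothetical symbol emitted by the build step after embedding the
>>>     // compiled device code in the host binary. The name is illustrative.
>>>     extern const char embedded_add_mystery_value_ptx[];
>>>
>>>     // The embedded data is used exactly like the hand-written PTX string in
>>>     // the detailed example above.
>>>     se::MultiKernelLoaderSpec spec(/*arity=*/2);
>>>     spec.AddCudaPtxInMemory(embedded_add_mystery_value_ptx,
>>>                             "add_mystery_value");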
>>>
>>> In fact, as discussed below, in the current use of StreamExecutor inside
>>> Google, the compiler goes even further and generates an instance of
>>> `MultiKernelLoaderSpec` for each kernel. This means the application author
>>> doesn't have to know anything about how or where the compiler decided to
>>> store the compiled device code, but instead gets a pre-made loader object
>>> that handles all those details.
>>>
>>>
>>> Compiler-generated code makes things safe
>>> --------------------------------------------
>>>
>>> Two of the steps in the example above are dangerous because they lack
>>> static safety checks: instantiating the `MultiKernelLoaderSpec` and
>>> specializing the `TypedKernel` class template. This section discusses how
>>> compiler support for StreamExecutor can make these steps safe.
>>>
>>> Instantiating a `MultiKernelLoaderSpec` requires specifying three things:
>>>
>>> 1. the kernel *arity* (number of parameters),
>>> 2. the kernel name,
>>> 3. a string containing the device machine code for the kernel (either as
>>> assembly, or some sort of object file).
>>>
>>> The problem with this is that the kernel name and the number of parameters
>>> are already fully determined by the kernel's machine code. In the best case
>>> scenario the *arity* and name arguments passed to the
>>> `MultiKernelLoaderSpec` methods match the information in the machine code
>>> and are simply redundant, but in the worst case these arguments contradict
>>> the information in the machine code and we get a runtime error when we try
>>> to load the kernel.
>>>
>>> The second unsafe operation is specifying the kernel parameter types as
>>> type arguments to the `TypedKernel` class template. The specified types
>>> must match the types defined in the kernel machine code, but again there is
>>> no compile-time checking that these types match. Failure to match these
>>> types will result in a runtime error when the kernel is launched.
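>>>
>>> To make these hazards concrete, both of the following compile cleanly
>>> against the kernel from the detailed example above, yet the mismatch is
>>> only detected at runtime (this is a sketch; exactly where the failure
>>> surfaces is not specified here):
>>>
>>> .. code-block:: c++
>>>
>>>     // Wrong arity: the PTX defines two parameters, but the spec claims one.
>>>     // Nothing catches this until the kernel is loaded.
>>>     se::MultiKernelLoaderSpec bad_spec(/*arity=*/1);
>>>     bad_spec.AddCudaPtxInMemory(KERNEL_PTX, KERNEL_NAME);
>>>
>>>     // Wrong parameter types: the kernel takes a float and a pointer to a
>>>     // float, but this specialization claims an int and a pointer to an
>>>     // int. Again, the error only appears at runtime.
>>>     using WrongKernelType = se::TypedKernel<int, se::DeviceMemory<int> *>;
>>>     WrongKernelType wrong_kernel(executor);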
>>>
>>> We would like the compiler to perform these checks for the application
>>> author, so as to eliminate this source of runtime errors. In particular, we
>>> want the compiler to create an appropriate `MultiKernelLoaderSpec` instance
>>> and `TypedKernel` specialization for each kernel definition.
>>>
>>> One of the main goals of open-sourcing StreamExecutor is to let us add
>>> this code generation capability to Clang, when the user has chosen to use
>>> StreamExecutor as their runtime for accelerator operations.
>>>
>>> Google has been using an internally developed CUDA compiler based on
>>> Clang called **gpucc** that generates code for StreamExecutor in this way.
>>> The code below shows how the example above would be written using gpucc to
>>> generate the unsafe parts of the code.
>>>
>>> The kernel is defined in a high-level language (CUDA C++ in this
>>> example) in its own file:
>>>
>>> .. code-block:: c++
>>>
>>>     // File: add_mystery_value.cu
>>>
>>>     __global__ void add_mystery_value(float input, float *output) {
>>>       *output = input + 42.0f;
>>>     }
>>>
>>> The host code is defined in another file:
>>>
>>> .. code-block:: c++
>>>
>>>     // File: example_host_code.cc
>>>
>>>     #include <cassert>
>>>
>>>     #include "stream_executor.h"
>>>
>>>     // This header is generated by the gpucc compiler and it contains the
>>>     // definitions of gpucc::kernel::AddMysteryValue and
>>>     // gpucc::spec::add_mystery_value().
>>>     //
>>>     // The name of this header file is derived from the name of the file
>>>     // containing the kernel code. The trailing ".cu" is replaced with
>>>     // ".gpu.h".
>>>     #include "add_mystery_value.gpu.h"
>>>
>>>     namespace se = streamexecutor;
>>>
>>>     int main(int argc, char *argv[]) {
>>>       se::Platform *platform =
>>>           se::MultiPlatformManager::PlatformWithName("cuda").ValueOrDie();
>>>
>>>       const int device_ordinal = 0;
>>>       se::StreamExecutor *executor =
>>>           platform->ExecutorForDevice(device_ordinal).ValueOrDie();
>>>
>>>       // AddMysteryValue is a specialization of TypedKernel generated by
>>>       // gpucc. The template arguments are chosen by the compiler to match
>>>       // the parameters of the add_mystery_value kernel.
>>>       gpucc::kernel::AddMysteryValue kernel(executor);
>>>
>>>       // gpucc::spec::add_mystery_value() is generated by gpucc. It returns
>>>       // a MultiKernelLoaderSpec that knows how to find the compiled code
>>>       // for the add_mystery_value kernel.
>>>       executor->GetKernel(gpucc::spec::add_mystery_value(), &kernel);
>>>
>>>       se::ScopedDeviceMemory<float> result =
>>>           executor->AllocateOwnedScalar<float>();
>>>       se::Stream stream(executor);
>>>
>>>       const float kernel_input_argument = 42.5f;
>>>
>>>       stream.Init()
>>>           .ThenLaunch(se::ThreadDim(), se::BlockDim(), kernel,
>>>                       kernel_input_argument, result.ptr())
>>>           .BlockHostUntilDone();
>>>
>>>       float host_result = 0.0f;
>>>       executor->SynchronousMemcpyD2H(result.cref(), sizeof(host_result),
>>>                                      &host_result);
>>>
>>>       assert((kernel_input_argument + 42.0f) == host_result);
>>>     }
>>>
>>> This support from the compiler makes the use of StreamExecutor safe and
>>> easy.
>>>
>>>
>>> Compiler support for triple angle bracket kernel launches
>>> ----------------------------------------------------------
>>>
>>> For even greater ease of use, Google's gpucc CUDA compiler also supports
>>> an integrated mode that looks like NVIDIA's `CUDA programming model`_, which
>>> uses triple angle brackets (`<<<>>>`) to launch kernels.
>>>
>>> .. _CUDA programming model:
>>> http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#kernels
>>>
>>> .. code-block:: c++
>>>     :emphasize-lines: 22
>>>
>>>     #include <cassert>
>>>
>>>     #include "stream_executor.h"
>>>
>>>     namespace se = streamexecutor;
>>>
>>>     __global__ void add_mystery_value(float input, float *output) {
>>>       *output = input + 42.0f;
>>>     }
>>>
>>>     int main(int argc, char *argv[]) {
>>>       se::Platform *platform =
>>>           se::MultiPlatformManager::PlatformWithName("cuda").ValueOrDie();
>>>
>>>       const int device_ordinal = 0;
>>>       se::StreamExecutor *executor =
>>>           platform->ExecutorForDevice(device_ordinal).ValueOrDie();
>>>
>>>       se::ScopedDeviceMemory<float> result = executor->AllocateOwnedScalar<float>();
>>>
>>>       const float kernel_input_argument = 42.5f;
>>>       add_mystery_value<<<1, 1>>>(kernel_input_argument, *result.ptr());
>>>
>>>       float host_result = 0.0f;
>>>       executor->SynchronousMemcpyD2H(result.cref(), sizeof(host_result),
>>>                                      &host_result);
>>>
>>>       assert((kernel_input_argument + 42.0f) == host_result);
>>>     }
>>>
>>> Under the hood, gpucc converts the triple angle bracket kernel call into
>>> a series of calls to the StreamExecutor library similar to the calls seen
>>> in the previous examples.
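>>>
>>> Schematically (this is not the literal generated code), the single launch
>>> line above might expand to something like the following, reusing the
>>> compiler-generated types from the previous example:
>>>
>>> .. code-block:: c++
>>>
>>>     // A schematic expansion of add_mystery_value<<<1, 1>>>(...): a typed
>>>     // kernel object, a loader-spec lookup, and a stream launch, mirroring
>>>     // the earlier examples.
>>>     gpucc::kernel::AddMysteryValue kernel(executor);
>>>     executor->GetKernel(gpucc::spec::add_mystery_value(), &kernel);
>>>
>>>     se::Stream stream(executor);
>>>     stream.Init()
>>>         .ThenLaunch(se::ThreadDim(), se::BlockDim(), kernel,
>>>                     kernel_input_argument, result.ptr())
>>>         .BlockHostUntilDone();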
>>>
>>> Clang currently supports the triple angle bracket kernel call syntax for
>>> CUDA compilation by replacing a triple angle bracket call with calls to the
>>> NVIDIA CUDA runtime library, but it would be easy to add a compiler flag to
>>> tell Clang to emit calls to the StreamExecutor library instead. There are
>>> several benefits to supporting this mode of compilation in Clang:
>>>
>>> .. _benefits-of-streamexecutor:
>>>
>>> * StreamExecutor is a high-level, modern C++ API, so it is easier to use
>>> and less prone to error than the NVIDIA CUDA runtime and the OpenCL runtime.
>>> * StreamExecutor will be open-source software, so GPU code will not have
>>> to depend on opaque binary blobs like the NVIDIA CUDA runtime library.
>>> * Using StreamExecutor as the runtime would allow for easy extension of
>>> the triple angle bracket kernel launch syntax to support different
>>> accelerator programming models.
>>>
>>>
>>> Supporting other platforms
>>> ===========================
>>>
>>> StreamExecutor currently supports CUDA and OpenCL platforms
>>> out-of-the-box, but it uses a platform plugin architecture that makes it
>>> easy to add new platforms at any time. The CUDA and OpenCL platforms are
>>> both implemented as platform plugins in this way, so they serve as good
>>> examples for future platform developers of how to write these kinds of
>>> plugins.
>>>
>>>
>>> Canned operations
>>> ==================
>>>
>>> StreamExecutor provides several predefined kernels for common
>>> data-parallel operations. The supported classes of operations are:
>>>
>>> * BLAS: basic linear algebra subprograms,
>>> * DNN: deep neural networks,
>>> * FFT: fast Fourier transforms, and
>>> * RNG: random number generation.
>>>
>>> Here is an example of using a canned operation to perform random number
>>> generation:
>>>
>>> .. code-block:: c++
>>>     :emphasize-lines: 12-13,17,34-35
>>>
>>>     #include <array>
>>>
>>>     #include "cuda/cuda_rng.h"
>>>     #include "stream_executor.h"
>>>
>>>     namespace se = streamexecutor;
>>>
>>>     int main(int argc, char *argv[]) {
>>>       se::Platform *platform =
>>>           se::MultiPlatformManager::PlatformWithName("cuda").ValueOrDie();
>>>
>>>       se::PluginConfig plugin_config;
>>>       plugin_config.SetRng(se::cuda::kCuRandPlugin);
>>>
>>>       const int device_ordinal = 0;
>>>       se::StreamExecutor *executor =
>>>           platform->ExecutorForDeviceWithPluginConfig(device_ordinal, plugin_config)
>>>               .ValueOrDie();
>>>
>>>       const uint8 seed[] = {0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7,
>>>                             0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf};
>>>       constexpr uint64 random_element_count = 1024;
>>>
>>>       using HostArray = std::array<float, random_element_count>;
>>>
>>>       HostArray host_memory;
>>>       const size_t data_size = host_memory.size() * sizeof(HostArray::value_type);
>>>
>>>       se::ScopedDeviceMemory<float> device_memory =
>>>           executor->AllocateOwnedArray<float>(random_element_count);
>>>
>>>       se::Stream stream(executor);
>>>       stream.Init()
>>>           .ThenSetRngSeed(seed, sizeof(seed))
>>>           .ThenPopulateRandUniform(device_memory.ptr())
>>>           .BlockHostUntilDone();
>>>
>>>       executor->SynchronousMemcpyD2H(*device_memory.ptr(), data_size,
>>>                                      host_memory.data());
>>>     }
>>>
>>> Each platform plugin can define its own canned operation plugins for
>>> these operations or choose to leave any of them unimplemented.
>>>
>>>
>>> Contrast with OpenMP
>>> =====================
>>>
>>> Recent versions of OpenMP also provide a high-level, easy-to-use
>>> interface for running data-parallel workloads on an accelerator device. One
>>> big difference between OpenMP's approach and that of StreamExecutor is that
>>> OpenMP generates both the kernel code that runs on the device and the
>>> host-side code needed to launch the kernel, whereas StreamExecutor only
>>> generates the host-side code. While the OpenMP model provides the
>>> convenience of allowing the author to write their kernel code in standard
>>> C/C++, the StreamExecutor model allows for the use of any kernel language
>>> (e.g. CUDA C++ or OpenCL C). This lets authors use features that are only
>>> exposed by platform-specific kernel languages.
>>>
>>> The philosophy of StreamExecutor is that performance is critical on the
>>> device, but less so on the host.  As a result, no attempt is made to use a
>>> high-level device abstraction during device code generation. Instead, the
>>> high-level abstraction provided by StreamExecutor is used only for the
>>> host-side code that moves data and launches kernels.  This host-side work
>>> is tedious and is not performance critical, so it benefits from being
>>> wrapped in a high-level library that can support a wide range of platforms
>>> in an easily extensible manner.
>>>
>>>
>>> Cooperation with OpenMP
>>> ========================
>>>
>>> The Clang OpenMP community is currently in the process of `designing
>>> their implementation`_ of offloading support. They will want the compiler
>>> to convert the various standardized target-oriented OpenMP pragmas into
>>> device code to execute on an accelerator and host code to load and run that
>>> device code. StreamExecutor may provide a convenient API for OpenMP to use
>>> to generate their host-side code.
>>>
>>> .. _designing their implementation:
>>> https://drive.google.com/a/google.com/file/d/0B-jX56_FbGKRM21sYlNYVnB4eFk/view
>>>
>>> In addition to the :ref:`benefits<benefits-of-streamexecutor>` that all
>>> users of StreamExecutor enjoy over the alternative host-side runtime
>>> libraries, OpenMP and StreamExecutor may mutually benefit by sharing work
>>> to support new platforms. If OpenMP makes use of StreamExecutor, then it
>>> should be simple for OpenMP to add support for any new platforms that
>>> StreamExecutor supports in the future. Similarly, for any platforms OpenMP
>>> would like to target, they may add that support in StreamExecutor and take
>>> advantage of the knowledge of platform support in the StreamExecutor
>>> community. The resulting new platform support would then be available not
>>> just within OpenMP, but also to any user of StreamExecutor.
>>>
>>> Although OpenMP and StreamExecutor support different programming models,
>>> some of the work they perform under the hood will likely be very similar.
>>> By sharing code and domain expertise, both projects will be improved and
>>> strengthened as their capabilities are expanded. The StreamExecutor
>>> community looks forward to much collaboration and discussion with OpenMP
>>> about the best places and ways to cooperate.
>>>
>>> _______________________________________________
>>> cfe-dev mailing list
>>> cfe-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>
>>