[llvm-dev] RFC: Proposing an LLVM subproject for parallelism runtime and support libraries
Jason Henline via llvm-dev
llvm-dev at lists.llvm.org
Thu Mar 10 11:35:16 PST 2016
> Do you currently support inter-stream synchronization generally?
At this point, it can only be done with the cooperation of the host. The
host would have to call a blocking `wait` on first one stream and then the
other.
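For concreteness, a minimal sketch of that host-mediated pattern (the kernel
handles `producer_kernel` and `consumer_kernel` and the buffer `shared_buffer`
are illustrative placeholders, assumed to be already initialized) might look
like:

  se::Stream stream_a(executor);
  se::Stream stream_b(executor);

  // Enqueue the producer work on stream_a, then have the host block until
  // stream_a has drained.
  stream_a.Init()
      .ThenLaunch(se::ThreadDim(), se::BlockDim(), producer_kernel,
                  shared_buffer.ptr())
      .BlockHostUntilDone();

  // Only once the host has observed completion of stream_a does it enqueue
  // the dependent work on stream_b.
  stream_b.Init()
      .ThenLaunch(se::ThreadDim(), se::BlockDim(), consumer_kernel,
                  shared_buffer.ptr())
      .BlockHostUntilDone();

So the ordering edge between the two streams is enforced entirely by the
host's blocking wait; no device-side synchronization primitive is involved.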
> In case you can't already tell, I'm supportive of this kind of
> functionality in LLVM's ecosystem. I'm excited that this might be a
> near-term possibility (especially once you have the ability to execute host
> tasks). There might also be a relationship between this library and what we
> need to implement the upcoming C++17 parallel algorithms library.
I'm very glad to hear it, and thank you for all your input.
On Wed, Mar 9, 2016 at 6:00 PM Hal Finkel <hfinkel at anl.gov> wrote:
> *From: *"Jason Henline" <jhen at google.com>
> *To: *"Hal Finkel" <hfinkel at anl.gov>
> *Cc: *llvm-dev at lists.llvm.org
>
> *Sent: *Wednesday, March 9, 2016 7:16:01 PM
>
>
> *Subject: *Re: [llvm-dev] RFC: Proposing an LLVM subproject for
> parallelism runtime and support libraries
>
> Thanks for your input, Hal.
>
> I think that having support for running host-side tasks makes this library
> much more useful, not only for acceleratorless systems, but even for those
> with accelerators (especially if you can place dependency edges between the
> host tasks and the accelerator tasks).
>
> Based on your comments, I think that supporting host-side tasks sounds
> like something that should be added to our roadmap, and it should be pretty
> simple to do within the current model.
>
> Great!
>
> However, supporting dependency edges between different "platforms" (as
> StreamExecutor calls them) such as host and GPU could be slightly more
> challenging. The current model organizes each stream of execution as
> belonging to a parent platform, and streams are the structures that are
> meant to manage dependency edges. It will probably take some thought to
> decide how to do that in the right way.
>
> You might get a fair amount of mileage just by allowing host tasks to be
> inserted into the stream of device tasks (I'm making assumptions here based
> on how CUDA streams work). Do you currently support inter-stream
> synchronization generally?
>
>
> Also, does your implementation support, or do you plan on supporting,
> CUDA-style unified memory between host and device?
>
> I'm not sure how much this had been considered before you mentioned it. It
> is not supported right now, but I think it would fit naturally into the
> design model. Currently a custom C++ type is used to represent device
> memory, and so we could add a sub-type to represent unified memory. In
> fact, a similar type of thing is already done for the host platform where
> memcpy operations between host and device are converted to nops. I think
> this would be a pretty easy change in the current framework.
>
> Interesting. Definitely worth talking about (although probably on some
> other dedicated thread).
>
> In case you can't already tell, I'm supportive of this kind of
> functionality in LLVM's ecosystem. I'm excited that this might be a
> near-term possibility (especially once you have the ability to execute host
> tasks). There might also be a relationship between this library and what we
> need to implement the upcoming C++17 parallel algorithms library.
>
> Thanks again,
> Hal
>
>
>
> On Wed, Mar 9, 2016 at 4:42 PM Hal Finkel <hfinkel at anl.gov> wrote:
>
>>
>> ------------------------------
>>
>> *From: *"Jason Henline" <jhen at google.com>
>> *To: *"Hal Finkel" <hfinkel at anl.gov>
>> *Cc: *llvm-dev at lists.llvm.org
>> *Sent: *Wednesday, March 9, 2016 5:04:53 PM
>> *Subject: *Re: [llvm-dev] RFC: Proposing an LLVM subproject for
>> parallelism runtime and support libraries
>>
>>
>>
>> Hi Hal,
>>
>> Thanks for taking a look at the proposal.
>>
>> The current version of StreamExecutor has partial support for a "host"
>> platform which performs work on the CPU. Its interface is the same as
>> that of the CUDA platform discussed in the design documentation, but right
>> now it does not support launching user-defined kernels, so it is very
>> limited. The host platform does manage a thread pool internally and uses
>> those threads to execute the "canned" StreamExecutor operations (BLAS, FFT,
>> etc.), but that's all it can currently do. I think it would be relatively
>> easy to extend the host platform to support launching of user-defined
>> kernels, and then its functionality may overlap quite a bit with OpenMP,
>> but we don't have any active plans to add that support at this time.
>>
>> I think that having support for running host-side tasks makes this
>> library much more useful, not only for acceleratorless systems, but even
>> for those with accelerators (especially if you can place dependency edges
>> between the host tasks and the accelerator tasks).
>>
>> Also, does your implementation support, or do you plan on supporting,
>> CUDA-style unified memory between host and device?
>>
>> Thanks again,
>> Hal
>>
>> However, this is something we may pursue in the long run because of the
>> added flexibility it would provide for porting accelerator code to a device
>> without an accelerator, etc.
>>
>> On Wed, Mar 9, 2016 at 2:31 PM Hal Finkel <hfinkel at anl.gov> wrote:
>>
>>>
>>> ------------------------------
>>>
>>> *From: *"Jason Henline via llvm-dev" <llvm-dev at lists.llvm.org>
>>> *To: *llvm-dev at lists.llvm.org
>>> *Sent: *Wednesday, March 9, 2016 4:20:15 PM
>>> *Subject: *[llvm-dev] RFC: Proposing an LLVM subproject for
>>> parallelism runtime and support libraries
>>>
>>>
>>>
>>> At Google we're doing a lot of work on parallel programming models for
>>> CPUs, GPUs and other platforms. One place where we're investing a lot is
>>> parallel libraries, especially those closely tied to compiler technology
>>> like runtime and math libraries. We would like to develop these in the
>>> open, and the natural place seems to be as a subproject in LLVM if others
>>> in the community are interested.
>>>
>>> Initially, we'd like to open source our StreamExecutor runtime library,
>>> which is used for simplifying the management of data-parallel workflows on
>>> accelerator devices and can also be extended to support other hardware
>>> platforms. We'd like to teach Clang to use StreamExecutor when targeting
>>> CUDA and work on other integrations, but that makes much more sense if it
>>> is part of the LLVM project.
>>>
>>> However, we think the LLVM subproject should be organized as a set of
>>> several libraries with StreamExecutor as just the first instance. As just
>>> one example of how creating a unified parallelism subproject could help
>>> with code sharing, the StreamExecutor library contains some nice wrappers
>>> around the CUDA driver API and OpenCL API that create a unified API for
>>> managing all kinds of GPU devices. This unified GPU wrapper would be
>>> broadly applicable for libraries that need to communicate with GPU devices.
>>>
>>> Of course, there is already an LLVM subproject for a parallel runtime
>>> library: OpenMP! So there is a question of how it would fit into this
>>> picture. Eventually, it might make sense to pull in the OpenMP project as
>>> a library in this proposed new subproject. In particular, there is a good
>>> chance that OpenMP and StreamExecutor could share code for offloading to
>>> GPUs and managing workloads on those devices. This is discussed at the end
>>> of the StreamExecutor documentation below. However, if it turns out that
>>> the needs of OpenMP are too specialized to fit well in a generic
>>> parallelism project, then it may make sense to leave OpenMP as a separate
>>> LLVM subproject so it can focus on serving the particular needs of OpenMP.
>>>
>>> The document starts by talking about work you're doing on "CPUs, GPUs
>>> and other platforms", but you've only really discussed accelerators here.
>>> I'm wondering if there is any overlap, either current or planned, with the
>>> functionality that host-side OpenMP provides. For example, is there some
>>> kind of host-side thread pool / task queue?
>>>
>>> Thanks in advance,
>>> Hal
>>>
>>> P.S. I'm really happy that it looks like you have a sane API here for
>>> handling multi-GPU systems. Dealing with cudaSetDevice is a real pain.
>>>
>>>
>>> Documentation for the StreamExecutor library that is being proposed for
>>> open-sourcing is included below to give a sense of what it is, in order to
>>> give context for how it might fit into a general parallelism LLVM
>>> subproject.
>>>
>>> What do folks think? Is there general interest in something like this?
>>> If so, we can start working on getting a project in place and sketching out
>>> a skeleton for how it would be organized, as well as contributing
>>> StreamExecutor to it. We're happy to iterate on the particulars to figure
>>> out what works for the community.
>>>
>>>
>>> =============================================
>>> StreamExecutor Runtime Library Documentation
>>> =============================================
>>>
>>>
>>> What is StreamExecutor?
>>> ========================
>>>
>>> **StreamExecutor** is a unified wrapper around the **CUDA** and
>>> **OpenCL** host-side programming models (runtimes). It lets host code
>>> target either CUDA or OpenCL devices with identically-functioning
>>> data-parallel kernels. StreamExecutor manages the execution of concurrent
>>> work targeting the accelerator similarly to how an Executor_ from the
>>> Google APIs client library manages the execution of concurrent work on the
>>> host.
>>>
>>> .. _Executor:
>>> http://google.github.io/google-api-cpp-client/latest/doxygen/classgoogleapis_1_1thread_1_1Executor.html
>>>
>>> StreamExecutor is currently used as the runtime for the vast majority of
>>> Google's internal GPGPU applications, and a snapshot of it is included in
>>> the open-source TensorFlow_ project, where it serves as the GPGPU runtime.
>>>
>>> .. _TensorFlow: https://www.tensorflow.org
>>>
>>> It is currently proposed that StreamExecutor itself be independently
>>> open-sourced. As part of that proposal, this document describes the basics
>>> of its design and explains why it would fit in well as an LLVM subproject.
>>>
>>>
>>> -------------------
>>> Key points
>>> -------------------
>>>
>>> StreamExecutor:
>>>
>>> * abstracts the underlying accelerator platform (avoids locking you into
>>> a single vendor, and lets you write code without thinking about which
>>> platform you'll be running on).
>>> * provides an open-source alternative to the CUDA runtime library.
>>> * gives users a stream management model whose terminology matches that
>>> of the CUDA programming model.
>>> * makes use of modern C++ to create a safe, efficient, easy-to-use
>>> programming interface.
>>>
>>> StreamExecutor makes it easy to:
>>>
>>> * move data between host and accelerator (and also between peer
>>> accelerators).
>>> * execute data-parallel kernels written in the OpenCL or CUDA kernel
>>> languages.
>>> * inspect the capabilities of a GPU-like device at runtime.
>>> * manage multiple devices (a brief sketch follows this list).
>>>
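>>> As a brief illustration of the last point, here is a hedged sketch that
>>> acquires one `StreamExecutor` per visible device on the CUDA platform. The
>>> `GetCudaExecutors` helper and the use of `Platform::VisibleDeviceCount` for
>>> device enumeration are assumptions for illustration, not part of the
>>> examples below.
>>>
>>> .. code-block:: c++
>>>
>>> #include <vector>
>>>
>>> #include "stream_executor.h"
>>>
>>> namespace se = streamexecutor;
>>>
>>> // Sketch: collect a StreamExecutor for each visible CUDA device.
>>> std::vector<se::StreamExecutor *> GetCudaExecutors() {
>>>   se::Platform *platform =
>>>       se::MultiPlatformManager::PlatformWithName("cuda").ValueOrDie();
>>>   std::vector<se::StreamExecutor *> executors;
>>>   for (int ordinal = 0; ordinal < platform->VisibleDeviceCount();
>>>        ++ordinal) {
>>>     executors.push_back(platform->ExecutorForDevice(ordinal).ValueOrDie());
>>>   }
>>>   return executors;
>>> }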
>>>
>>> --------------------------------
>>> Example code snippet
>>> --------------------------------
>>>
>>> The StreamExecutor API uses abstractions that will be familiar to those
>>> who have worked with other GPU APIs: **Streams**, **Timers**, and
>>> **Kernels**. Its API is *fluent*, meaning that it allows the user to chain
>>> together a sequence of related operations on a stream, as in the following
>>> code snippet:
>>>
>>> .. code-block:: c++
>>>
>>> se::Stream stream(executor);
>>> se::Timer timer(executor);
>>> stream.InitWithTimer(&timer)
>>>     .ThenStartTimer(&timer)
>>>     .ThenLaunch(se::ThreadDim(dim_block_x, dim_block_y),
>>>                 se::BlockDim(dim_grid_x, dim_grid_y),
>>>                 my_kernel,
>>>                 arg0, arg1, arg2)
>>>     .ThenStopTimer(&timer)
>>>     .BlockHostUntilDone();
>>>
>>> The name of the kernel being launched in the snippet above is
>>> `my_kernel` and the arguments being passed to the kernel are `arg0`,
>>> `arg1`, and `arg2`. Kernels with any number of arguments of any types are
>>> supported, and the number and types of the arguments are checked at compile
>>> time.
>>>
>>> How does it work?
>>> =======================
>>>
>>>
>>> --------------------------------
>>> Detailed example
>>> --------------------------------
>>>
>>> The following example shows how we can use StreamExecutor to create a
>>> `TypedKernel` instance, associate device code with that instance, and then
>>> use that instance to schedule work on an accelerator device.
>>>
>>> .. code-block:: c++
>>>
>>> #include <cassert>
>>>
>>> #include "stream_executor.h"
>>>
>>> namespace se = streamexecutor;
>>>
>>> // A PTX string defining a CUDA kernel.
>>> //
>>> // This PTX string represents a kernel that takes two arguments: an input
>>> // value and an output pointer. The input value is a floating point number.
>>> // The output value is a pointer to a floating point value in device
>>> // memory. The output pointer is where the output from the kernel will be
>>> // written.
>>> //
>>> // The kernel adds a fixed floating point value to the input and writes the
>>> // result to the output location.
>>> static constexpr const char *KERNEL_PTX = R"(
>>>   .version 3.1
>>>   .target sm_20
>>>   .address_size 64
>>>   .visible .entry add_mystery_value(
>>>       .param .f32 float_literal,
>>>       .param .u64 result_loc
>>>   ) {
>>>     .reg .u64 %rl<2>;
>>>     .reg .f32 %f<2>;
>>>     ld.param.f32 %f1, [float_literal];
>>>     ld.param.u64 %rl1, [result_loc];
>>>     add.f32 %f1, %f1, 123.0;
>>>     st.f32 [%rl1], %f1;
>>>     ret;
>>>   }
>>> )";
>>>
>>> // The number of arguments expected by the kernel described in KERNEL_PTX.
>>> static constexpr int KERNEL_ARITY = 2;
>>>
>>> // The name of the kernel described in KERNEL_PTX.
>>> static constexpr const char *KERNEL_NAME = "add_mystery_value";
>>>
>>> // The value added to the input in the kernel described in KERNEL_PTX.
>>> static constexpr float MYSTERY_VALUE = 123.0f;
>>>
>>> int main(int argc, char *argv[]) {
>>>   // Get a CUDA Platform object. (Other platforms such as OpenCL are also
>>>   // supported.)
>>>   se::Platform *platform =
>>>       se::MultiPlatformManager::PlatformWithName("cuda").ValueOrDie();
>>>
>>>   // Get a StreamExecutor for the chosen Platform. Multiple devices are
>>>   // supported; we indicate here that we want to run on device 0.
>>>   const int device_ordinal = 0;
>>>   se::StreamExecutor *executor =
>>>       platform->ExecutorForDevice(device_ordinal).ValueOrDie();
>>>
>>>   // Create a MultiKernelLoaderSpec, which knows where to find the code for
>>>   // our kernel. In this case, the code is stored in memory as a PTX string.
>>>   //
>>>   // Note that the "arity" and name specified here must match the "arity"
>>>   // and name of the kernel defined in the PTX string.
>>>   se::MultiKernelLoaderSpec kernel_loader_spec(KERNEL_ARITY);
>>>   kernel_loader_spec.AddCudaPtxInMemory(KERNEL_PTX, KERNEL_NAME);
>>>
>>>   // Next create a kernel handle, which we will associate with our kernel
>>>   // code (i.e., the PTX string). The type of this handle is a bit verbose,
>>>   // so we create an alias for it.
>>>   //
>>>   // This specific type represents a kernel that takes two arguments: a
>>>   // floating point value and a pointer to a floating point value in device
>>>   // memory.
>>>   //
>>>   // A type like this is nice to have because it enables static type
>>>   // checking of kernel arguments when we enqueue work on a stream.
>>>   using KernelType = se::TypedKernel<float, se::DeviceMemory<float> *>;
>>>
>>>   // Now instantiate an object of the specific kernel type we declared
>>>   // above. The kernel object is not yet connected with the device code
>>>   // that we want it to run (that happens with the call to GetKernel
>>>   // below), so it cannot be used to execute work on the device yet.
>>>   //
>>>   // However, the kernel object is not completely empty when it is created.
>>>   // From the StreamExecutor passed into its constructor it knows which
>>>   // platform it is targeted for, and it also knows which device it will
>>>   // run on.
>>>   KernelType kernel(executor);
>>>
>>>   // Use the MultiKernelLoaderSpec defined above to load the kernel code
>>>   // onto the device pointed to by the kernel object and to make that
>>>   // kernel object a handle to the kernel code loaded on that device.
>>>   //
>>>   // The MultiKernelLoaderSpec may contain code for several different
>>>   // platforms, but the kernel object has an associated platform, so there
>>>   // is no confusion about which code should be loaded.
>>>   //
>>>   // After this call the kernel object can be used to launch its kernel on
>>>   // its device.
>>>   executor->GetKernel(kernel_loader_spec, &kernel);
>>>
>>>   // Allocate memory in the device memory space to hold the result of the
>>>   // kernel call. This memory will be freed when this object goes out of
>>>   // scope.
>>>   se::ScopedDeviceMemory<float> result =
>>>       executor->AllocateOwnedScalar<float>();
>>>
>>>   // Create a stream on which to schedule device operations.
>>>   se::Stream stream(executor);
>>>
>>>   // Schedule a kernel launch on the new stream and block until the kernel
>>>   // completes. The kernel call executes asynchronously on the device, so
>>>   // we could do more work on the host before calling BlockHostUntilDone.
>>>   const float kernel_input_argument = 42.5f;
>>>   stream.Init()
>>>       .ThenLaunch(se::ThreadDim(), se::BlockDim(), kernel,
>>>                   kernel_input_argument, result.ptr())
>>>       .BlockHostUntilDone();
>>>
>>>   // Copy the result of the kernel call from device back to the host.
>>>   float host_result = 0.0f;
>>>   executor->SynchronousMemcpyD2H(result.cref(), sizeof(host_result),
>>>                                  &host_result);
>>>
>>>   // Verify that the correct result was computed.
>>>   assert((kernel_input_argument + MYSTERY_VALUE) == host_result);
>>> }
>>>
>>>
>>> --------------------------------
>>> Kernel Loader Specs
>>> --------------------------------
>>>
>>> An instance of the class `MultiKernelLoaderSpec` is used to encapsulate
>>> knowledge of where the device code for a kernel is stored and what format
>>> it is in. Given a `MultiKernelLoaderSpec` and an uninitialized
>>> `TypedKernel`, calling the `StreamExecutor::GetKernel` method will load the
>>> code onto the device and associate the `TypedKernel` instance with that
>>> loaded code. So, in order to initialize a `TypedKernel` instance, it is
>>> first necessary to create a `MultiKernelLoaderSpec`.
>>>
>>> A `MultiKernelLoaderSpec` supports a different method for adding device
>>> code for each combination of platform, format, and storage location. The
>>> following table shows some examples:
>>>
>>> =========== ======= =========== =========================
>>> Platform Format Location Setter
>>> =========== ======= =========== =========================
>>> CUDA PTX disk `AddCudaPtxOnDisk`
>>> CUDA PTX memory `AddCudaPtxInMemory`
>>> CUDA cubin disk `AddCudaCubinOnDisk`
>>> CUDA cubin memory `AddCudaCubinInMemory`
>>> OpenCL text disk `AddOpenCLTextOnDisk`
>>> OpenCL text memory `AddOpenCLTextInMemory`
>>> OpenCL binary disk `AddOpenCLBinaryOnDisk`
>>> OpenCL binary memory `AddOpenCLBinaryInMemory`
>>> =========== ======= =========== =========================
>>>
>>> The specific method used in the example is `AddCudaPtxInMemory`, but all
>>> other methods are used similarly.
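>>>
>>> As a hedged sketch (the argument order of the other setters is assumed to
>>> mirror `AddCudaPtxInMemory`, and the cubin file name and the
>>> `OPENCL_KERNEL_SOURCE` string are placeholders), a single spec can carry
>>> code for more than one platform:
>>>
>>> .. code-block:: c++
>>>
>>> se::MultiKernelLoaderSpec spec(KERNEL_ARITY);
>>> // CUDA machine code stored in a cubin file on disk.
>>> spec.AddCudaCubinOnDisk("add_mystery_value.cubin", KERNEL_NAME);
>>> // OpenCL C source held in memory as a string.
>>> spec.AddOpenCLTextInMemory(OPENCL_KERNEL_SOURCE, KERNEL_NAME);
>>>
>>> At `GetKernel` time, the platform associated with the kernel object
>>> determines which of these entries is actually loaded.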
>>>
>>>
>>> ------------------------------------
>>> Compiler Support for StreamExecutor
>>> ------------------------------------
>>>
>>>
>>> General strategies
>>> -------------------
>>>
>>> For illustrative purposes, the PTX code in the example is written by
>>> hand and appears as a string literal in the source code file, but it is far
>>> more typical for the kernel code to be expressed in a high level language
>>> like CUDA C++ or OpenCL C and for the device machine code to be generated
>>> by a compiler.
>>>
>>> There are several ways we can load compiled device code using
>>> StreamExecutor.
>>>
>>> One possibility is that the build system could write the compiled device
>>> code to a file on disk. This can then be added to a `MultiKernelLoaderSpec`
>>> by using one of the `OnDisk` setters.
>>>
>>> Another option is to add a feature to the compiler which embeds the
>>> compiled device code into the host executable and provides some symbol
>>> (probably with a name based on the name of the kernel) that allows the host
>>> code to refer to the embedded code data.
>>>
>>> In fact, as discussed below, in the current use of StreamExecutor inside
>>> Google, the compiler goes even further and generates an instance of
>>> `MultiKernelLoaderSpec` for each kernel. This means the application author
>>> doesn't have to know anything about how or where the compiler decided to
>>> store the compiled device code, but instead gets a pre-made loader object
>>> that handles all those details.
>>>
>>>
>>> Compiler-generated code makes things safe
>>> --------------------------------------------
>>>
>>> Two of the steps in the example above are dangerous because they lack
>>> static safety checks: instantiating the `MultiKernelLoaderSpec` and
>>> specializing the `TypedKernel` class template. This section discusses how
>>> compiler support for StreamExecutor can make these steps safe.
>>>
>>> Instantiating a `MultiKernelLoaderSpec` requires specifying three things:
>>>
>>> 1. the kernel *arity* (number of parameters),
>>> 2. the kernel name,
>>> 3. a string containing the device machine code for the kernel (either as
>>> assembly, or some sort of object file).
>>>
>>> The problem with this is that the kernel name and the number of
>>> parameters are already fully determined by the kernel's machine code. In
>>> the best-case scenario the *arity* and name arguments passed to the
>>> `MultiKernelLoaderSpec` methods match the information in the machine code
>>> and are simply redundant, but in the worst case these arguments contradict
>>> the information in the machine code and we get a runtime error when we try
>>> to load the kernel.
>>>
>>> The second unsafe operation is specifying the kernel parameter types as
>>> type arguments to the `TypedKernel` class template. The specified types
>>> must match the types defined in the kernel machine code, but again there is
>>> no compile-time checking that these types match. Failure to match these
>>> types will result in a runtime error when the kernel is launched.
>>>
>>> We would like the compiler to perform these checks for the application
>>> author, so as to eliminate this source of runtime errors. In particular, we
>>> want the compiler to create an appropriate `MultiKernelLoaderSpec` instance
>>> and `TypedKernel` specialization for each kernel definition.
>>>
>>> One of the main goals of open-sourcing StreamExecutor is to let us add
>>> this code generation capability to Clang, when the user has chosen to use
>>> StreamExecutor as their runtime for accelerator operations.
>>>
>>> Google has been using an internally developed CUDA compiler based on
>>> Clang called **gpucc** that generates code for StreamExecutor in this way.
>>> The code below shows how the example above would be written using gpucc to
>>> generate the unsafe parts of the code.
>>>
>>> The kernel is defined in a high-level language (CUDA C++ in this
>>> example) in its own file:
>>>
>>> .. code-block:: c++
>>>
>>> // File: add_mystery_value.cu
>>>
>>> __global__ void add_mystery_value(float input, float *output) {
>>> *output = input + 42.0f;
>>> }
>>>
>>> The host code is defined in another file:
>>>
>>> .. code-block:: c++
>>>
>>> // File: example_host_code.cc
>>>
>>> #include <cassert>
>>>
>>> #include "stream_executor.h"
>>>
>>> // This header is generated by the gpucc compiler and it contains the
>>> // definitions of gpucc::kernel::AddMysteryValue and
>>> // gpucc::spec::add_mystery_value().
>>> //
>>> // The name of this header file is derived from the name of the file
>>> // containing the kernel code. The trailing ".cu" is replaced with ".gpu.h".
>>> #include "add_mystery_value.gpu.h"
>>>
>>> namespace se = streamexecutor;
>>>
>>> int main(int argc, char *argv[]) {
>>>   se::Platform *platform =
>>>       se::MultiPlatformManager::PlatformWithName("cuda").ValueOrDie();
>>>
>>>   const int device_ordinal = 0;
>>>   se::StreamExecutor *executor =
>>>       platform->ExecutorForDevice(device_ordinal).ValueOrDie();
>>>
>>>   // AddMysteryValue is an instance of TypedKernel generated by gpucc. The
>>>   // template arguments are chosen by the compiler to match the parameters
>>>   // of the add_mystery_value kernel.
>>>   gpucc::kernel::AddMysteryValue kernel(executor);
>>>
>>>   // gpucc::spec::add_mystery_value() is generated by gpucc. It returns a
>>>   // MultiKernelLoaderSpec that knows how to find the compiled code for the
>>>   // add_mystery_value kernel.
>>>   executor->GetKernel(gpucc::spec::add_mystery_value(), &kernel);
>>>
>>>   se::ScopedDeviceMemory<float> result =
>>>       executor->AllocateOwnedScalar<float>();
>>>   se::Stream stream(executor);
>>>
>>>   const float kernel_input_argument = 42.5f;
>>>   stream.Init()
>>>       .ThenLaunch(se::ThreadDim(), se::BlockDim(), kernel,
>>>                   kernel_input_argument, result.ptr())
>>>       .BlockHostUntilDone();
>>>
>>>   float host_result = 0.0f;
>>>   executor->SynchronousMemcpyD2H(result.cref(), sizeof(host_result),
>>>                                  &host_result);
>>>
>>>   assert((kernel_input_argument + 42.0f) == host_result);
>>> }
>>>
>>> This support from the compiler makes the use of StreamExecutor safe and
>>> easy.
>>>
>>>
>>> Compiler support for triple angle bracket kernel launches
>>> ----------------------------------------------------------
>>>
>>> For even greater ease of use, Google's gpucc CUDA compiler also supports
>>> an integrated mode that looks like NVIDIA's `CUDA programming model`_, which
>>> uses triple angle brackets (`<<<>>>`) to launch kernels.
>>>
>>> .. _CUDA programming model:
>>> http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#kernels
>>>
>>> .. code-block:: c++
>>> :emphasize-lines: 22
>>>
>>> #include <cassert>
>>>
>>> #include "stream_executor.h"
>>>
>>> namespace se = streamexecutor;
>>>
>>> __global__ void add_mystery_value(float input, float *output) {
>>>   *output = input + 42.0f;
>>> }
>>>
>>> int main(int argc, char *argv[]) {
>>>   se::Platform *platform =
>>>       se::MultiPlatformManager::PlatformWithName("cuda").ValueOrDie();
>>>
>>>   const int device_ordinal = 0;
>>>   se::StreamExecutor *executor =
>>>       platform->ExecutorForDevice(device_ordinal).ValueOrDie();
>>>
>>>   se::ScopedDeviceMemory<float> result =
>>>       executor->AllocateOwnedScalar<float>();
>>>
>>>   const float kernel_input_argument = 42.5f;
>>>   add_mystery_value<<<1, 1>>>(kernel_input_argument, *result.ptr());
>>>
>>>   float host_result = 0.0f;
>>>   executor->SynchronousMemcpyD2H(result.cref(), sizeof(host_result),
>>>                                  &host_result);
>>>
>>>   assert((kernel_input_argument + 42.0f) == host_result);
>>> }
>>>
>>> Under the hood, gpucc converts the triple angle bracket kernel call into
>>> a series of calls to the StreamExecutor library similar to the calls seen
>>> in the previous examples.
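>>>
>>> Roughly speaking (this is an illustrative sketch rather than the exact code
>>> gpucc emits, and it reuses `executor`, `kernel_input_argument`, and `result`
>>> from the example above), the `<<<1, 1>>>` launch corresponds to something
>>> like:
>>>
>>> .. code-block:: c++
>>>
>>> // Hypothetical expansion of add_mystery_value<<<1, 1>>>(...). The launch
>>> // is enqueued on a stream; the launch itself does not block the host.
>>> gpucc::kernel::AddMysteryValue kernel(executor);
>>> executor->GetKernel(gpucc::spec::add_mystery_value(), &kernel);
>>> se::Stream stream(executor);
>>> stream.Init().ThenLaunch(se::ThreadDim(), se::BlockDim(), kernel,
>>>                          kernel_input_argument, result.ptr());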
>>>
>>> Clang currently supports the triple angle bracket kernel call syntax for
>>> CUDA compilation by replacing a triple angle bracket call with calls to the
>>> NVIDIA CUDA runtime library, but it would be easy to add a compiler flag to
>>> tell Clang to emit calls to the StreamExecutor library instead. There are
>>> several benefits to supporting this mode of compilation in Clang:
>>>
>>> .. _benefits-of-streamexecutor:
>>>
>>> * StreamExecutor is a high-level, modern C++ API, so it is easier to use
>>> and less prone to error than the NVIDIA CUDA runtime and the OpenCL runtime.
>>> * StreamExecutor will be open-source software, so GPU code will not have
>>> to depend on opaque binary blobs like the NVIDIA CUDA runtime library.
>>> * Using StreamExecutor as the runtime would allow for easy extension of
>>> the triple angle bracket kernel launch syntax to support different
>>> accelerator programming models.
>>>
>>>
>>> Supporting other platforms
>>> ===========================
>>>
>>> StreamExecutor currently supports CUDA and OpenCL platforms
>>> out-of-the-box, but it uses a platform plugin architecture that makes it
>>> easy to add new platforms at any time. The CUDA and OpenCL platforms are
>>> both implemented as platform plugins in this way, so they serve as good
>>> examples for future platform developers of how to write these kinds of
>>> plugins.
>>>
>>>
>>> Canned operations
>>> ==================
>>>
>>> StreamExecutor provides several predefined kernels for common
>>> data-parallel operations. The supported classes of operations are:
>>>
>>> * BLAS: basic linear algebra subprograms,
>>> * DNN: deep neural networks,
>>> * FFT: fast Fourier transforms, and
>>> * RNG: random number generation.
>>>
>>> Here is an example of using a canned operation to perform random number
>>> generation:
>>>
>>> .. code-block:: c++
>>> :emphasize-lines: 12-13,17,34-35
>>>
>>> #include <array>
>>>
>>> #include "cuda/cuda_rng.h"
>>> #include "stream_executor.h"
>>>
>>> namespace se = streamexecutor;
>>>
>>> int main(int argc, char *argv[]) {
>>>   se::Platform *platform =
>>>       se::MultiPlatformManager::PlatformWithName("cuda").ValueOrDie();
>>>
>>>   se::PluginConfig plugin_config;
>>>   plugin_config.SetRng(se::cuda::kCuRandPlugin);
>>>
>>>   const int device_ordinal = 0;
>>>   se::StreamExecutor *executor =
>>>       platform->ExecutorForDeviceWithPluginConfig(device_ordinal,
>>>                                                   plugin_config)
>>>           .ValueOrDie();
>>>
>>>   const uint8 seed[] = {0x0, 0x1, 0x2, 0x3, 0x4, 0x5, 0x6, 0x7,
>>>                         0x8, 0x9, 0xa, 0xb, 0xc, 0xd, 0xe, 0xf};
>>>   constexpr uint64 random_element_count = 1024;
>>>
>>>   using HostArray = std::array<float, random_element_count>;
>>>
>>>   HostArray host_memory;
>>>   const size_t data_size =
>>>       host_memory.size() * sizeof(HostArray::value_type);
>>>
>>>   se::ScopedDeviceMemory<float> device_memory =
>>>       executor->AllocateOwnedArray<float>(random_element_count);
>>>
>>>   se::Stream stream(executor);
>>>   stream.Init()
>>>       .ThenSetRngSeed(seed, sizeof(seed))
>>>       .ThenPopulateRandUniform(device_memory.ptr())
>>>       .BlockHostUntilDone();
>>>
>>>   executor->SynchronousMemcpyD2H(*device_memory.ptr(), data_size,
>>>                                  host_memory.data());
>>> }
>>>
>>> Each platform plugin can define its own canned operation plugins for
>>> these operations or choose to leave any of them unimplemented.
>>>
>>>
>>> Contrast with OpenMP
>>> =====================
>>>
>>> Recent versions of OpenMP also provide a high-level, easy-to-use
>>> interface for running data-parallel workloads on an accelerator device. One
>>> big difference between OpenMP's approach and that of StreamExecutor is that
>>> OpenMP generates both the kernel code that runs on the device and the
>>> host-side code needed to launch the kernel, whereas StreamExecutor only
>>> generates the host-side code. While the OpenMP model provides the
>>> convenience of allowing the author to write their kernel code in standard
>>> C/C++, the StreamExecutor model allows for the use of any kernel language
>>> (e.g. CUDA C++ or OpenCL C). This lets authors use platform-specific
>>> features that are only present in platform-specific kernel definition
>>> languages.
>>>
>>> The philosophy of StreamExecutor is that performance is critical on the
>>> device, but less so on the host. As a result, no attempt is made to use a
>>> high-level device abstraction during device code generation. Instead, the
>>> high-level abstraction provided by StreamExecutor is used only for the
>>> host-side code that moves data and launches kernels. This host-side work
>>> is tedious and is not performance critical, so it benefits from being
>>> wrapped in a high-level library that can support a wide range of platforms
>>> in an easily extensible manner.
>>>
>>>
>>> Cooperation with OpenMP
>>> ========================
>>>
>>> The Clang OpenMP community is currently in the process of `designing
>>> their implementation`_ of offloading support. They will want the compiler
>>> to convert the various standardized target-oriented OpenMP pragmas into
>>> device code to execute on an accelerator and host code to load and run that
>>> device code. StreamExecutor may provide a convenient API for OpenMP to use
>>> to generate their host-side code.
>>>
>>> .. _designing their implementation:
>>> https://drive.google.com/a/google.com/file/d/0B-jX56_FbGKRM21sYlNYVnB4eFk/view
>>>
>>> In addition to the :ref:`benefits<benefits-of-streamexecutor>` that all
>>> users of StreamExecutor enjoy over the alternative host-side runtime
>>> libraries, OpenMP and StreamExecutor may mutually benefit by sharing work
>>> to support new platforms. If OpenMP makes use of StreamExecutor, then it
>>> should be simple for OpenMP to add support for any new platforms that
>>> StreamExecutor supports in the future. Similarly, for any platforms OpenMP
>>> would like to target, they may add that support in StreamExecutor and take
>>> advantage of the knowledge of platform support in the StreamExecutor
>>> community. The resulting new platform support would then be available not
>>> just within OpenMP, but also to any user of StreamExecutor.
>>>
>>> Although OpenMP and StreamExecutor support different programming models,
>>> some of the work they perform under the hood will likely be very similar.
>>> By sharing code and domain expertise, both projects will be improved and
>>> strengthened as their capabilities are expanded. The StreamExecutor
>>> community looks forward to much collaboration and discussion with OpenMP
>>> about the best places and ways to cooperate.
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Hal Finkel
>>> Assistant Computational Scientist
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>>
>>
>>
>>
>> --
>> Hal Finkel
>> Assistant Computational Scientist
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>
>
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
>