[cfe-dev] RFC: Proposing an LLVM subproject for parallelism runtime and support libraries

Jason Henline via cfe-dev cfe-dev at lists.llvm.org
Wed Mar 9 16:50:07 PST 2016


Thanks for your interest. I think you bring up some very good questions.

1) How well does this align with what the C++ standard is doing for
accelerator parallelism?

I think StreamExecutor will live basically independently of any
accelerator-specific changes to the C++ standard. StreamExecutor only wraps
the host-side code that launches kernels and has no real opinion about how
those kernels are created. If C++ introduces annotations or other
constructs that allow functions or blocks to be run on an accelerator, I
would expect C++ to simply become another supported accelerator
programming language, in the same way that CUDA and OpenCL are currently
supported.
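
To make that division of labor concrete, here is a rough sketch of what
host-side code looks like through StreamExecutor. All of the names below
(Stream, ThenLaunch, DeviceMemory, and so on) should be read as
illustrative placeholders rather than as the final API:

  namespace se = ::streamexecutor;  // placeholder namespace

  // Get an executor for a CUDA device and create a stream of work on it.
  se::StreamExecutor *executor = GetExecutor(se::PlatformKind::kCuda);
  se::Stream stream(executor);
  stream.Init();

  // Copy the input to the device, launch a kernel, and copy the result
  // back. The kernel could have been written in CUDA or OpenCL;
  // StreamExecutor has no opinion about how it was produced.
  se::DeviceMemory<float> x = executor->AllocateArray<float>(kElementCount);
  stream.ThenMemcpyH2D(host_x, &x)
        .ThenLaunch(se::ThreadDim(kThreadsPerBlock),
                    se::BlockDim(kBlockCount), kernel, x, kElementCount)
        .ThenMemcpyD2H(x, &host_x)
        .BlockHostUntilDone();

If C++ itself grows a way to mark code for accelerator execution, only the
kernel side of that picture changes; the host-side calls stay the same.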

2) Do you have any benchmarks showing how much it costs to use this
wrapper vs bare CUDA?

I think the appropriate comparison here would be between StreamExecutor and
the Nvidia CUDA runtime library. I don't have numbers for that comparison,
but we do measure the time spent in StreamExecutor calls as a fraction of
the total runtime of several of our real applications. In those
measurements, we find that the StreamExecutor calls take up less than 1% of
the total runtime, so we have been satisfied with that level of performance
so far.

3) What sort of changes exactly would be needed inside clang/llvm to
make it do what you need?

The changes would all be inside Clang. Clang already supports compiling
CUDA code by lowering to calls to the Nvidia CUDA runtime library. We would
introduce a new option in Clang for using the StreamExecutor library
instead. Some changes are needed in Sema, because it is currently
hardcoded to look for an Nvidia CUDA runtime library function with a
specific name in order to determine which types are allowed as arguments
in the CUDA triple-angle-bracket launch syntax. Then there would be changes
in CodeGen to optionally lower CUDA kernel calls to StreamExecutor
library calls, where they are currently lowered to Nvidia CUDA runtime
library calls.
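
As a simplified example, a launch like

  axpy<<<blocks, threads>>>(a, x, y);

is lowered today to a sequence of Nvidia CUDA runtime calls along these
lines (simplified; the real lowering handles argument offsets and error
checking):

  cudaConfigureCall(blocks, threads, /*sharedMem=*/0, /*stream=*/0);
  cudaSetupArgument(&a, sizeof(a), /*offset=*/0);
  // ... one cudaSetupArgument call per remaining kernel argument ...
  cudaLaunch((const void *)axpy);

Under the proposed option, CodeGen would emit calls into StreamExecutor's
launch path instead; the exact entry points there are still to be worked
out.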

4) How is this different from, say, Thrust, AMD's wrapper libs, RAJA, etc.?

My understanding is that Thrust only supports STL-style operations, whereas
StreamExecutor will support general user-defined kernels.
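
For example, a canned Thrust operation looks like this (this is the real
Thrust API, shown for contrast):

  #include <thrust/device_vector.h>
  #include <thrust/reduce.h>

  int main() {
    // Sum 1000 ones on the device with Thrust's built-in reduce.
    thrust::device_vector<int> data(1000, 1);
    int sum = thrust::reduce(data.begin(), data.end(), 0);  // sum == 1000
    return sum == 1000 ? 0 : 1;
  }

StreamExecutor, by contrast, is built around launching arbitrary kernels
that the user has written themselves.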

I'm not personally familiar with AMD's wrapper libs or RAJA. If you have
links to point me in the right direction, I would be happy to comment on
any similarities or differences.

5) Does it handle collapse, reductions and complex types?

I don't think I fully understand the question here. If you mean reductions
in the sense of the generic programming operation, there is no direct
support: a user would currently have to write their own kernel for that.
However, StreamExecutor does support some common "canned" operations, and
that set of operations could be extended to include reductions.
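
For illustration, the kind of kernel a user would write themselves today
might look like this minimal, unoptimized single-block sum. It assumes a
power-of-two block size and a launch with blockDim.x * sizeof(float)
bytes of dynamic shared memory:

  __global__ void reduce_sum(const float *in, float *out, int n) {
    extern __shared__ float scratch[];
    int tid = threadIdx.x;

    // Each thread accumulates a strided slice of the input.
    float v = 0.0f;
    for (int i = tid; i < n; i += blockDim.x)
      v += in[i];
    scratch[tid] = v;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
      if (tid < s)
        scratch[tid] += scratch[tid + s];
      __syncthreads();
    }

    if (tid == 0)
      *out = scratch[0];
  }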

For complex types, the support will depend on what the kernel language
(such as CUDA or OpenCL) supports. StreamExecutor will basically just treat
the data as bytes and shuttle them to and from the accelerator as needed.

6) On the CPU side does it just lower to pthreads?

Yes, it is just pthreads under the hood. The host executor is not very
clever at this point; it was mostly developed as a way for us to keep in
mind how the interface would need to look on different platforms.

On Wed, Mar 9, 2016 at 3:58 PM C Bergström <cbergstrom at pathscale.com> wrote:

> On Thu, Mar 10, 2016 at 4:30 AM, Jason Henline via cfe-dev
> <cfe-dev at lists.llvm.org> wrote:
> > At Google we're doing a lot of work on parallel programming models
> > for CPUs, GPUs and other platforms. One place where we're investing
> > a lot are parallel libraries, especially those closely tied to
> > compiler technology like runtime and math libraries. We would like
> > to develop these in the open, and the natural place seems to be as
> > a subproject in LLVM if others in the community are interested.
> >
> > Initially, we'd like to open source our StreamExecutor runtime
> > library, which is used for simplifying the management of
> > data-parallel workflows on accelerator devices and can also be
> > extended to support other hardware platforms. We'd like to teach
> > Clang to use StreamExecutor when targeting CUDA and work on other
> > integrations, but that makes much more sense if it is part of the
> > LLVM project.
>
> Sounds like a neat project!
>
> Some side questions to help with perspective
> 1) How well does this align with what the C++ standard is doing for
> accelerator parallelism?
> 2) Do you have any benchmarks showing how much it costs to use this
> wrapper vs bare cuda
> 3) What sort of changes would exactly be needed inside clang/llvm to
> make it do what you need
> 4) How is this different from say Thrust, AMD's wrapper libs, Raja.. etc
> 5) Does it handle collapse, reductions and complex types?
> 6) On the CPU side does it just lower to pthreads?
>