[cfe-dev] RFC: Proposing an LLVM subproject for parallelism runtime and support libraries

C Bergström via cfe-dev cfe-dev at lists.llvm.org
Wed Mar 9 20:33:25 PST 2016


I think my comments are more general than a for/against position on this
sort of proposal - I hope they help start a broader discussion.

On Thu, Mar 10, 2016 at 7:50 AM, Jason Henline <jhen at google.com> wrote:
> Thanks for your interest. I think you bring up some very good questions.
>
> 1) How well does this align with what the C++ standard is doing for
> accelerator parallelism?
>
> I think that StreamExecutor will basically live independently from any
> accelerator-specific changes to the C++ standard. StreamExecutor only wraps
> the host-side code that launches kernels and has no real opinion about how
> those kernels are created. If C++ introduces annotations or other constructs
> to allow functions or blocks to be run on an accelerator, I would expect
> that C++ would then become another supported accelerator programming
> language, in the same way that CUDA and OpenCL are currently supported.
>
> 2) Do you have any benchmarks showing how much it costs to use this
> wrapper vs bare CUDA?
>
> I think the appropriate comparison here would be between StreamExecutor and
> the Nvidia CUDA runtime library. I don't have numbers for that comparison,
> but we do measure the time spent in StreamExecutor calls as a fraction of
> the total runtime of several of our real applications. In those
> measurements, we find that the StreamExecutor calls take up less than 1% of
> the total runtime, so we have been satisfied with that level of performance
> so far.
>
> 3) What sort of changes, exactly, would be needed inside clang/llvm to
> make it do what you need?
>
> The changes would all be inside of clang. Clang already supports compiling
> CUDA code by lowering to calls to the Nvidia CUDA runtime library. We would
> introduce a new option into clang for using the StreamExecutor library
> instead. There are some changes that need to be made to Sema because it is
> currently hardcoded to look for an Nvidia CUDA runtime library function with
> a specific name in order to determine which types are allowed as arguments
> in the CUDA triple angle bracket launch syntax. Then there would be changes
> to CodeGen to optionally lower CUDA kernel calls onto StreamExecutor library
> calls, whereas now they are lowered to Nvidia CUDA runtime library calls.
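For readers following along: as I understand the current CUDA path in
clang, Sema type-checks the triple angle bracket launch against that
specifically named Nvidia runtime configuration function, and CodeGen
lowers the launch to Nvidia CUDA runtime library calls. A minimal sketch
of the source syntax in question (illustrative only, not code from the
proposal):

    // A CUDA kernel plus the triple angle bracket launch discussed above.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
        c[i] = a[i] + b[i];
    }

    void launchVecAdd(const float *a, const float *b, float *c, int n) {
      int threads = 256;
      int blocks = (n + threads - 1) / threads;
      // Today this launch is lowered to Nvidia CUDA runtime calls; the
      // proposal would add an option to lower it to StreamExecutor calls.
      vecAdd<<<blocks, threads>>>(a, b, c, n);
    }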

Sorry - just trying to get implementation details here

So it's pure C++ syntax exposed to the user, but backed by your runtime.
Is there "CUDA" or OpenCL hidden in the headers, and is that where the
actual offload portion happens?

Is there anything stopping you from exposing "wrapper" interfaces which
are the same as the NVIDIA runtime's? To avoid overhead you could just
force-inline them.
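To make the force-inline idea concrete, here is a minimal sketch under
my own assumptions - the device allocation call below is a made-up
placeholder, not the real StreamExecutor API - of a wrapper that keeps a
familiar runtime-style signature while forwarding to another runtime:

    #include <cstddef>
    #include <cstdlib>

    enum WrapError { kWrapSuccess = 0, kWrapErrorMemoryAllocation = 2 };

    // Placeholder standing in for whatever device allocation StreamExecutor
    // actually exposes (host malloc is used only so this sketch compiles).
    inline void *PlaceholderDeviceAlloc(std::size_t size) { return std::malloc(size); }

    // Wrapper shaped like the familiar runtime entry point, force-inlined
    // so the extra layer disappears after optimization.
    __attribute__((always_inline))
    inline WrapError wrapMalloc(void **devPtr, std::size_t size) {
      *devPtr = PlaceholderDeviceAlloc(size);
      return *devPtr ? kWrapSuccess : kWrapErrorMemoryAllocation;
    }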

Where is the StreamExecutor runtime source now? Does StreamExecutor wrap
public or private CUDA/OpenCL runtimes?

/*
I have said this before, and I really get uncomfortable with the generic
term "CUDA" in clang until someone from NVIDIA (lawyers) puts something
in writing. CUDA is an NV trademark, and the clang/llvm project can't
claim to be "CUDA"; it needs to make that distinction. Informally this
is all friendly now, but I do hope it's officially clarified at some
point. Maybe it's as simple as saying "CUDA compatible" - I don't know..
*/

>
> 4) How is this different from say Thrust, AMD's wrapper libs, Raja.. etc
>
> My understanding is that Thrust only supports STL operations, whereas
> StreamExecutor will support general user-defined kernels.
>
> I'm not personally familiar with AMD's wrapper libs or Raja. If you have
> links to point me in the right direction, I would be happy to comment on any
> similarities or differences.
>
> 5) Does it handle collapse, reductions and complex types?
>
> I don't think I fully understand the question here. If you mean reductions
> in the sense of the generic programming operation, there is no direct
> support. A user would currently have to write their own kernel for that, but
> StreamExecutor does support some common "canned" operations and that set of
> operations could be extended to include reductions.
>
> For complex types, the support will depend on what the kernel language (such
> as CUDA or OpenCL) supports. StreamExecutor will basically just treat the
> data as bytes and shuttle them to and from the accelerator as needed.

I think having a nice model that lowers cleanly (with high performance)
to at least some targets is, or should be, very important. From my
experience: if you have complex or perfectly nested loops, how would you
take that sort of algorithm and map it to StreamExecutor? Getting
reductions right or wrong can also have a performance impact. If your
goal is to create a "one wrapper rules them all" approach, I'm hoping
you can find a common way to also make it easier for basic needs to be
expressed to the underlying target, in a target-agnostic way.
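
To illustrate what the "write your own kernel" answer above means for a
nested loop with a reduction, here is a sketch of the kernel a user
would have to write by hand today (only the host-side launch and data
movement would go through StreamExecutor):

    // Row sums of a rows x cols matrix: the outer loop becomes the thread
    // index, the inner loop (the reduction) stays serial inside each thread.
    // Choosing this mapping is left entirely to the user.
    __global__ void rowSums(const float *m, float *sums, int rows, int cols) {
      int r = blockIdx.x * blockDim.x + threadIdx.x;
      if (r < rows) {
        float acc = 0.0f;
        for (int c = 0; c < cols; ++c)
          acc += m[r * cols + c];
        sums[r] = acc;
      }
    }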
------------
Hal's question

Unified memory - I can't see this solving much of anything. Most of the
roadmaps I have seen will introduce high-bandwidth memory at some point
in the near future, and that memory isn't unified if you want the best
performance. So your latencies will admittedly change (hopefully for the
better), but to really program with performance in mind, there are still
going to be multiple layers of memory which should be considered for
data movement.

In early-generation shared-memory GPU systems (please take the following
statement with a grain of salt), I have measured, first hand, 10-15%
performance differences when the part was in shared-memory vs "discrete"
mode.

However, I do think it's very important to be able to express some level
of locality, or to write code in a way which reduces the number of data
dependencies the compiler can't resolve. Whatever the exposed or
underlying execution model (religion) is - threads/streams/tasks, etc. -
if the compiler can resolve the data dependencies, it gains a level of
independence in execution that lets it decide what is best.

Basically, the loop or parallel code either is "independently"
executable or has some dependency which needs to be resolved.
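
A trivial pair of loops shows the distinction I have in mind:

    void independent(float *a, const float *b, int n) {
      // No cross-iteration dependency (assuming a and b don't alias, which
      // is exactly the kind of fact the compiler may not be able to prove),
      // so the iterations are free to run in any order or in parallel.
      for (int i = 0; i < n; ++i)
        a[i] = 2.0f * b[i];
    }

    void carried(float *a, int n) {
      // Loop-carried dependency: each iteration reads the previous result,
      // so this cannot be blindly parallelized or offloaded.
      for (int i = 1; i < n; ++i)
        a[i] += a[i - 1];
    }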

Microsoft did a really nice job of documenting C++ AMP. Does Google have
a set of example codes which show how StreamExecutor can be used to
implement various algorithms?
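
For reference, the kind of small self-contained example I mean - this
one is C++ AMP, scaling a vector on the accelerator, written from memory
of the public interface, so treat the details as approximate:

    #include <amp.h>
    #include <vector>

    void scale(std::vector<float> &v) {
      concurrency::array_view<float, 1> av(static_cast<int>(v.size()), v);
      concurrency::parallel_for_each(av.extent,
          [=](concurrency::index<1> i) restrict(amp) { av[i] *= 2.0f; });
      av.synchronize();  // copy the results back into the host vector
    }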
------------
Personally, I'd be very supportive of this if it gets market adoption.
There have been a lot of pop-star or one-hit-wonder projects which have
existed and died in this area over the past few years. Does clang/llvm
accept anything, or is there some metric for deciding what should get a
sub-project and what is just too early? /* In Apache land this would be
called an "incubator" project before being formally accepted */

/* From my skewed perspective - I really hope we (the general market)
don't end up with 10 different flavors of StreamExecutor, each lowering
down to proprietary or pseudo "open standards" and all competing. Having
buy-in from Intel/NVIDIA/AMD/ARM would really help make this a success.
Does Google have a plan to engage and bring other stakeholders into
supporting this? */

I hope all my questions are viewed as positive and constructive.


