[cfe-dev] RFC: Proposing an LLVM subproject for parallelism runtime and support libraries

Wed Mar 9 22:37:49 PST 2016

----- Original Message -----
> From: "C Bergström via cfe-dev" <cfe-dev at lists.llvm.org>
> To: "Jason Henline" <jhen at google.com>
> Cc: "clang developer list" <cfe-dev at lists.llvm.org>
> Sent: Wednesday, March 9, 2016 10:33:25 PM
> Subject: Re: [cfe-dev] RFC: Proposing an LLVM subproject for parallelism runtime and support libraries
> 
> I think my comments are more generic than for/against this sort of
> proposal - I hope they help start a discussion in general.
> 
> On Thu, Mar 10, 2016 at 7:50 AM, Jason Henline <jhen at google.com>
> wrote:
> > Thanks for your interest. I think you bring up some very good
> > questions.
> >
> > 1) How well does this align with what the C++ standard is doing for
> > accelerator parallelism?
> >
> > I think that StreamExecutor will basically live independently from any
> > accelerator-specific changes to the C++ standard. StreamExecutor only
> > wraps the host-side code that launches kernels and has no real opinion
> > about how those kernels are created. If C++ introduces annotations or
> > other constructs to allow functions or blocks to be run on an
> > accelerator, I would expect that C++ would then become another
> > supported accelerator programming language, in the same way that CUDA
> > and OpenCL are currently supported.
> >
> > 2) Do you have any benchmarks showing how much it costs to use this
> > wrapper vs. bare CUDA?
> >
> > I think the appropriate comparison here would be between StreamExecutor
> > and the Nvidia CUDA runtime library. I don't have numbers for that
> > comparison, but we do measure the time spent in StreamExecutor calls as
> > a fraction of the total runtime of several of our real applications. In
> > those measurements, we find that the StreamExecutor calls take up less
> > than 1% of the total runtime, so we have been satisfied with that level
> > of performance so far.
> >
> > 3) What sort of changes exactly would be needed inside clang/llvm to
> > make it do what you need?
> >
> > The changes would all be inside of clang. Clang already supports
> > compiling CUDA code by lowering to calls to the Nvidia CUDA runtime
> > library. We would introduce a new option into clang for using the
> > StreamExecutor library instead. There are some changes that need to be
> > made to Sema because it is currently hardcoded to look for an Nvidia
> > CUDA runtime library function with a specific name in order to
> > determine which types are allowed as arguments in the CUDA triple angle
> > bracket launch syntax. Then there would be changes to CodeGen to
> > optionally lower CUDA kernel calls onto StreamExecutor library calls,
> > whereas now they are lowered to Nvidia CUDA runtime library calls.
> 
> Sorry - just trying to get implementation details here
> 
> So it's pure C++ syntax exposed to the user, but your runtime. Is
> there "CUDA" or OpenCL hidden in the headers, and is that where the
> actual offload portion is happening?
> 
> Is there anything stopping you from exposing "wrapper" interfaces
> which are the same as the NVIDIA runtime? To avoid overhead you can
> just force inline them.
> 
> Where is the StreamExecutor runtime source now? Does StreamExecutor
> wrap public or private CUDA/OpenCL runtimes?
> 
> /*
> I have said this before, and I get really uncomfortable with the
> generic term "CUDA" in clang until someone from NVIDIA (lawyers) puts
> something in writing. CUDA is an NV trademark, and the clang/llvm
> project can't claim to be "CUDA" and needs to make a distinction.
> Informally this is all friendly now, but I do hope it's officially
> clarified at some point. Maybe it's as simple as saying "CUDA
> compatible" - I don't know.
> */
> 
> >
> > 4) How is this different from, say, Thrust, AMD's wrapper libs,
> > Raja, etc.?
> >
> > My understanding is that Thrust only supports STL operations, whereas
> > StreamExecutor will support general user-defined kernels.
> >
> > I'm not personally familiar with AMD's wrapper libs or Raja. If you
> > have links to point me in the right direction, I would be happy to
> > comment on any similarities or differences.
> >
> > 5) Does it handle collapse, reductions and complex types?
> >
> > I don't think I fully understand the question here. If you mean
> > reductions in the sense of the generic programming operation, there is
> > no direct support. A user would currently have to write their own
> > kernel for that, but StreamExecutor does support some common "canned"
> > operations and that set of operations could be extended to include
> > reductions.
> >
> > For complex types, the support will depend on what the kernel language
> > (such as CUDA or OpenCL) supports. StreamExecutor will basically just
> > treat the data as bytes and shuttle them to and from the accelerator as
> > needed.
> 
> I think having a nice model that lowers cleanly (high performance) to
> at least some targets is (should be) very important. From my
> experience - if you have complex or perfectly nested loops - how would
> you take this sort of algorithm and map it to StreamExecutor? Getting
> reductions right or wrong can also have a performance impact - If your
> goal is to create a "one wrapper rules them all" approach - I'm hoping
> you can find a common way to also make it easier for basic needs to be
> expressed to the underlying target. (In a target agnostic way)
> ------------
> Hal's question
> 
> Unified memory - I can't see this solving much of anything. Most of
> the roadmaps I have seen will introduce high bandwidth memory, which
> isn't unified if you want best performance, at some point in the near
> future. So your latencies will admittedly change (hopefully for the
> better), but to really program with performance in mind - there's
> still going to be multiple layers of memory which should be considered
> for data movement.

While I'm going to withhold judgment until the relevant future hardware arrives, I'm inclined to agree with you. Using unified memory, in raw form, will probably not give you the best performance. That is, however, perhaps not the point. Many applications have complex configuration data structures that need to be shared between host and devices, and unified memory can map those transparently. For the remaining data, for which the transfers are performance sensitive, you'll want to explicitly manage the transfers (or at least hint to the driver to transfer the data ahead of time). The overall result, however, should be significantly simpler code.

 -Hal

> 
> In early-generation shared-memory GPU systems (please take the
> following statement with a grain of salt) - I have (first hand)
> measured 10-15% performance differences when the part was in shared
> memory vs "discrete" mode.
> 
> However, I do think it's very important to be able to express some
> level of locality or write code in a way which reduces the number of
> data dependencies the compiler can't resolve. Whether the exposed or
> underlying execution model (religion) is threads/streams/tasks, etc.,
> if the compiler can resolve data dependencies, it gives a level of
> independence for execution which the compiler can decide best.
> 
> Basically - the loop or parallel code is "independently" executable or
> has some dependency which needs to be resolved.
> 
> Microsoft did a really nice job of documenting C++ AMP - does Google
> have a bunch of example codes which show how StreamExecutor can be
> used to implement various algorithms?
> ------------
> Personally, I'd be very supportive of this if it gets market
> adoption. There are a lot of popstars or one-hit wonders which have
> existed and died in this area over the past few years. Does clang/llvm
> accept anything, or is there some metric for generally deciding what
> should get a sub-project and what is just too early? /* In Apache land
> this would be called an "incubator" project before being formally
> accepted */
> 
> /* From my skewed perspective - I really hope we (general market)
> don't end up with 10 different flavors of StreamExecutor lowering down
> to proprietary or pseudo "open standards" and all competing. Having
> buy-in from Intel/NVIDIA/AMD/ARM would really help make this a
> success. Does Google have a plan to engage and bring other
> stakeholders into supporting this? */
> 
> I hope all my questions are viewed as positive and meant to be
> constructive.

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory