<div dir="ltr">Thanks for your interest. I think you bring up some very good questions.<div dir="ltr"><div><br></div><div>1) How well does this align with what the C++ standard is doing for<br>accelerator parallelism?<br></div><div><br></div></div><div dir="ltr"><div>I think that StreamExecutor will basically live independently of any accelerator-specific changes to the C++ standard. StreamExecutor only wraps the host-side code that launches kernels and has no real opinion about how those kernels are created. If C++ introduces annotations or other constructs to allow functions or blocks to be run on an accelerator, I would expect C++ to become another supported accelerator programming language, in the same way that CUDA and OpenCL are currently supported.<br></div></div><div dir="ltr"><div><br></div><div><div>2) Do you have any benchmarks showing how much it costs to use this</div><div>wrapper vs bare cuda</div></div><div><br></div></div><div dir="ltr"><div>I think the appropriate comparison here would be between StreamExecutor and the Nvidia CUDA runtime library. I don't have numbers for that comparison, but we do measure the time spent in StreamExecutor calls as a fraction of the total runtime of several of our real applications. In those measurements, we find that the StreamExecutor calls take up less than 1% of the total runtime, so we have been satisfied with that level of performance so far.<br></div></div><div dir="ltr"><div><br></div><div><div>3) What sort of changes would exactly be needed inside clang/llvm to</div><div>make it do what you need</div></div><div><br></div></div><div dir="ltr"><div><div>The changes would all be inside clang. Clang already supports compiling CUDA code by lowering to calls to the Nvidia CUDA runtime library. We would add a new clang option to use the StreamExecutor library instead. 
Sema needs some changes because it is currently hardcoded to look for an Nvidia CUDA runtime library function with a specific name in order to determine which types are allowed as arguments in the CUDA triple angle bracket launch syntax. CodeGen would then need changes to optionally lower CUDA kernel calls onto StreamExecutor library calls, whereas now they are lowered to Nvidia CUDA runtime library calls.</div></div></div><div dir="ltr"><div><br></div><div>4) How is this different from say Thrust, AMD's wrapper libs, Raja.. etc<br></div><div><br></div></div><div dir="ltr"><div><div>My understanding is that Thrust only supports STL operations, whereas StreamExecutor will support general user-defined kernels.</div><div><br></div><div>I'm not personally familiar with AMD's wrapper libs or Raja. If you have links to point me in the right direction, I would be happy to comment on any similarities or differences.</div></div></div><div dir="ltr"><div><br></div><div>5) Does it handle collapse, reductions and complex types?<br></div><div><br></div></div><div dir="ltr"><div><div>I don't think I fully understand the question here. If you mean reductions in the sense of the generic programming operation, there is no direct support. A user would currently have to write their own kernel for that, but StreamExecutor does support some common "canned" operations, and that set of operations could be extended to include reductions.</div><div><br></div><div>For complex types, the support will depend on what the kernel language (such as CUDA or OpenCL) supports. StreamExecutor will basically just treat the data as bytes and shuttle them to and from the accelerator as needed.</div></div></div><div dir="ltr"><div><br></div><div>6) On the CPU side does it just lower to pthreads?<br></div><div><br></div><div>Yes, it is just pthreads under the hood. The host executor is not very clever at this point. 
It was mostly developed as a way for us to keep in mind how the interface would need to look for different platforms.<br></div></div><br><div class="gmail_quote"><div dir="ltr">On Wed, Mar 9, 2016 at 3:58 PM C Bergström &lt;<a href="mailto:cbergstrom@pathscale.com" target="_blank">cbergstrom@pathscale.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Thu, Mar 10, 2016 at 4:30 AM, Jason Henline via cfe-dev<br>

&lt;<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>&gt; wrote:<br>

> At Google we're doing a lot of work on parallel programming models for CPUs,<br>

> GPUs and other platforms. One place where we're investing a lot are parallel<br>

> libraries, especially those closely tied to compiler technology like runtime<br>

> and math libraries. We would like to develop these in the open, and the<br>

> natural place seems to be as a subproject in LLVM if others in the community<br>

> are interested.<br>

><br>

> Initially, we'd like to open source our StreamExecutor runtime library,<br>

> which is used for simplifying the management of data-parallel workflows on<br>

> accelerator devices and can also be extended to support other hardware<br>

> platforms. We'd like to teach Clang to use StreamExecutor when targeting<br>

> CUDA and work on other integrations, but that makes much more sense if it is<br>

> part of the LLVM project.<br>

<br>

Sounds like a neat project!<br>

<br>

Some side questions to help with perspective<br>

1) How well does this align with what the C++ standard is doing for<br>

accelerator parallelism?<br>

2) Do you have any benchmarks showing how much it costs to use this<br>

wrapper vs bare cuda<br>

3) What sort of changes would exactly be needed inside clang/llvm to<br>

make it do what you need<br>

4) How is this different from say Thrust, AMD's wrapper libs, Raja.. etc<br>

5) Does it handle collapse, reductions and complex types?<br>

6) On the CPU side does it just lower to pthreads?<br>

</blockquote></div></div>