[LLVMdev] GSoC 2012 Proposal: Automatic GPGPU code generation for llvm

Tobias Grosser tobias at grosser.es
Wed Apr 4 07:35:43 PDT 2012


On 04/04/2012 04:17 PM, Justin Holewinski wrote:
>
>
> On Wed, Apr 4, 2012 at 4:49 AM, Tobias Grosser <tobias at grosser.es
> <mailto:tobias at grosser.es>> wrote:
>
>     On 04/03/2012 03:13 PM, Hongbin Zheng wrote:
>      > Hi Yabin,
>      >
>      > Instead of compiling the LLVM IR to a PTX asm string in a
>      > ScopPass, you can also improve llc/lli or create new tools to
>      > support code generation for heterogeneous platforms[1], i.e.
>      > generate code for more than one target architecture at the same
>      > time. Something like this is not very complicated and has been
>      > implemented[2,3] by some people, but is not available in LLVM
>      > mainstream. Implementing this could make your GPU project more
>      > complete.
>
>     I agree with ether that we should ensure as much work as possible is
>     done within generic, not Polly specific code.
>
>
> Right, this has the potential to impact more people than just the users
> of Polly. By moving as much as possible to generic LLVM, that
> infrastructure can be leveraged by people doing work outside of the
> polyhedral model.

To make stuff generic it is often helpful to know the other possible use 
cases. I consequently encourage everybody to point out such use cases or 
to state which exact functionality they might want to reuse. Otherwise, 
we may end up focusing a little too much on the needs of Polly.

>     In terms of heterogeneous code generation the approach Yabin proposed
>     seems to work, but we should discuss other approaches. For the moment,
>     I believe his proposal is very similar to the model of OpenCL and
>     CUDA. He splits the code into host and kernel code. The host code
>     is directly compiled to machine code by the existing tools
>     (clang/llc). The kernel code is stored as a string and is only
>     compiled to platform-specific code at execution time.
>
>
> Depending on your target, that may be the only way.  If your target is
> OpenCL-compatible accelerators, then your only portable option is to save
> the kernel code as OpenCL text and let the driver JIT compile it at
> run-time.  Any other approach is not guaranteed to be compatible across
> platforms or even driver versions.
> Here, the target is the CUDA Driver API, so you're free to pass
> along any valid PTX assembly.  You still pass the PTX code
> as a string to the driver, which JIT compiles it to actual GPU device
> code at run-time.

I would like to highlight that with the word 'string' I was not 
referring to 'OpenCL C code'. I don't think it is a practical approach 
to recover OpenCL C code, especially as the LLVM-IR C backend was 
recently removed.

I meant to describe that the kernel code is stored as a global variable 
in the host binary (in some intermediate representation such as LLVM-IR, 
PTX or a vendor-specific OpenCL binary) and is loaded at execution time 
into the OpenCL or CUDA runtime, where it is compiled down to 
hardware-specific machine code.

>     Are there any other approaches that could be taken? What specific
>     heterogeneous platform support would be needed? At the moment, it seems
>     to me we actually do not need too much additional support.
>
>
> I could see this working without any additional support, if needed.  It
> seems like this proposal is dealing with LLVM IR -> LLVM IR code
> generation, so the only thing that is really needed is a way to split
> the IR into multiple separate IRs (one for host, and one for each
> accelerator target).  This does not really need any supporting
> infrastructure, as you could imagine an opt pass processing the input IR
> and transforming it to the host IR, and emitting the device IR as a
> separate module.

Yes. And instead of saving the two modules in separate files, we can 
store the kernel module as a 'string' in the host module and add the 
necessary library calls to load it at run time. This will give a smooth 
user experience and requires almost no additional infrastructure.
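As a toy illustration of the splitting step (operating on textual LLVM-IR rather than on llvm::Module objects), the host/device separation might look as follows. The convention that device functions are named `kernel_*` is an assumption made up for this sketch; a real pass would clone and prune the module using LLVM's APIs instead of scanning text:

```python
# Toy sketch: split one textual LLVM-IR module into host and device parts.
# The "kernel_*" naming convention is a made-up assumption for this sketch;
# a real implementation would operate on llvm::Module objects.
def split_module(ir, is_device=lambda name: name.startswith("kernel_")):
    host, device, current = [], [], None
    for line in ir.splitlines():
        if line.startswith("define "):
            # Crude extraction of the function name between '@' and '('.
            name = line.split("@", 1)[1].split("(", 1)[0]
            current = device if is_device(name) else host
        if current is None:
            host.append(line)    # module-level lines (triple, globals, ...)
            device.append(line)  # ...are duplicated into both modules
        else:
            current.append(line)
        if line.startswith("}"):
            current = None       # function body ended
    return "\n".join(host), "\n".join(device)

example = """target triple = "x86_64-unknown-linux-gnu"

define i32 @main() {
  ret i32 0
}

define void @kernel_add(float* %a) {
  ret void
}"""

host_ir, device_ir = split_module(example)
```

The device module would then be serialized and carried along by the host module, as described above.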

(At the moment this will only work with NVIDIA, but I am confident there 
will be OpenCL vendor extensions that allow loading LLVM-IR kernels. 
AMD's OpenCL implementation can, e.g., load LLVM-IR, even though this is 
not officially supported.)
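For the embedding itself, LLVM-IR string constants use the c"..." syntax, where any byte that is not plain printable ASCII is written as a backslash followed by two hex digits. A small helper that wraps a kernel's PTX (or serialized IR) text into such a global constant might look like the sketch below; the `@kernel_ptx` name and the NUL terminator are assumptions made for this illustration:

```python
def escape_ir_bytes(data):
    # LLVM-IR c"..." strings keep printable ASCII (except '"' and '\')
    # literal; every other byte is written as \XX with two hex digits.
    out = []
    for b in data:
        if 0x20 <= b <= 0x7E and b not in (0x22, 0x5C):
            out.append(chr(b))
        else:
            out.append("\\%02X" % b)
    return "".join(out)

def embed_kernel(text, name="kernel_ptx"):
    # The global name and NUL terminator are assumptions for this sketch;
    # a C-style runtime loader would read the string up to the NUL.
    data = text.encode() + b"\x00"
    return '@%s = private unnamed_addr constant [%d x i8] c"%s"' % (
        name, len(data), escape_ir_bytes(data))

ir_global = embed_kernel(".version 3.0\n.target sm_20\n")
print(ir_global)
```

The resulting global can be dropped into the host module, with a run-time library call passing its address to the CUDA or OpenCL runtime.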

> Now if you're talking about source-level support for heterogeneous
> platforms (e.g. C++ AMP), then you would need to adapt Clang to support
> emission of multiple IR modules.  Basically, the AST would need to be
> split into host and device portions, and codegen'd appropriately.  I
> feel that is far beyond the scope of this proposal, though.

Yes. No source-level transformations, and no targets other than PTX, 
AMDIL or LLVM-IR.

Cheers
Tobi


