In our project we combine regular binary code and LLVM IR code for
kernels, embedded as a special data symbol in the ELF object. The LLVM
IR for a kernel that exists at compile time is preliminary and may be
optimized further at runtime (pointer analysis, Polly, etc.). During
application startup, the runtime system builds an index of all kernel
sources embedded in the executable. Host and kernel code interact by
means of a special "launch" call, which does not simply optimize,
compile, and execute the kernel, but first estimates whether doing so is
worthwhile or whether it is better to fall back to the equivalent host
code.
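
To give a rough idea, a minimal sketch of the layout is below; the symbol,
section and launch entry point names are made up for illustration and are
not our actual interface:

------------------------------------------------------------------------
; Host module (x86_64). The kernel's LLVM IR travels as plain data; the
; runtime locates it when it indexes the executable at startup.
target triple = "x86_64-unknown-linux-gnu"

@kernel_ir = private constant [17 x i8] c"<kernel LLVM IR>\00", section ".kernel_ir"

; Runtime entry point: estimates whether optimizing/compiling the embedded
; IR for the device is worthwhile; otherwise it runs the host equivalent.
declare i32 @launch(i8*, i8*)

define void @host_code(i8* %args) {
  %ir = getelementptr inbounds [17 x i8]* @kernel_ir, i64 0, i64 0
  %r  = call i32 @launch(i8* %ir, i8* %args)
  ret void
}
------------------------------------------------------------------------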

Tobias's proposal is very elegant, but it seems to address the case where
host and sub-architecture code exist at the same time. May I kindly point
out that, in our experience, the really efficient, deeply specialized
sub-architecture code may simply not exist at compile time, while the
generic baseline host code always can.

Best,
- Dima.

2012/7/26 Duncan Sands <baldrick@free.fr>:
Hi Tobias, I didn't really get it. Is the idea that the same bitcode is
going to be codegen'd for different architectures, or is each sub-module
going to contain different bitcode? In the latter case you may as well
just use multiple modules, perhaps in conjunction with a scheme to store
more than one module in the same file on disk as a convenience.

Ciao, Duncan.

> a couple of weeks ago I discussed with Peter how to improve LLVM's
> support for heterogeneous computing. One weakness we (and others) have
> seen is the absence of multi-module support in LLVM. Peter came up with
> a nice idea of how to improve things here. I would like to put this idea
> up for discussion.
>
> ## The problem ##
>
> LLVM-IR modules can currently only contain code for a single target
> architecture. However, there are multiple use cases where one
> translation unit could contain code for several architectures.
>
> 1) CUDA
>
> CUDA source files can contain both host and device code. The absence of
> multi-module support complicates adding CUDA support to clang, as clang
> would need to perform multi-module compilation on top of a
> single-module-based compiler framework.
>
> 2) C++ AMP
>
> C++ AMP [1] contains - similarly to CUDA - both host code and device
> code in the same source file. Even though C++ AMP is a Microsoft
> extension, the use case itself is relevant to clang. It would be great
> if LLVM provided infrastructure such that front-ends could easily
> target accelerators. This would probably yield a lot of interesting
> experiments.
>
> 3) Optimizers
>
> To fully automatically offload computations to an accelerator, an
> optimization pass needs to extract the computation kernels and schedule
> them as separate kernels on the device. Such kernels are normally
> LLVM-IR modules for different architectures. At the moment, passes have
> no way to create and store new LLVM-IR modules. There is also no way
> to reference kernel LLVM-IR modules from a host module (which is
> necessary to pass them to the accelerator run-time).
>
> ## Goals ##
>
> a) No major changes to existing tools and LLVM-based applications
>
> b) Human-readable and writable LLVM-IR
>
> c) FileCheck testability
>
> d) Do not force a specific execution model
>
> e) Unlimited number of embedded modules
>
> ## Detailed Goals ##
>
> a)
>  o No changes should be required if a tool does not use multi-module
>    support. Every LLVM-IR file that is valid today should remain valid.
>
>  o Major tools should support basic heterogeneous modules without large
>    changes. Some of the commands that should work after small
>    adaptations:
>
>      clang -S -emit-llvm -o out.ll
>      opt -O3 out.ll -o out.opt.ll
>      llc out.opt.ll
>      lli out.opt.ll
>      bugpoint -O3 out.opt.ll
>
> b) All (sub)modules should be directly human-readable/writable.
>    There should be no need to extract single modules before modifying
>    them.
>
> c) The LLVM-IR generated from a heterogeneous multi-module should
>    easily be 'FileCheck'able. The same is true if a multi-module is
>    the result of an optimization.
>
> d) In CUDA/OpenCL/C++ AMP, kernels are scheduled from within the host
>    code. This means arbitrary host code can decide under which
>    conditions kernels are scheduled for execution. It is therefore
>    necessary to reference individual sub-modules from within the host
>    module.
>
> e) CUDA and OpenCL allow an arbitrary number of kernels to be compiled
>    and scheduled. We do not want to put an artificial limit on the
>    number of modules in which they are represented. This means a single
>    embedded submodule is not enough.
>
> ## Non Goals ##
>
> o Modeling sub-architectures on a per-function basis
>
> Functions could be specialized for a certain sub-architecture. This is
> helpful for having certain functions optimized, e.g. with AVX2 enabled,
> while the overall program is compiled for a more generic architecture.
> We do not address per-function annotations in this proposal.
>
> ## Proposed solution ##
>
> To bring multi-module support to LLVM, we propose to add a new type
> called 'llvmir' to LLVM-IR. It can be used to embed LLVM-IR submodules
> as global variables.
>
> ------------------------------------------------------------------------
> target datalayout = ...
> target triple = "x86_64-unknown-linux-gnu"
>
> @llvm_kernel = private unnamed_addr constant llvm_kernel {
>   target triple = "nvptx64-unknown-unknown"
>   define internal ptx_kernel void @gpu_kernel(i8* %Array) {
>     ...
>   }
> }
> ------------------------------------------------------------------------
>
> By default the global will be compiled to an LLVM-IR string stored in
> the object file. We could also think about translating it to PTX or
> AMD's HSA-IL, such that e.g. PTX can be passed to a run-time library.
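>
> For goal (d), host code could then reference the embedded submodule
> roughly as in the sketch below; the run-time interface and the cast of
> the embedded-module global to i8* are made up for illustration and are
> not part of this proposal:
>
> ------------------------------------------------------------------------
> ; Hypothetical accelerator run-time entry point: takes the embedded
> ; module (lowered to a string in the object file), the kernel name and
> ; the kernel arguments.
> declare void @runtime_launch(i8*, i8*, i8*)
>
> @kernel_name = private unnamed_addr constant [11 x i8] c"gpu_kernel\00"
>
> define void @host(i8* %Array) {
>   %m = bitcast llvm_kernel* @llvm_kernel to i8*
>   %n = getelementptr inbounds [11 x i8]* @kernel_name, i64 0, i64 0
>   call void @runtime_launch(i8* %m, i8* %n, i8* %Array)
>   ret void
> }
> ------------------------------------------------------------------------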
>
> From my point of view, Peter's idea allows us to add multi-module
> support in a way that reaches the goals described above. However, to
> properly design and implement it, early feedback would be valuable.
>
> Cheers
> Tobi
>
> [1] http://msdn.microsoft.com/en-us/library/hh265137%28v=vs.110%29
> [2] http://www.amd.com/us/press-releases/Pages/amd-arm-computing-innovation-2012june12.aspx