[cfe-dev] [PROPOSAL] LLVM multi-module support

Wed Jul 25 23:35:29 PDT 2012

Hi,

A couple of weeks ago I discussed with Peter how to improve LLVM's 
support for heterogeneous computing. One weakness we (and others) have 
seen is the absence of multi-module support in LLVM. Peter came up with 
a nice idea for addressing this, which I would like to put up for 
discussion.

## The problem ##

LLVM-IR modules can currently only contain code for a single target 
architecture. However, there are multiple use cases where one 
translation unit could contain code for several architectures.

1) CUDA

CUDA source files can contain both host and device code. The absence of 
multi-module support complicates adding CUDA support to clang, as clang 
would need to perform multi-module compilation on top of a single-module 
based compiler framework.

2) C++ AMP

C++ AMP [1] contains - similarly to CUDA - both host code and device 
code in the same source file. Even though C++ AMP is a Microsoft 
extension, the use case itself is relevant to clang. It would be great 
if LLVM provided infrastructure such that front-ends could easily target 
accelerators. This would probably enable a lot of interesting experiments.

3) Optimizers

To offload computations to an accelerator fully automatically, an 
optimization pass needs to extract the computation kernels and schedule
them separately on the device. Such kernels are normally 
LLVM-IR modules for different architectures. At the moment, passes have 
no way to create and store new LLVM-IR modules. There is also no way
to reference kernel LLVM-IR modules from a host module (which is 
necessary to pass them to the accelerator run-time).

## Goals ##

a) No major changes to existing tools and LLVM based applications

b) Human readable and writable LLVM-IR

c) FileCheck testability

d) Do not force a specific execution model

e) Unlimited number of embedded modules

## Detailed Goals ##

a)
  o No changes should be required if a tool does not use multi-module
    support. Each LLVM-IR file that is valid today should remain valid.

  o Major tools should support basic heterogeneous modules without large
    changes. Some of the commands that should work after smaller
    adaptations:

    clang -S -emit-llvm -o out.ll
    opt -O3 out.ll -o out.opt.ll
    llc out.opt.ll
    lli out.opt.ll
    bugpoint -O3 out.opt.ll

b) All (sub)modules should be directly human readable/writable.
    There should be no need to extract single modules before modifying
    them.

c) The LLVM-IR generated from a heterogeneous multi-module should
    be easy to check with FileCheck. The same is true if a multi-module
    is the result of an optimization.

d) In CUDA/OpenCL/C++ AMP, kernels are scheduled from within the host
    code. This means arbitrary host code can decide under which
    conditions kernels are scheduled for execution. It is therefore
    necessary to reference individual sub-modules from within the host
    module.

e) CUDA and OpenCL allow an arbitrary number of kernels to be compiled
    and scheduled. We do not want to put an artificial limit on the
    number of modules in which they are represented. This means a single
    embedded submodule is not enough.
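
Goals d) and e) can be pictured with a small toy model. This is only a 
sketch in Python, not LLVM's API: the `Module` class, the 
`outline_kernel` helper and the `@kernel.N` naming scheme are all 
hypothetical and exist only for illustration. It shows a host module 
that holds any number of embedded submodules, each with its own target 
triple and each reachable from the host through the name of the global 
that holds it.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Module:
    """Toy stand-in for an LLVM-IR module (hypothetical, not LLVM's real API)."""
    triple: str
    functions: List[str] = field(default_factory=list)
    # Embedded submodules, keyed by the name of the global that holds them.
    submodules: Dict[str, "Module"] = field(default_factory=dict)


def outline_kernel(host: Module, function: str, device_triple: str) -> str:
    """Move `function` from `host` into a new embedded submodule.

    Returns the name of the global through which host code can refer to
    the new submodule (goal d). The number of submodules is not bounded
    (goal e).
    """
    host.functions.remove(function)
    global_name = "@kernel.%d" % len(host.submodules)
    host.submodules[global_name] = Module(device_triple, [function])
    return global_name
```

A pass that offloads two functions would call `outline_kernel` twice and 
get back two distinct global names, which the host code could then hand 
to an accelerator run-time.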

## Non Goals ##

o Modeling sub-architectures on a per-function basis

Functions could be specialized for a certain sub-architecture. This is 
helpful, e.g., to have certain functions optimized with AVX2 enabled 
while the rest of the program is compiled for a more generic architecture.
We do not address per-function annotations in this proposal.

## Proposed solution ##

To bring multi-module support to LLVM, we propose to add a new type 
called 'llvmir' to LLVM-IR. It can be used to embed LLVM-IR submodules
as global variables.

------------------------------------------------------------------------
target datalayout = ...
target triple = "x86_64-unknown-linux-gnu"

@llvm_kernel = private unnamed_addr constant llvmir {
   target triple = "nvptx64-unknown-unknown"
   define internal ptx_kernel void @gpu_kernel(i8* %Array) {
     ...
   }
}
------------------------------------------------------------------------

By default the global will be compiled to an LLVM-IR string stored in 
the object file. We could also think about translating it to PTX or 
AMD's HSA-IL, such that e.g. PTX can be passed to a run-time library.
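
To make the default lowering more concrete, here is a rough sketch in 
Python of what "compile the global to a string" could mean at the 
textual level. It is not part of the proposal and makes several 
assumptions: the embedded global is spelled `constant llvmir { ... }` 
(following the type name proposed above), the submodule is stored as 
its own IR text with a trailing NUL byte, and braces are matched 
naively without looking at string literals or comments. The helper 
names are hypothetical. A real implementation would live in the LLVM 
parser and code generator.

```python
import re

# Assumed spelling of an embedded submodule: "@name = <attrs> constant llvmir {"
EMBEDDED = re.compile(r"(@[\w.]+\s*=[^\n{]*?\bconstant\s+)llvmir\s*\{")


def find_matching_brace(text, open_pos):
    """Index of the '}' closing the '{' at open_pos (naive brace counting)."""
    depth = 0
    for i in range(open_pos, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced braces in embedded module")


def escape_llvm_string(data):
    """Escape bytes for an LLVM c"..." literal: \\XX for quotes, backslashes, non-printables."""
    out = []
    for b in data:
        ch = chr(b)
        if 0x20 <= b < 0x7F and ch not in '"\\':
            out.append(ch)
        else:
            out.append("\\%02X" % b)
    return "".join(out)


def lower_embedded_modules(text):
    """Rewrite every embedded submodule into a plain [N x i8] string constant."""
    result = []
    pos = 0
    while True:
        m = EMBEDDED.search(text, pos)
        if m is None:
            result.append(text[pos:])  # no (more) submodules: copy the rest
            break
        open_pos = m.end() - 1
        close_pos = find_matching_brace(text, open_pos)
        body = text[open_pos + 1:close_pos].strip("\n") + "\n"
        data = body.encode("utf-8") + b"\x00"  # NUL-terminated submodule text
        result.append(text[pos:m.start()])
        result.append('%s[%d x i8] c"%s"'
                      % (m.group(1), len(data), escape_llvm_string(data)))
        pos = close_pos + 1
    return "".join(result)


SAMPLE = '''target triple = "x86_64-unknown-linux-gnu"

@llvm_kernel = private unnamed_addr constant llvmir {
  target triple = "nvptx64-unknown-unknown"
  define void @gpu_kernel() {
    ret void
  }
}
'''

if __name__ == "__main__":
    print(lower_embedded_modules(SAMPLE))
```

A file without any embedded submodule passes through unchanged, which 
matches goal a). Host code would then see `@llvm_kernel` as an ordinary 
byte array that it can hand to a run-time library.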

From my point of view, Peter's idea allows us to add multi-module 
support in a way that allows us to reach the goals described above. 
However, to properly design and implement it, early feedback would be 
valuable.

Cheers
Tobi

[1] http://msdn.microsoft.com/en-us/library/hh265137%28v=vs.110%29
[2] http://www.amd.com/us/press-releases/Pages/amd-arm-computing-innovation-2012june12.aspx