[LLVMdev] GSoC 2012 Proposal: Automatic GPGPU code generation for llvm

Yabin Hu yabin.hwu at gmail.com
Tue Apr 3 19:59:32 PDT 2012


Hi Tobi,

I have revised the proposal. Could you review it and give comments again?
Thanks.

*Abstract*
Developing a GPGPU application is very often a time-consuming, complex,
error-prone and iterative process. In this project, I propose to build an
automatic GPGPU code generation framework for LLVM, based on two successful
LLVM (sub-)projects - Polly and the PTX backend. This can be very useful to
ease the burden of the long learning curve of the various GPU programming
models.

*Motivation*
With the broad proliferation of GPU computing, it is very important to
provide an easy and automatic tool that lets ordinary developers,
especially domain experts who want to harness the huge computing power of
the GPU, develop or port their applications. Polly has already implemented
many transformations, such as tiling, auto-vectorization and OpenMP code
generation, and GPGPU code generation has been planned in [1]. With the
help of LLVM's PTX backend, I plan to extend Polly with the feature of
GPGPU code generation.

*Project Detail*
There are several successful projects on source-to-source automatic GPU
code transformation. In this project, I will follow the method proposed by
Muthu Manikandan Baskaran et al. in [2]. Since automatic GPGPU code
generation is quite a complex problem, we specifically target two kinds of
test cases. The first kind is comprised of pure parallel loops, like the
following code:

parfor(int i=0 to M)
  parfor(int j=0 to N)
    LoopBody(i, j);
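
For illustration, here is a hand-written CUDA sketch of the kind of kernel
we would like to generate for this case; each (i, j) iteration becomes one
GPU thread (the kernel name and parameters are placeholders, not the final
design):

/* Illustration only: a CUDA kernel corresponding to the pure
   parallel case above.  One GPU thread per (i, j) iteration. */
__global__ void parfor_kernel(int M, int N)
{
  int i = blockIdx.y * blockDim.y + threadIdx.y;
  int j = blockIdx.x * blockDim.x + threadIdx.x;

  if (i < M && j < N) {
    /* LoopBody(i, j) goes here. */
  }
}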

The second kind is a loop nest in which all loops are parallel except the
inner-most one, like this:

parfor(int i=0 to M)
  parfor(int j=0 to N)
    non-parfor(int k=0 to K)
      LoopBody(i, j, k);
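
This second case would map to a kernel in which each thread executes the
non-parallel inner-most loop sequentially; again a hand-written sketch of
the intended output, with placeholder names:

/* Illustration only: the same thread mapping, but each thread runs
   the non-parallel innermost loop sequentially. */
__global__ void parfor_seq_kernel(int M, int N, int K)
{
  int i = blockIdx.y * blockDim.y + threadIdx.y;
  int j = blockIdx.x * blockDim.x + threadIdx.x;

  if (i < M && j < N) {
    for (int k = 0; k < K; k++) {
      /* LoopBody(i, j, k) goes here. */
    }
  }
}

The classical matrix multiplication in the timeline below is an instance
of this pattern.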

The LoopBody part is limited to instructions and (intrinsic) function
calls which can be handled by LLVM's PTX backend.

The work flow of our code generator is as follows. We first use Polly's
jscop file importer to obtain the desired 4-level parallel tiled code. We
then extract the loop body (or the inner non-parallel loops) into an LLVM
function, tagging it with the PTX_Kernel or PTX_Device calling convention.
Next, we use the PTX backend to translate the PTX_Kernel and PTX_Device
functions into strings of the corresponding PTX code. After that, we
transform the non-translatable part of the LLVM IR, inserting GPU runtime
library calls. The GPU execution configuration is acquired from external
user-specified jscop files, whose import Polly has already implemented.
Finally, we provide a runtime library so that we can either generate an
executable program or run the optimized LLVM IR with a JIT compiler like
lli.
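
To make the runtime part concrete, here is a rough sketch of the calls
such a runtime library could issue, written against the CUDA driver API.
The kernel name "polly_gpu_kernel" and all parameters are placeholders I
made up for this sketch, not the final interface:

#include <cuda.h>

/* A minimal sketch of the runtime calls the generated host code
   could issue around one translated kernel. */
void launch_ptx(const char *ptx, void **kernel_args,
                int grid_x, int grid_y, int block_x, int block_y)
{
  CUdevice   dev;
  CUcontext  ctx;
  CUmodule   mod;
  CUfunction fn;

  cuInit(0);
  cuDeviceGet(&dev, 0);
  cuCtxCreate(&ctx, 0, dev);

  /* Load the PTX string emitted by LLVM's PTX backend and look up
     the function that was tagged with the PTX_Kernel convention. */
  cuModuleLoadData(&mod, ptx);
  cuModuleGetFunction(&fn, mod, "polly_gpu_kernel");

  /* The execution configuration comes from the user-specified
     jscop file. */
  cuLaunchKernel(fn, grid_x, grid_y, 1, block_x, block_y, 1,
                 0 /* shared mem */, NULL /* stream */,
                 kernel_args, NULL);
  cuCtxSynchronize();

  cuModuleUnload(mod);
  cuCtxDestroy(ctx);
}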

There are two key challenges in this project.
1. How to automatically insert synchronization code.
This is very important to preserve the original semantics; we must
correctly detect where barriers need to be inserted (see the first sketch
after the second challenge).

2. How to automatically generate the memory copy operations between host
and device.
We must transport the input data to the GPU and copy the results back, as
shown in the second sketch below. Fortunately, Polly has implemented a
very expressive way to describe memory accesses. We will follow the
taxonomy proposed by Chris Gregg et al. in [3].
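
On the synchronization side, the generated kernels must contain a barrier
wherever one thread reads shared data written by another. A minimal CUDA
illustration of the kind of barrier the generator has to insert (the
kernel itself is made up purely for this example):

#define BLOCK 256  /* must match blockDim.x at launch */

/* Illustration only: each thread reads shared data written by its
   neighbouring thread, so a barrier is mandatory between the write
   phase and the read phase. */
__global__ void shift_left(const float *in, float *out, int n)
{
  __shared__ float buf[BLOCK];
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  if (i < n)
    buf[threadIdx.x] = in[i];

  /* Without this barrier, the read below may see stale data. */
  __syncthreads();

  if (i + 1 < n && threadIdx.x + 1 < BLOCK)
    out[i] = buf[threadIdx.x + 1];
}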
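
On the memory side, the read/write properties Polly derives tell us which
copies are needed. A sketch under that assumption for C = C + A*B, using
the CUDA runtime API (array names and sizes are placeholders): A and B are
read-only inputs, while C is both read and written, so only C is copied
back.

#include <cuda_runtime.h>

/* Sketch: copies derived from Polly's read/write information. */
void matmul_with_copies(float *h_A, float *h_B, float *h_C, size_t bytes)
{
  float *d_A, *d_B, *d_C;
  cudaMalloc((void **)&d_A, bytes);
  cudaMalloc((void **)&d_B, bytes);
  cudaMalloc((void **)&d_C, bytes);

  /* READ accesses: host-to-device copies before the kernel. */
  cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);
  /* C is read as well as written, so it is also copied in. */
  cudaMemcpy(d_C, h_C, bytes, cudaMemcpyHostToDevice);

  /* ... kernel launch goes here ... */

  /* WRITE accesses: device-to-host copy after the kernel. */
  cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

  cudaFree(d_A);
  cudaFree(d_B);
  cudaFree(d_C);
}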

*Timeline*

   - May 21 ~ June 11 Preliminary GPGPU Code Generation

In this stage, implement GPU code generation for 1-D and 2-D parallel loop
test cases which need no host memory copies for their input, and verify
that our method is workable.


   - June 12 ~ June 24 Automatic Memory Copy Insertion.

In this stage, insert memory copy operations for all array accesses
correctly, according to the read/write properties provided by Polly.


   - June 25 ~ July 8 Code Generation for Parallel Loops With Non-parallel
   Inner-most Loop.

In this stage, implement GPGPU code generation for the classical matrix
multiplication test case:

for (i = 0; i < N; i++) {
  for (j = 0; j < N; j++) {
    for (k = 0; k < N; k++)
      C[i][j] = C[i][j] + A[i][k] * B[k][j];
  }
}


   - July 9 ~ July 15 Midterm evaluation and writing documents.
   - July 16 ~ July 22 Automatic Synchronization Insertion.

In this stage, implement Muthu's method introduced in Section 4.3 of [2]
to insert barrier synchronizations that preserve semantic equivalence.

   - July 23 ~ August 5 Test on Polybench Benchmarks and Report Results.
   - August 6 ~ August 12 Summarize and Complete the Final Documents.

*Project Experience*
I have participated in several projects related to binary translation
(optimization) and runtime systems. I also implemented a frontend for
numerical computing languages like Octave/MATLAB, following the style of
Clang. Recently, I have worked very closely with the Polly team,
contributing patches [4] and investigating many details of polyhedral
transformation.
*References*
1. Tobias Grosser, Ragesh A. *Polly - First Successful Optimizations - How
to proceed?* LLVM Developer Meeting 2011.
2. Muthu Manikandan Baskaran, J. Ramanujam and P. Sadayappan. *Automatic
C-to-CUDA Code Generation for Affine Programs.* International Conference
on Compiler Construction (CC) 2010.
3. Chris Gregg and Kim Hazelwood. *Where is the Data? Why You Cannot
Debate GPU vs. CPU Performance Without the Answer.* International
Symposium on Performance Analysis of Systems and Software (ISPASS) 2011.
4. http://llvm.org/viewvc/llvm-project?view=rev&revision=153319