Hi Tobi,<div><br></div><div>I revise the proposal here. Can you review for me and give comments again? Thanks.</div><div><br></div><div><h2 style><span style="font-size:medium"><strong>Abstract</strong></span></h2><div style>

<span style="font-size:medium"><span style="vertical-align:baseline;white-space:pre-wrap">Very often, developing an GPGPU application is a time-consuming, complex, error-prone and </span><span style="vertical-align:baseline;white-space:pre-wrap">iterative</span><span style="vertical-align:baseline;white-space:pre-wrap"> process. </span>In this project, I propose to build an automatic GPGPU code generation framework for LLVM, based on two successful LLVM (sub-)projects - Polly and PTX backend. This can be very useful to ease the burden of the long learning curve of various GPU programming model. </span></div>

<div style><span style="font-size:medium"> </span></div><h2 style><span style="font-size:medium"><strong>Motivation</strong></span></h2><div style><span style="font-size:medium">With the broad proliferation of GPU computing, it is very important to provide an easy and automatic tool to develop or port the applications to GPU for normal developers, especially for those domain experts who want to harness the huge computing power of GPU. Polly has implemented many transformations, such as tiling, auto-vectorization and openmp code generation. And GPGPU code generation has been planned in [1]. With the help of LLVM's PTX backend, I plan to extend Polly with the feature of GPGPU code generation.</span><span style="font-size:medium;background-color:transparent"> </span></div>

<div style><span style="font-size:medium;background-color:transparent"><br></span></div><h2 style><span style="font-size:medium"><strong>Project Detail</strong></span></h2><div style><div><span style="font-size:medium">There are several successful projects on source to source automatic gpu code transformation. In this project, I will follow the method proposed in [2] by Muthu Manikandan Baskaran etc. </span><span style="font-size:medium">Since automatic GPGPU code generation is quite a complex problem,</span><span style="font-size:medium"> </span><span style="font-size:medium">specifically, we target two kinds of test cases. One is comprised of pure parallel loops, just like the following codes.</span></div>

<blockquote><samp><span style="font-size:medium">parfor(int i=0 to M)</span></samp><samp><span style="font-size:medium"><br></span></samp><samp><span style="font-size:medium">  parfor(int j=0 to N)</span></samp><samp><span style="font-size:medium"><br>

</span></samp><samp><span style="font-size:medium">    LoopBody(i, j);</span></samp></blockquote><div><span style="font-size:medium">Another one is that all the loops in it are parallel except the inner-most one, just like this:</span></div>

<blockquote><samp><span style="font-size:medium">parfor(int i=0 to M)</span></samp><samp><span style="font-size:medium"><br></span></samp><samp></samp><samp><span style="font-size:medium">  parfor(int j=0 to N)</span></samp><samp><span style="font-size:medium"><br>

</span></samp><samp><span style="font-size:medium">    non-parfor(int k=0 to K)</span></samp><samp><span style="font-size:medium"><br></span></samp><samp><span style="font-size:medium">      LoopBody(i, j, k);</span></samp></blockquote>

<div><span style="font-size:medium">The LoopBody part should be  limited to instructions or functions calls (intrinsic) which can be handled by LLVM's PTX backend.</span></div><div><span style="font-size:medium"> </span></div>

</div><div style><span style="font-size:medium">The work flow of our code generator is as follows. We first use Polly's jscop file importer to get a wanted 4-level parallel tiled code. Then we extract the loop body (or inner non-parallel loops) into a LLVM function, tagging it with PTX_Kernel or PTX_Device call convention. Then  we use PTX backend to translate the PTX_Kernel and PTX_Device functions into strings of the corresponding PTX codes. After that we transformed non-translatable part of the LLVM IRs with GPU runtime library calls inserted. The execution configure of GPU is acquired from external user-specified jscop files, which has been implemented by Polly. Finally, we provide an runtime library to generate the executable program or run the optimized LLVM IRs with JIT compiler like li.</span></div>

<div style><span style="font-size:medium"> </span></div><div style><span style="font-size:medium">There are two key challenges in this project.</span></div><div style><span style="font-size:medium">1. How to automatically insert the synchronization codes.</span></div>

<div style><span style="font-size:medium">This is very important to preserve the original semantics. We must detect where we need insert them correctly.</span></div><div style><span style="font-size:medium"> </span></div>

<div style><div><span style="font-size:medium">2. How to automatically generate the memory copy operation between host and device.</span></div><div><span style="font-size:medium">We must transport the input data to GPU and copy the results back. Fortunately, Polly has implemented a very expressive way to describe memory access. We will follow the <span style="background-color:rgb(255,255,255);line-height:16px">taxonomy proposed in [3] by Chris Gregg etc.</span></span></div>

<div><span style="font-size:medium"><span style="background-color:rgb(255,255,255);line-height:16px"><br></span></span></div><h2 style="font-size:14px"><span style="font-size:medium"><strong>Timeline</strong></span></h2><div>

<ul><li><span style="font-size:medium">May 21 ~ June 11 Preliminary GPGPU Code Generation</span></li></ul></div><div><span style="font-size:medium">In this stage,  implement gpu code generation for 1d and 2d parallel loops test cases which needn't to copy host memory as input. Verify  that our method is workable.</span></div>

<div><span style="font-size:medium"><br></span></div><div><ul><li><span style="font-size:medium">June 12 ~ June 24 automatic memory copy insertions.</span></li></ul></div><div><span style="font-size:medium">In this stage, insert memory copy operation for all the array accesses correctly according to the Read/Write property provided by Polly.</span></div>

<div><span style="font-size:medium"><br></span></div><div><ul><li><span style="font-size:medium">June 25 ~ July 8 Code Generation for Parallel Loops With Non-parallel Inner-most Loop.</span></li></ul></div><div><span style="font-size:medium">In this stage, implement gpgpu code generation for classical matrix</span><span style="font-size:medium"> multiplication test case.</span></div>

<blockquote><samp><samp></samp></samp><pre style="color:rgb(0,0,0);font-family:Verdana,Arial,Helvetica,sans-serif;font-size:medium;margin-top:8px;margin-right:8px;margin-bottom:8px;margin-left:8px;background-color:rgb(255,255,255)">

<span><strong><span style="color:rgb(160,32,240)">for</span></strong>(i=0; i<N; i++) { </span></pre><pre style="color:rgb(0,0,0);font-family:Verdana,Arial,Helvetica,sans-serif;font-size:medium;margin-top:8px;margin-right:8px;margin-bottom:8px;margin-left:8px;background-color:rgb(255,255,255)">

<span><strong><span style="color:rgb(160,32,240)"> for</span></strong>(j=0; j<N; j++) { </span></pre><pre style="color:rgb(0,0,0);font-family:Verdana,Arial,Helvetica,sans-serif;font-size:medium;margin-top:8px;margin-right:8px;margin-bottom:8px;margin-left:8px;background-color:rgb(255,255,255)">

<span><strong><span style="color:rgb(160,32,240)"> for</span></strong>(k=0; k<N; k++)</span></pre><pre style="color:rgb(0,0,0);font-family:Verdana,Arial,Helvetica,sans-serif;font-size:medium;margin-top:8px;margin-right:8px;margin-bottom:8px;margin-left:8px;background-color:rgb(255,255,255)">

<span> C[i][j] = C[i][j] + A[i][k] * B[k][j];</span></pre><samp><samp></samp></samp><pre style="color:rgb(0,0,0);font-family:Verdana,Arial,Helvetica,sans-serif;font-size:medium;margin-top:8px;margin-right:8px;margin-bottom:8px;margin-left:8px;background-color:rgb(255,255,255)">

<span> }</span></pre><samp><samp></samp></samp><pre style="color:rgb(0,0,0);font-family:Verdana,Arial,Helvetica,sans-serif;font-size:medium;margin-top:8px;margin-right:8px;margin-bottom:8px;margin-left:8px;background-color:rgb(255,255,255)">

<span>}</span></pre></blockquote><div><ul><li><span style="font-size:medium">July 9 ~ July 15 Midterm evaluation and writing documents.</span></li><li><span style="font-size:medium">July 16 ~ July 22 Automatic Synchronization Insertion.</span></li>

</ul><div><span style="font-size:medium">In this stage, implement </span><span style="font-size:medium;line-height:19px">Muthu's method instroduced in Section 4.3 in [2]</span><span style="font-size:medium"> to insert barrier synchronizations to preserve semantic-equivalent.</span></div>

<ul><li><span style="font-size:medium">July 23 ~ August 5 Test on Polybench Benchmarks and Report Results.</span></li><li><span style="font-size:medium">August 6 ~ August 12 Summarize and Complete the Final Documents.</span></li>

</ul><h2><strong style="font-size:medium">Project Experience</strong></h2></div><div><span style="font-size:medium">I participated in several projects related to binary translation (optimization) and run-time system. And I implemented a frontend for numerical computing languages like octave/matlab, following the style of clang. Recently, I work very close with Polly team to contribute some patches [4] and investigate lots of details about polyhedral transformation. </span></div>

<div><strong style="font-size:medium"><br></strong></div><h2 style="font-size:14px"><strong style="font-size:medium">References</strong></h2><div><span style="font-size:medium"><span style="line-height:19px">1. </span><span style="line-height:19px">Tobias Grosser, Ragesh A. </span><span style="line-height:19px"><strong><em>Polly - First Successful Optimizations - How to proceed?</em></strong> </span><span style="line-height:19px">LLVM Developer Meeting 2011.</span></span></div>

<div><span style="font-size:medium"><span style="line-height:19px">2. </span><span style="line-height:19px">Muthu Manikandan Baskaran, J. Ramanujam and P. Sadayappan.<em> </em></span><span style="line-height:19px"><strong><em>Automatic C-to-CUDA Code Generation for Affine Programs</em>.</strong> International Conference on Compiler Construction</span><span style="background-color:rgb(255,255,255);text-align:left;color:rgb(0,0,0)"> </span><span style="line-height:19px">(</span><span style="line-height:19px">CC) 2010.</span></span></div>

<div><span style="font-size:medium"><span style="line-height:19px">3. </span><span style="background-color:rgb(255,255,255);color:rgb(0,0,0)">Chris Gregg and Kim Hazelwood. <em><strong>Where is the Data? Why You Cannot Debate GPU vs. CPU Performance Without the Answer</strong></em></span></span><em><strong><span style="background-color:rgb(255,255,255);color:rgb(0,0,0)">.</span></strong></em><span style="font-size:medium"><em><strong><span style="background-color:rgb(255,255,255);color:rgb(0,0,0)"> </span></strong></em>International Symposium on Performance Analysis of Systems and Software (ISPASS) 2011<span style="background-color:rgb(255,255,255);color:rgb(0,0,0)">.</span></span></div>

<div><span style="font-size:medium"><span style="background-color:rgb(255,255,255);color:rgb(0,0,0)">4. </span><a href="http://llvm.org/viewvc/llvm-project?view=rev&revision=153319">http://llvm.org/viewvc/llvm-project?view=rev&revision=153319</a>.</span></div>

</div></div>