[LLVMdev] GSoC 2012 Proposal: Automatic GPGPU code generation for llvm

Justin Holewinski justin.holewinski at gmail.com
Tue Apr 3 07:30:58 PDT 2012


On Mon, Apr 2, 2012 at 7:16 AM, Yabin Hu <yabin.hwu at gmail.com> wrote:

> Hi all,
>
> I am a PhD student at Huazhong University of Science and Technology,
> China. The following is my GSoC 2012 proposal.
> Comments are welcome!
>
> *Title: Automatic GPGPU Code Generation for LLVM*
>
> *Abstract*
> Manually developing a GPGPU application is often a time-consuming,
> complex, error-prone and iterative process. In this project, I propose to
> build an automatic GPGPU code generation framework for LLVM, based on two
> successful LLVM (sub-)projects: Polly and the PTX backend. This would
> ease the burden of the long learning curve of the various GPU
> programming models.
>
> *Motivation*
> With the broad proliferation of GPU computing, it is important to
> provide an easy, automatic tool that lets ordinary developers write
> applications for the GPU or port them to it, especially domain experts
> who want to harness the GPU's computing power. Polly has implemented many
> transformations, such as tiling, auto-vectorization and OpenMP code
> generation. With the help of LLVM's PTX backend, I plan to extend Polly
> with GPGPU code generation.
>

Very interesting!  I'm quite familiar with Muthu's work, and putting that
into LLVM would be great.  If done right, it could apply to any
heterogeneous system, including AMD GPUs.

As the maintainer and primary developer on the PTX back-end, please feel
free to contact me with any issues/suggestions you have regarding the PTX
back-end!


>
>
> *Project Detail*
> In this project, we target the parallel loops that can be described
> by Polly's polyhedral model. We first translate the selected SCoPs (Static
> Control Parts) into 4-deep loop nests using Polly's schedule optimization.
> Then we extract the loop body (or the inner non-parallel loops) into an LLVM
> sub-function, tagged with the PTX_Kernel or PTX_Device calling convention.
> After that, we use the PTX backend to translate the sub-functions into a
> string of the corresponding PTX code. Finally, we provide a runtime library
> to generate the executable program.
>

I'm a bit confused by the wording here.  What do you mean by 'LLVM
sub-function?'  I'm assuming you mean extracting the relevant code into a
separate function, but I would just use the word 'function'.
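
To make the extraction step concrete, here is a minimal CPU-only C++ sketch
of the idea, under stated assumptions: the loop body is pulled out into its
own function (the would-be kernel), and the 4-deep loop nest that remains
plays the role of the GPU launch (two block loops, two thread loops).  The
names Index2D, scaleKernel and launchScale are hypothetical and are not part
of Polly, LLVM or CUDA; a real implementation would operate on LLVM IR, not
on C++ source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's blockIdx/threadIdx (not a real CUDA type).
struct Index2D {
  int x;
  int y;
};

// The extracted loop body. On a GPU, each (block, thread) pair would run
// this function once; here it is an ordinary C++ function.
void scaleKernel(Index2D block, Index2D thread, int tile, int rows, int cols,
                 const float *in, float *out) {
  int i = block.y * tile + thread.y; // global row index
  int j = block.x * tile + thread.x; // global column index
  if (i < rows && j < cols)          // guard partial tiles at the boundary
    out[i * cols + j] = 2.0f * in[i * cols + j];
}

// Emulates the launch on the CPU: the two outer loops walk the grid of
// blocks, the two inner loops walk the threads inside one block.
std::vector<float> launchScale(const std::vector<float> &in, int rows,
                               int cols, int tile) {
  assert(tile > 0 && rows >= 0 && cols >= 0);
  assert(in.size() ==
         static_cast<std::size_t>(rows) * static_cast<std::size_t>(cols));
  std::vector<float> out(in.size(), 0.0f);
  int gridY = (rows + tile - 1) / tile; // ceiling division
  int gridX = (cols + tile - 1) / tile;
  for (int by = 0; by < gridY; ++by)
    for (int bx = 0; bx < gridX; ++bx)
      for (int ty = 0; ty < tile; ++ty)
        for (int tx = 0; tx < tile; ++tx)
          scaleKernel({bx, by}, {tx, ty}, tile, rows, cols, in.data(),
                      out.data());
  return out;
}
```

Calling launchScale on a 2x3 input with tile = 2 visits every element exactly
once, with the bounds check discarding the out-of-range threads of the
partial tile.  On a real GPU the four loops disappear and the hardware
supplies the block and thread indices.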

And what do you mean by a run-time library to generate the executable
program?  Are you proposing to side-step the LLVM code generator LLC?  It
seems like a reasonable approach would be to write an LLVM pass (or set of
passes) that takes as input a single IR file, and produces two: (1) the GPU
kernel/device code, and (2) the non-translatable IR with GPU code replaced
by appropriate CUDA Driver API calls.  Then, both of these can pass through
the opt/llc tools with the appropriate selection for optimization passes
and target back-end.

This way, you could fairly easily create a GPGPU compiler by writing a
simple wrapper around Clang (or better yet, improve Clang to support
multiple targets simultaneously!)


>
> There are three key challenges in this project.
> 1. How to find the optimal execution configuration for the GPU code.
> The execution configuration is essential to the performance of the GPU
> code. It is limited by many factors, including the hardware, the source
> code, register usage, local store (device) usage, the original memory
> access patterns and so on. We must take all of these into consideration.
>
> 2. How to automatically insert synchronization code.
> This is essential to preserving the original semantics. We must correctly
> detect where synchronization needs to be inserted.
>
> 3. How to automatically generate the memory copy operations between host
> and device.
> We must transfer the input data to the GPU and copy the
> results back. Fortunately, Polly has implemented a very expressive way to
> describe memory accesses.
>
> *Timeline*
> May 21 ~ June 3 preliminary code generation for 1-D and 2-D parallel loops.
> June 4 ~ June 11 code generation for parallel loops with non-parallel
> inner loops.
> June 11 ~ June 24 automatic memory copy insertions.
> June 25 ~ July 8 auto-tuning for GPU execution configuration.
> July 9 ~ July 15 Midterm evaluation and writing documents.
> July 16 ~ July 22 automatic synchronization insertion.
> July 23 ~ August 3 testing on the PolyBench benchmarks.
> August 4 ~ August 12 summarize and complete the final documents.
>
> *Project experience*
> I have participated in several projects related to binary translation
> (optimization) and run-time systems. I also implemented a frontend for
> numerical computing languages like Octave/MATLAB, following the style of
> Clang. Recently, I have worked closely with the Polly team, contributing
> patches and investigating many details of polyhedral transformation.
>
> *References*
> 1. Tobias Grosser, Ragesh A. *Polly - First Successful Optimizations -
> How to proceed?* LLVM Developer Meeting 2011.
> 2. Muthu Manikandan Baskaran, J. Ramanujam and P. Sadayappan. *Automatic
> C-to-CUDA Code Generation for Affine Programs*. CC 2010.
> 3. Soufiane Baghdadi, Armin Größlinger, and Albert Cohen. *Putting
> Automatic Polyhedral Compilation for GPGPU to Work*. In Proc. of
> Compilers for Parallel Computers (CPC), 2010.
>


-- 

Thanks,

Justin Holewinski