Hi Justin,<br><br><div class="gmail_quote">2012/4/3 Justin Holewinski <span dir="ltr"><<a href="mailto:justin.holewinski@gmail.com">justin.holewinski@gmail.com</a>></span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><b>Motivation</b><br>With the broad proliferation of GPU computing, it is very important to provide an easy and automatic tool to develop or port the applications to GPU for normal developers, especially for those domain experts who want to harness the huge computing power of GPU. Polly has implemented many transformations, such as tiling, auto-vectorization and openmp code generation. With the help of LLVM's PTX backend, I plan to extend Polly with the feature of GPGPU code generation.</div>


</blockquote><div><br></div></div><div>Very interesting!  I'm quite familiar with Muthu's work, and putting that into LLVM would be great.  If done right, it could apply to any heterogeneous systems, including AMD GPUs.</div>

</blockquote><blockquote class="gmail_quote" style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span style>As the maintainer and primary developer on the PTX back-end, please feel free to contact me with any issues/suggestions you have regarding the PTX back-end!</span> </blockquote>

</div><div><br></div>Thanks for your interest and help.<div><br></div><blockquote class="gmail_quote" style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<span style>I'm a bit confused by the wording here.  What do you mean by 'LLVM sub-function?'  I'm assuming you mean extracting the relevant code into a separate function, but I would just use the word 'function'.</span></blockquote>

<div><br></div><div>Yes, it is indeed a function. I used the term "sub-function" to follow the method naming style of Polly's OpenMP code generation. I will fix this.</div><div><br></div><blockquote class="gmail_quote" style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<span style>And what do you mean by a run-time library to generate the executable program?</span></blockquote><div><br></div><div>As I see it, the runtime library is just a wrapper around the CUDA driver API. But we can add our own debug info in it and make changes in the CUDA API transparent to users.</div>

<div><br></div><div><br></div><blockquote class="gmail_quote" style="margin-top:0px;margin-right:0px;margin-bottom:0px;margin-left:0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

<div style>Are you proposing to side-step the LLVM code generator LLC?  It seems like a reasonable approach would be to write an LLVM pass (or set of passes) that takes as input a single IR file, and produces two: (1) the GPU kernel/device code, and (2) the non-translatable IR with GPU code replaced by appropriate CUDA Driver API calls.  Then, both of these can pass through the opt/llc tools with the appropriate selection for optimization passes and target back-end.</div>

<div style><br></div><div style>This way, you could fairly easily create a GPGPU compiler by writing a simple wrapper around Clang (or better yet, improve Clang to support multiple targets simultaneously!)</div></blockquote>

<div><br></div><div>Ether gave a similar suggestion on this point. Here I copy my reply to him to explain why I chose to embed the transformation pass in my implementation.</div><div><br></div><div><span style>Our original motivation for doing this is to provide a JIT compiler for our language frontend (a subset of MATLAB/Octave). I've extended lli to implement a JIT compiler (named gvm) that uses Polly dynamically. However, preliminary results show that the overhead is heavy, so I chose to move the dynamic optimization out of the JIT process.  Putting the pass that turns LLVM IR into a PTX assembly string into Polly can also provide a kind of one-touch experience for users. </span><span style>Please imagine the following user scenario</span><span style>.</span><span style>  When a user opens a MATLAB source file or a folder containing source files, we can start compiling the source statically and use Polly and opt to produce an optimized version of the LLVM IR. Finally, when the user clicks run or presses the Enter key, we just need to JIT the LLVM IR as usual, minimizing the dynamic overhead.</span></div>

<div style><br></div><div style><br></div><div style>Thanks again!</div><div style><br></div><div style>best regards,</div><div style>Yabin</div>