<div class="gmail_quote">On Tue, Oct 4, 2011 at 5:42 PM, Peter Collingbourne <span dir="ltr">&lt;<a href="mailto:peter@pcc.me.uk">peter@pcc.me.uk</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<div><div></div><div class="h5">On Tue, Oct 04, 2011 at 04:23:59PM -0400, Justin Holewinski wrote:<br>

> > > I'm currently investigating the following issues/concerns:<br>

> > ><br>

> > >    1. What is the plan for language-specific functions and other constructs,<br>

> > >    such as __syncthreads/barrier, get_local_id/threadIdx, etc.?  Is it up to<br>

> > >    the back-end to define compatible definitions of these, or is there a plan<br>

> > >    to introduce generic LLVM intrinsics for these?  Since OpenCL has<br>

> > >    pre-defined functions that do not require header files, it may be awkward to<br>

> > >    require OpenCL to include a back-end specific header file when compiling<br>

> > >    with Clang.<br>

> ><br>

> > For OpenCL, the implementation should provide definitions of<br>

> > the built-in functions described in section 6.11 of the OpenCL<br>

> > specification.  For at least some of those functions, the definitions<br>

> > would be the same for any OpenCL implementation.  (FWIW, I have<br>

> > developed a set of generic implementations of section 6.11 built-ins<br>

> > as part of an OpenCL implementation I have been working on, which I<br>

> > will be open sourcing soon.)<br>

> ><br>

> > For the rest (e.g. work-item functions), the implementation would<br>

> > need to be specific to the OpenCL implementation.  For example, on<br>

> > a CPU, the exact implementation details of work-item functions would<br>

> > be highly dependent on how the implementation stores work-item IDs,<br>

> > so it would not be appropriate to use a generic intrinsic.<br>

> ><br>

><br>

> Right.  I'm wondering what the implementation plan for this is with Clang.<br>

> Are you going to expose the OpenCL functions as LLVM intrinsics, and let<br>

> back-ends provide appropriate implementations?  Right now, I'm defining<br>

> these functions in terms of PTX builtin functions, but this is obviously not<br>

> optimal because you need to include an additional header in OpenCL code.<br>

<br>

</div></div>This is how I imagine the built-ins should be implemented:<br>

<br>

The built-in functions would be declared by a header file that belongs<br>

to an OpenCL C runtime library (not to be confused with the OpenCL<br>

Platform Layer or OpenCL Runtime defined by sections 4 and 5 of the<br>

OpenCL specification).  The runtime library in this case would consist<br>

of a set of header files and (optionally) a static or shared library<br>

file which together implement section 6.11 of the OpenCL specification.<br>

The runtime library would be a separate project from Clang<br>

(though it could potentially become an LLVM sub-project).<br>

<br>

The driver would be extended to support locating the runtime<br>

library's main header file, which could be installed in a known<br>

location, pre-including it using the -include command line option<br>

to the frontend (so that the functions declared by the header file<br>

are available to every OpenCL program), and setting linker options<br>

so that the runtime library is linked into the final executable.<br></blockquote><div><br></div><div>This makes sense to me.  The runtime library for PTX would be fairly easy to write, since it would mostly consist of stubs that call into PTX builtin functions.</div>
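<div>As a rough sketch of what such stubs might look like, here is a plain C version that compiles on a host. The names target_read_tid and fake_tid are hypothetical stand-ins for a target builtin and its register values; they are not real PTX or Clang names.</div>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a target builtin that reads a hardware
 * thread-ID register. A real back-end would supply this; here it just
 * returns fixed values so the sketch compiles and runs on a host. */
static const unsigned fake_tid[3] = {5, 2, 0};

static unsigned target_read_tid(unsigned dim)
{
    assert(dim < 3); /* callers must pass a valid dimension */
    return fake_tid[dim];
}

/* Runtime-library stub for the OpenCL work-item function get_local_id.
 * The OpenCL specification has it return 0 for an out-of-range
 * dimension, so the stub checks the index before forwarding. */
static size_t get_local_id(unsigned dim)
{
    return dim < 3 ? (size_t)target_read_tid(dim) : 0;
}

/* Kernel-style code only sees the stub, never the target builtin. */
static size_t flat_local_id(size_t local_size_x)
{
    return get_local_id(1) * local_size_x + get_local_id(0);
}
```

<div>Swapping the body of target_read_tid for a real builtin call would be the only target-specific change in this sketch; the code that calls get_local_id stays the same.</div>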

<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

<br>

Since my implementation of OpenCL is slightly unconventional (it<br>

is built into KLEE, a symbolic execution engine) I have not needed<br>

to implement any of the driver functionality (KLEE calls into the<br>

frontend directly, and the paths to the header and library files<br>

are hardcoded paths into the KLEE source and build directories),<br>

so I haven't thought too closely about the details.<br>

<div class="im"><br>

> > For CUDA, the NVIDIA header files provide appropriate declarations,<br>

> > but as far as I can tell, variables such as threadIdx are handled<br>

> > specially by nvcc, and functions such as __syncthreads are treated<br>

> > as builtins.  Clang does not currently implement the special handling<br>

> > for these variables or functions.<br>

> ><br>

><br>

> Are there any plans to implement any of these?<br>

<br>

</div>I doubt that I will have time to implement this myself, and I am<br>

unaware of anyone else who is willing to.<br></blockquote><div><br></div><div>I may take a look at the code to see what would be involved.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">


<div class="im"><br>

> > On Tue, Oct 04, 2011 at 07:28:26PM +0100, Peter Collingbourne wrote:<br>

</div><div class="im">> > > > and the OpenCL frontend seems to respect the address<br>

> > > >    mapping but does not emit complete array definitions for locally-defined<br>

> > > >    __local arrays.  Does the front-end currently not support __local arrays<br>

> > > >    embedded in the code?  It seems to work if the __local arrays are passed as<br>

> > > >    pointers to the kernel.<br>

> > ><br>

> > > Clang should support __local arrays, and this looks like a genuine<br>

> > > bug in the IR generator.  I will investigate.<br>

> ><br>

</div><div class="im">> > This actually seems to be an optimisation.  Since only the first<br>

> > element of the array is accessed, LLVM will only allocate storage for<br>

> > that element.  If you compile your example with -O0 (OpenCL compiles<br>

> > with optimisations turned on by default), you will see that the 64<br>

> > element array is created.<br>

> ><br>

><br>

> I'm not really convinced this is a legal optimization.  What if you<br>

> purposely allocate arrays with extra padding to prevent bank conflicts in<br>

> the kernel?<br>

<br>

</div>Preventing bank conflicts is a reasonable thing for one to want to do,<br>

but allocating arrays with extra padding is not a standards-compliant<br>

way to do it, given that (as far as I'm aware) the OpenCL specification<br>

says nothing about how storage is allocated.  If you are willing<br>

to go outside the requirements of the specification, Clang supports<br>

the C1X _Alignas keyword as an extension in all languages.  So for<br>

example, if you know that the target bank size is 1024 bytes, you could write:<br>

<br>

_Alignas(1024) __local float buffer[64];<br>

<br>
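The effect of the keyword can also be checked in plain C on a host compiler.<br>
This is only a minimal sketch, assuming a compiler that accepts _Alignas in C;<br>
BANK_BYTES, is_bank_aligned and check_buffer_alignment are names made up for<br>
this illustration, and 1024 is just the figure from the example above:<br>

```c
#include <assert.h>
#include <stdint.h>

#define BANK_BYTES 1024 /* assumed bank size in bytes, as in the example */

/* Host-side analogue of the __local buffer: static storage whose first
 * byte is forced onto a BANK_BYTES boundary. */
static _Alignas(BANK_BYTES) float buffer[64];

/* Returns 1 if p lies on a BANK_BYTES boundary, 0 otherwise. */
static int is_bank_aligned(const void *p)
{
    return (uintptr_t)p % BANK_BYTES == 0;
}

/* Checks the alignment guarantee; aborts if the compiler ignored it. */
static void check_buffer_alignment(void)
{
    assert(is_bank_aligned(buffer));
    assert(!is_bank_aligned(&buffer[1])); /* sizeof(float) bytes past the boundary */
}
```

This only controls where the array starts; it says nothing about how the<br>
elements inside it map onto banks.<br>

<br>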

Thanks,<br>

<font color="#888888">--<br>

Peter<br>

</font></blockquote></div><br><br clear="all"><div><br></div>-- <br><br><div>Thanks,</div><div><br></div><div>Justin Holewinski</div><br>