[Libclc-dev] [PATCH 1/2] Make image builtins r600/llvm-3.9 only
    Aaron Watry via Libclc-dev 
    libclc-dev at lists.llvm.org
       
    Tue Oct 10 19:03:12 PDT 2017
    
    
  
On Tue, Oct 10, 2017 at 3:58 PM, Jeroen Ketema <j.ketema at xs4all.nl> wrote:
>
>
>> On 10 Oct 2017, at 20:57, Aaron Watry via Libclc-dev <libclc-dev at lists.llvm.org> wrote:
>>
>> On Tue, Oct 10, 2017 at 1:16 PM, Jan Vesely <jan.vesely at rutgers.edu> wrote:
>>> On Tue, 2017-10-10 at 10:58 -0500, Aaron Watry via Libclc-dev wrote:
>>>> On Tue, Oct 10, 2017 at 9:25 AM, Jan Vesely <jan.vesely at rutgers.edu> wrote:
>>>>> On Mon, 2017-10-09 at 21:28 -0500, Aaron Watry via Libclc-dev wrote:
>>>>>> On Sun, Oct 8, 2017 at 1:23 PM, Jan Vesely via Libclc-dev
>>>>>> <libclc-dev at lists.llvm.org> wrote:
>>>>>>> On Sun, 2017-10-01 at 14:15 -0400, Jan Vesely wrote:
>>>>>>>> The implementation uses r600 specific intrinsics
>>>>>>>> LLVM-4 generates calls to functions using _ro_t and _rw_t image types
>>>>>>>> Portions of the code can be moved back as more targets/llvm versions add image support
>>>>>>>>
>>>>>>>> Signed-off-by: Jan Vesely <jan.vesely at rutgers.edu>
>>>>>>>> ---
>>>>>>>> AMDGPUOpenCLImageTypeLoweringPass relies on old style of kernel metadata
>>>>>>>> passing, so I'm pretty sure this does not work in 3.9 either, but it at
>>>>>>>> least does not generate external calls.
>>>>>>>> I'd like to keep the code around as a reference when someone starts
>>>>>>>> resurrecting image builtins.
>>>>>>>
>>>>>>> ping. this is just a code cleanup. No change in functionality.
>>>>>>
>>>>>> Looks ok to me.  I agree it would be nice to resurrect image support
>>>>>> in newer llvm, so lets keep this around for now.
>>>>>
>>>>> thanks. Does it apply to 2/2 as well?
>>>>
>>>> Yes, it does.  I just manually tested the check_external_calls.sh
>>>> script against both r600 and pitcairn's built libs/aliases on LLVM
>>>> 6.0, and the image calls are now gone. I don't have a LLVM 3.9 install
>>>> around at the moment to test my BARTS with, unfortunately. Maybe I'll
>>>> build that release at some point and install it in its own prefix.
>>>
>>> oh, I should have mentioned that it's in the images branch here:
>>> https://travis-ci.org/jvesely/libclc/branches
>>>
>>> I was considering to ask the github llvm-mirror person to enable travis
>>> an forward email to the list, or maybe setup our own mirror with CI
>>> enabled. or maybe add libclc to official llvm build farm.
>>>
>>>>
>>>>>
>>>>>>
>>>>>> If nothing else, it might give me some ideas on how to make an
>>>>>> implicit global buffer available for the printf implementation.
>>>>>
>>>>> I don't think images are a good model for that. It'd be nicer to use
>>>>> standard implicit parameter passing on clover side and
>>>>> __buitltin_implicitarg_ptr on libclc side, without any llvm changes.
>>>>> (It'd be even nicer if NDRange arguments were converted first, so they
>>>>> are all together in one place)
>>>>>
>>>>
>>>> Yeah, I have been considering the implicit parameter passing that
>>>> clover does, and had been leaning towards attempting to do that with a
>>>>> =1MB buffer for printf output. The other thing that will probably be
>>>>
>>>> needed is an atomic counter to keep track of our current location in
>>>> that buffer when printing.
>>>>
>>>> I'm a fair bit of the way done with having a clean-slate pure-C
>>>> implementation of printf done that matches the CLC requirements that
>>>> I've just been testing for identical output against my system printf.
>>>> I'll probably post that as an RFC in its pure C form before converting
>>>> it to CLC code just to save myself a bit of refactoring effort later
>>>> if needed.  Once that's done, I'll take a stab at converting it to
>>>> CLC, probably have to bug-fix clang to enable varargs function calls
>>>> specifically for printf (CL 2.0 spec, section 6.9 part e specifies
>>>> that variadic functions/macros are only allowed for
>>>> printf/enqueue_kernel), which I don't believe it allows yet, and then
>>>> I'll get the implicit parameter passing set up for clover (hopefully).
>>>
>>> I don't think you need to implement full printf in libclc. It should be
>>> enough to dump format string and arguments
>
> There’s not even a reason to dump the format string in the buffer, as per
> the OpenCL specification it’s known at compile time what the format
> string is going to be.
>
>>> into the buffer and let
>>> clover to the ugly part. printf code doesn't sound like something that
>>> would run efficiently on GPUs anyway.
>>
>> I considered forwarding the format spec and arguments to clover, but I
>> figured that we would need a custom printf implementation given that
>> CLC's printf supports the vector length specifier (and hl length) that
>> standard printf doesn't anyway, and that CLC doesn't support the full
>> set of printf specifiers that normal C allows.
>>
>> The same comment from my previous mail about enabling varargs calls in
>> CLC for printf still would apply, at which point the real question is
>
> I don’t get the argument about varargs. Contrary to the situation on a host
> system both the compiler and printf library function are under your control
> here. Hence, you can use the compiler to get rid of any vararg function
> calls before they ever reach the hardware.
>
>> performance, and how we'd deal with sending variable length arguments
>> back to clover in a buffer (probably just a giant byte array?).  I'm
>> not really convinced that printf is a performance critical function.
>> There may also be complexities involved in forwarding the printf
>> argument list back to clover in the cases where there are more
>> arguments supplied to the printf call than the format calls for.
>>
>> In this case, I figured that we at least needed a global buffer (and
>> output size variable) that was passed implicitly by clover to libclc
>> to dump the print output in, at which point I think we can just get
>> away with an atomically incremented/forwarded position variable for
>> where in the buffer the cursor currently is so that multiple threads
>> can output to the buffer in parallel.
>
> Be careful here: OpenCL 1.2 does not guarantee that atomic
> operations work across work-groups. Hence, you’re going
> to need a buffer per work-group.
Well, that's unfortunate (to put it mildly).  Global atomics end up
being pretty pointless if they're not actually guaranteed to be
atomic.
This may be do-able if we end up using atomic counters to guarantee
that global atomicity, but that would add one more thing that has to
be supported before we can expose printf:
https://www.khronos.org/registry/OpenCL/extensions/ext/cl_ext_atomic_counters_32.txt
But yeah, it's possible to create either a buffer per group, or just
one larger buffer that's sub-divided between the number of workgroups
being executed. That does eventually get us into buffer-sizing
requirements (CL spec requires a minimum of 1MB, but whether that'd be
per workgroup or total is possibly up to interpretation).
Either way, thanks for the warning. It would've been unfortunate to
find this out the hard way.
--Aaron
>
> Jeroen
>
>>
>> It honestly seems easier to do this in raw CLC than to pass the
>> contents back to clover to run through a custom printf function
>> anyway.  But what I have is currently in C, so I guess we could
>> consider it...
>>
>> --Aaron
>>
>>>
>>> Jan
>>>
>>>>
>>>> --Aaron
>>>>
>>>>> Jan
>>>>>
>>>>>>
>>>>>> --Aaron
>>>>>>
>>>>>>>
>>>>>>> Jan
>>>>>>>
>>>>>>>>
>>>>>>>> Jan
>
    
    
More information about the Libclc-dev
mailing list