[Libclc-dev] Any plan for OpenCL 1.2?

Aaron Watry via Libclc-dev libclc-dev at lists.llvm.org
Mon Jul 20 18:31:17 PDT 2020


On Mon, Jul 20, 2020 at 8:26 PM Aaron Watry <awatry at gmail.com> wrote:
>
> On Mon, Jul 20, 2020 at 12:52 PM Jan Vesely <jan.vesely at rutgers.edu> wrote:
> >
> > On Mon, 2020-07-20 at 09:24 -0500, Aaron Watry via Libclc-dev wrote:
> > > On Sat, Jul 18, 2020, 11:53 PM DING, Yang via Libclc-dev <
> > > libclc-dev at lists.llvm.org> wrote:
> > >
> > > > Hi,
> > > >
> > > > It seems libclc currently implements the library requirements of the
> > > > OpenCL C programming language, as specified by the OpenCL 1.1
> > > > Specification.
> > > >
> > > > I am wondering if there is any active development or plan to upgrade
> > > > it to OpenCL 1.2? If not, what are the biggest challenges?
> > > >
> > >
> > > I haven't checked in a while, but I think the biggest blocker at this
> > > point is that we still don't have a printf implementation in libclc.
> > > Most, if not all, of the other functions required to expose 1.2 are
> > > already implemented.
> > >
> > > I had started on a pure-C printf implementation a while back that would
> > > in theory be portable to devices printing to a local/global buffer, but
> > > I stalled out on it when I got to printing vector arguments and
> > > hex-float formats.  Another sticking point was that global atomics in
> > > CL aren't guaranteed to be synchronized across all work groups
> > > executing a kernel (only within a given work group for a given global
> > > buffer).
> >
> > I don't think we need to worry about that.  Since both the AMD and
> > NVPTX atomics are atomic across all work groups, we can just rely on
> > that behaviour.  The actual atomic op would be target specific, and
> > anyone who wants to add another target adds their own implementation
> > (SPIR-V can just use an atomic with the right scope).
> > AMD targets can be switched to use GDS as an optimization later.
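> >
> > As a rough sketch (the names here are hypothetical, not existing
> > libclc symbols), the generic code could call a single target hook:
> >
> >     /* Reserve nbytes in the global output buffer and return the old
> >      * write offset.  Generic printf code only calls this; each
> >      * target provides its own definition. */
> >     uint __printf_reserve(volatile global uint *write_ptr, uint nbytes)
> >     {
> >         /* This body already works on AMD/NVPTX, where global atomics
> >          * are visible to every work group on the device. */
> >         return atomic_add(write_ptr, nbytes);
> >     }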
>
> Yeah, if we go the route of what I had started (not saying we should),
> then a target-specific implementation with no generic fallback is
> probably the easiest option.
>
> >
> > At least CL 1.2 printf only prints to stdout, so we only need to
> > consider global memory.
> >
> > >
> > > If someone wants to take a peek or keep going with it, I've uploaded my WIP
> > > code for the printf implementation here: https://github.com/awatry/printf
> >
> > I'm not sure parsing the format string on the device is the best
> > approach, as it will introduce quite a lot of divergence.  It might be
> > easier/faster to just copy the format string and input data to the
> > buffer and let the host parse/print everything.
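> >
> > For illustration, with a host-side parser the device half could
> > shrink to something like this (record layout and names invented for
> > the example; write offsets assumed 4-byte aligned):
> >
> >     /* Per-call record: [total size][format id][raw argument bytes].
> >      * The host walks the buffer afterwards, maps format_id back to
> >      * the format string, and does all parsing/printing with the
> >      * system printf. */
> >     void log_call(volatile global uint *write_ptr,
> >                   global uchar *out_buf, uint format_id,
> >                   const private uchar *args, uint nargs)
> >     {
> >         uint size = 8 + nargs;            /* two uint header words */
> >         uint off  = atomic_add(write_ptr, size);
> >         global uint *hdr = (global uint *)(out_buf + off);
> >         hdr[0] = size;
> >         hdr[1] = format_id;
> >         for (uint i = 0; i < nargs; ++i)
> >             out_buf[off + 8 + i] = args[i];
> >     }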
>
> Yeah, I don't remember if any of my notes from back when I was working
> on this were along that line, but I know the thought crossed my mind a
> few times, and I never fully dismissed it, given the performance, the
> branchiness, and the sheer amount of code and stack/register pressure
> that the implementation I was working on would introduce.  If it
> weren't for the special vector output formats, we could pretty much
> forward the print format and arguments back to the host and just use
> the standard system printf.  It might still be easiest to special-case
> only those formats (and there might've been one or two other
> differences from standard C printf; it's been a while since I started
> this).
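>
> For reference, the vector specifiers are the main CL addition on top
> of C99 printf; a C library printf on the host has no idea what to do
> with something like:
>
>     float4 f = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
>     /* 'v4' vector length plus the 'hl' modifier; prints
>      * "1.00,2.00,3.00,4.00", so it has to be expanded somewhere
>      * before being handed to the host's printf. */
>     printf("%2.2v4hlf\n", f);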
>
> >
> > Was the plan to:
> > 1.) parse the input once to get the number of bytes,
> > 2.) atomically move the write pointer,
> > 3.) parse the input a second time and print the characters to the buffer,
> >
> > or did you have anything more specialized in mind?
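> >
> > In code, that two-pass scheme would look roughly like the following
> > (the __printf_measure/__printf_emit helpers are hypothetical):
> >
> >     /* Pass 1: format without writing, just to learn the length. */
> >     uint len = __printf_measure(fmt, args);
> >
> >     /* One atomic per printf call reserves a contiguous slot. */
> >     uint off = atomic_add(write_ptr, len);
> >
> >     /* Pass 2: format again, writing into out_buf[off..off+len). */
> >     __printf_emit(out_buf + off, fmt, args);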
>
> The one that I was working on actually walked the print format input
> character by character until it hit a '%' (or anything else special),
> and when it came time to output anything, the idea was that we'd use an
> atomic increment to allocate a character in the output buffer and write
> it.  Racy, to be sure, and you'd end up with output interleaved from
> all threads attempting to write simultaneously.  A previous
> conversation I had indicated that the CL spec doesn't guarantee that
> atomic operations/buffers are synchronized across work groups, so that
> got me started down the mental path of partitioning the output buffer
> into N segments (where N is the number of work groups launched), so
> that each work group would at least only need to synchronize output
> within its own segment.
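>
> In other words, roughly this per character (a sketch, not the actual
> WIP code):
>
>     /* One atomic per character: each write gets a unique slot, but
>      * characters from concurrent printf calls interleave freely. */
>     void emit_char(volatile global uint *write_ptr,
>                    global char *out_buf, char c)
>     {
>         out_buf[atomic_inc(write_ptr)] = c;
>     }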

Ahh, yeah, and now the rust is slowly getting polished off.  I think I
had planned on building the printf output in a private buffer/array and
then, at the end of the printf operation (or whenever the private
buffer was full), flushing the built string to the global buffer
instead of writing one character at a time directly to it.
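
Something like the following is what I mean by that buffering idea (a
sketch; the names are made up):

    /* Flush a work-item's private scratch string to the global output
     * buffer.  One atomic reserves the whole run, then a plain copy,
     * so each flush stays contiguous instead of interleaving
     * character by character. */
    void flush_scratch(volatile global uint *write_ptr,
                       global char *out_buf,
                       const private char *scratch, uint used)
    {
        uint off = atomic_add(write_ptr, used);
        for (uint i = 0; i < used; ++i)
            out_buf[off + i] = scratch[i];
    }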

Sorry for the rambling. I started this almost 3 years ago now, and
haven't touched it since Oct 2017, so the memory has faded a bit in
the interim.

--Aaron

>
> I will fully admit that the implementation has its issues, but from my
> reading of the spec I think it would've at least been compliant.
>
> That being said, I got a good start on a set of unit tests while
> working on it, so it wasn't a complete waste.  If Serge is working on
> an implementation that copies the format strings and arguments from
> the device to Mesa in order to print them on the host, I'm more than
> willing to go with that, and I can probably port my tests over to
> piglit at some point as a sanity check, if the CTS isn't thorough
> enough.
>
> --Aaron
>
> >
> > thanks,
> > Jan
> >
> > >
> > > It's probably horrible, and may have to be rewritten from scratch to
> > > actually work on a GPU, but it may be a start :)
> > >
> > > Thanks,
> > > Aaron
> > >
> > >
> > > > Thanks,
> > > > Yang