[Libclc-dev] Any plan for OpenCL 1.2?

Aaron Watry via Libclc-dev libclc-dev at lists.llvm.org
Mon Jul 20 18:26:40 PDT 2020


On Mon, Jul 20, 2020 at 12:52 PM Jan Vesely <jan.vesely at rutgers.edu> wrote:
>
> On Mon, 2020-07-20 at 09:24 -0500, Aaron Watry via Libclc-dev wrote:
> > On Sat, Jul 18, 2020, 11:53 PM DING, Yang via Libclc-dev <
> > libclc-dev at lists.llvm.org> wrote:
> >
> > > Hi,
> > >
> > > It seems libclc currently implements the library requirements of the
> > > OpenCL C programming language, as specified by the OpenCL 1.1
> > > Specification.
> > >
> > > I am wondering if there is any active development or plan to upgrade
> > > it to OpenCL 1.2? If not, what are the biggest challenges?
> > >
> >
> > I haven't checked in a while, but I think the biggest blocker at this
> > point is that we still don't have a printf implementation in libclc.
> > Most/all of the other functions required to expose 1.2 are already
> > implemented.
> >
> > I had started on a pure-C printf implementation a while back that would
> > in theory be portable to devices printing to a local/global buffer, but
> > I stalled out on it when I got to printing vector arguments and
> > hex-float formats.  Another sticking point was that global atomics in
> > CL aren't guaranteed to be synchronized across all work-groups
> > executing a kernel (only within a given work-group for a given global
> > buffer).
>
> I don't think we need to worry about that.  Since both the AMD and
> NVPTX atomics are atomic across all work-groups, we can just use that
> behaviour.  The actual atomic op would be target-specific, and anyone
> who wants to add another target would add their own implementation
> (SPIR-V can just use an atomic with the right scope).
> AMD targets can be switched to use GDS as an optimization later.

Yeah, if we go the route of what I had started (not saying we should),
then making it a target-specific implementation with no generic
fallback is probably the easiest approach.
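
Something like the following is the kind of target hook I mean; this is
just a rough sketch in OpenCL C, and the names (__clc_printf_alloc, the
write-pointer argument) are entirely hypothetical:

    /* Sketch: reserve nbytes of space in a global printf buffer.
     * A generic version could look like this; each target would
     * override it with whatever atomic is coherent across all
     * work-groups on that device (e.g. GDS on AMD later on). */
    __global char *__clc_printf_alloc(volatile __global int *write_ptr,
                                      __global char *buf, int nbytes)
    {
        /* atomic_add on a global pointer; relies on the target's
         * atomics being visible to every work-group of the kernel. */
        int offset = atomic_add(write_ptr, nbytes);
        return buf + offset;
    }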

>
> At least CL 1.2 printf only prints to stdout, so we only need to
> consider global memory.
>
> >
> > If someone wants to take a peek or keep going with it, I've uploaded my WIP
> > code for the printf implementation here: https://github.com/awatry/printf
>
> I'm not sure parsing the format string on the device is the best
> approach, as it will introduce quite a lot of divergence.  It might be
> easier/faster to just copy the format string and input data to the
> buffer and let the host parse/print everything.

Yeah, I don't remember if any of my notes from when I was working on
this were along that line, but I know the thought crossed my mind a
few times (and I hadn't ruled it out, given the performance,
branchiness, and the sheer amount of code and stack/register pressure
that the implementation I was working on would introduce).  If it
weren't for the special vector output formats, we could pretty much
forward the format string and arguments back to the host and just use
the standard system printf.  It might still be easier to only do
special handling of those formats (and there might've been one or two
other differences from standard C printf; it's been a while since I
started this).
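
(For reference, the main CL-specific piece is the vector length
modifier, which standard C printf has no notion of.  Per the 1.2 spec,
printing a float4 looks something like:

    float4 f = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
    /* %v4hlf: 4-element vector, 'hl' length modifier for float;
     * the elements are printed separated by commas. */
    printf("f = %v4hlf\n", f);

so either the device or the host has to expand that into per-element
conversions itself.)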

>
> Was the plan to:
> 1.) parse the input once to get the number of bytes,
> 2.) atomically move the write pointer,
> 3.) parse the input a second time and write the characters to the buffer?
>
> Or did you have anything more specialized in mind?

The one I was working on actually walked the format string character
by character until it hit a '%' (or anything else special).  When it
came time to output anything, the idea was that we'd use an atomic
increment to allocate a character in the output buffer and write it.
Racy, to be sure, and you'd end up with output interleaved from all
threads attempting to write simultaneously.  A previous conversation I
had indicated that the CL spec doesn't guarantee that atomic
operations/buffers are synchronized across work-groups, so that got me
started down the mental path of partitioning the output buffer into N
segments (where N is the number of work-groups launched), so you could
at least synchronize the output within each work-group.
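
The emit path was conceptually something like this (heavily
simplified, and the names are hypothetical):

    /* Sketch: every byte of output claims its own slot via an atomic
     * increment, so output from concurrent work-items interleaves
     * byte-by-byte.  With the buffer partitioned per work-group,
     * write_ptr/buf would point at that group's segment instead. */
    void __clc_emit_char(volatile __global int *write_ptr,
                         __global char *buf, int buf_size, char c)
    {
        int slot = atomic_inc(write_ptr);  /* racy across work-items */
        if (slot < buf_size)
            buf[slot] = c;                 /* drop output on overflow */
    }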

I will fully admit that the implementation has its issues, but from my
reading of the spec I think it would've at least been compliant.

That being said, I got a good start on a set of unit tests while
working on it, so it wasn't a complete waste.  If Serge is working on
an implementation that copies the format strings and arguments from
the device to Mesa in order to print them on the host, I'm more than
willing to go with that, and I can probably port my tests over to
piglit at some point as a sanity check if the CTS isn't thorough
enough.
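
FWIW, the device side of a copy-to-host scheme could stay fairly
small: just append one record per printf call and let the host do all
of the parsing/printing once the kernel finishes.  A rough sketch,
with the record layout entirely made up:

    /* Sketch: the kernel never parses the format string; it just
     * copies an identifier for it plus the raw argument bytes into
     * the global buffer, and the host formats everything later. */
    typedef struct {
        int fmt_id;      /* index into a host-side format-string table */
        int data_bytes;  /* number of raw argument bytes that follow */
        /* argument bytes follow, packed in declaration order */
    } printf_record;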

--Aaron

>
> thanks,
> Jan
>
> >
> > It's probably horrible, and may have to be rewritten from scratch to
> > actually work on a GPU, but it may be a start :)
> >
> > Thanks,
> > Aaron
> >
> >
> > > Thanks,
> > > Yang