[cfe-dev] [RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver

Tue Mar 8 07:37:00 PST 2016

>  4. Store the device code in special sections of the host object file. This seems the most build-system friendly, although perhaps the most complicated to implement on our end. Also, as has been pointed out, it is the technique nvcc uses.

Storing the device code in special (non-loadable) sections of the host object (Hal’s option 4) is what Intel’s compiler has implemented, and what we would like to support, for the following reasons:

- We have single source files supporting offloading to devices. That should produce single fat objects that include the device objects as well.

- Invocation of linker will result in a single dynamic library or executable with device binaries embedded.

- It will make the use of device offload transparent to users, and it supports separate compilation and existing Makefiles.

- It ensures the host and target object dependencies are easily maintained.

  Makefiles create object files and may move them during the build process. These Makefiles would have to be changed to support separate device objects. Naming conventions could also be an issue for separate device objects.

- Static libraries can be made up of fat objects as well. When the driver invokes the target linker it knows to look for device objects in the static libraries as well.

  A static library that is provided remains a single library, even with device code embedded.

- With support for fat executables/dynamic libraries it should be fairly straightforward to make fat objects as well.

- Customers we have worked with have provided the feedback to generate fat objects for ease of use.

- This does leave open the issue of how assembly files and intermediate IT files are handled, and what naming conventions to use for them.

Thanks,

Knud

From: Andrey Bokhanko [mailto:andreybokhanko at gmail.com]
Sent: Monday, March 07, 2016 1:43 PM
To: Kirkegaard, Knud J <knud.j.kirkegaard at intel.com>
Subject: Fwd: [cfe-dev] [RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver

---------- Forwarded message ----------
From: Hal Finkel via cfe-dev <cfe-dev at lists.llvm.org>
Date: Mon, Mar 7, 2016 at 7:56 AM
Subject: Re: [cfe-dev] [RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver
To: Justin Lebar <jlebar at google.com>
Cc: Alexey Bataev <a.bataev at hotmail.com>, C Bergström via cfe-dev <cfe-dev at lists.llvm.org>, John McCall <rjmccall at gmail.com>, Samuel F Antao <sfantao at us.ibm.com>

----- Original Message -----
> From: "Justin Lebar via cfe-dev" <cfe-dev at lists.llvm.org>
> To: "Samuel F Antao" <sfantao at us.ibm.com>
> Cc: "Alexey Bataev" <a.bataev at hotmail.com>, "C Bergström via cfe-dev" <cfe-dev at lists.llvm.org>, "John McCall" <rjmccall at gmail.com>
> Sent: Saturday, March 5, 2016 11:18:54 AM
> Subject: Re: [cfe-dev] [RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver
>
> > Ok, you could link each component, but then you couldn't do
> > anything, because the device side only works if you have that
> > specific host code, allocating the data and invoking the kernel.
>
> Sure, you'd have to do something after linking to group everything
> together.  Like, more reasonably, you could link together all the
> device object files, and then link together the host object files plus
> the one device blob using a tool which understands this blob.
>
> Or you could just pass all of the object files to a tool which
> understands the difference between host and device object files and
> will DTRT.
>
> > B: nvcc -rdc=true a.o b.o -o a.out
> > Wouldn't it be desirable to have clang support case B as well?
>
> Sure, yes.  It's maybe worth elaborating on how we support case A
> today.  We compile the .cu file for device once for each device
> architecture, generating N .s files and N corresponding .o files.
> (The .s files are assembled by a black-box tool from nvidia.)  We then
> feed both the .s and .o files to another tool from nvidia, which makes
> one "fat binary".  We finally incorporate the fatbin into the host
> object file while compiling.
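
The data flow described above can be sketched as a small planner. This is only an illustration of which intermediate files exist and in what order; the step labels, the file-naming scheme, and the function name `plan_device_steps` are hypothetical and do not correspond to real clang driver actions or nvidia tool command lines.

```python
from pathlib import Path


def plan_device_steps(source, archs):
    """Return (step, inputs, output) tuples for one .cu file.

    Step labels and file names are made up for illustration only.
    """
    stem = Path(source).stem
    steps = []
    fatbin_inputs = []
    for arch in archs:
        asm = f"{stem}-{arch}.s"  # device assembly for this architecture
        obj = f"{stem}-{arch}.o"  # assembled device object for this architecture
        steps.append(("compile-device", [source], asm))
        steps.append(("assemble-device", [asm], obj))
        # Both the .s and the .o file go into the fat binary.
        fatbin_inputs += [asm, obj]
    fatbin = f"{stem}.fatbin"
    steps.append(("make-fatbin", fatbin_inputs, fatbin))
    # The host compilation embeds the fat binary in the host object.
    steps.append(("compile-host", [source, fatbin], f"{stem}.o"))
    return steps


if __name__ == "__main__":
    for step in plan_device_steps("a.cu", ["sm_35", "sm_50"]):
        print(step)
```

For N architectures this yields 2N device steps, one fat-binary step, and one host step, with a single host object as the only file the build system needs to track.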
>
> Which sounds a lot like what I was telling you I didn't want to do, I
> know.  :)  But the reason I think it's different is that there exists
> a widely-adopted one-object-file format for cuda/nvptx.  So if you do
> the above in the right way, which we do, all of nvidia's binary tools
> (objdump, etc) just work.  Moreover, there are no real alternative
> tools to break by this scheme -- the ISA is proprietary, and nobody
> has bothered to write such a tool, to my knowledge.  If they did, I
> suspect they'd make it compatible with nvidia's (and thus our)
> format.
>
> Since we already have this format and it's well-supported by tools
> etc, we'd probably want to support in clang unbundling the CUDA code
> at linktime, just like nvcc.
>
> Anyway, back to your question, where we're dealing with an ISA which
> does not have a well-established bundling format.  In this case, I
> don't think it would be unreasonable to support
>
>   clang a-host.o a-device.o b-host.o b-device.o -o a.out
>
> clang could presumably figure out the architecture of each file either
> from its name, from some sort of -x params, or by inspecting the file
> -- all three would have good precedent.
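
As a rough sketch of the "inspecting the file" option, assuming the objects are ELF files: the `e_machine` field of the ELF header identifies the target architecture. The helper names `elf_machine` and `fake_header` are hypothetical, and this is not how the clang driver is implemented; it only shows that the information is cheap to read.

```python
import struct

# A few e_machine values as defined in elf.h.
EM_NAMES = {
    21: "ppc64",     # EM_PPC64
    62: "x86_64",    # EM_X86_64
    183: "aarch64",  # EM_AARCH64
    190: "cuda",     # EM_CUDA
}


def elf_machine(header):
    """Return a short architecture name from the first bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # e_ident[EI_DATA] (byte 5): 1 = little-endian, 2 = big-endian.
    if header[5] == 1:
        fmt = "<H"
    elif header[5] == 2:
        fmt = ">H"
    else:
        raise ValueError("unknown ELF data encoding")
    # e_machine is a 16-bit field at offset 18, after the 16-byte e_ident
    # and the 2-byte e_type.
    (machine,) = struct.unpack_from(fmt, header, 18)
    return EM_NAMES.get(machine, f"unknown({machine})")


def fake_header(machine, little_endian=True):
    """Build a minimal 20-byte ELF header prefix for demonstration."""
    ident = b"\x7fELF" + bytes([2, 1 if little_endian else 2, 1]) + bytes(9)
    order = "<" if little_endian else ">"
    return ident + struct.pack(order + "HH", 1, machine)  # e_type = ET_REL


if __name__ == "__main__":
    print(elf_machine(fake_header(62)))
    print(elf_machine(fake_header(190)))
```

On a real file, the same function could be applied to the first 20 bytes read from disk.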
>
> The only issue is whether or not this should instead look like
>
>   clang a.tar b.tar -o a.out
>
> The functionality is exactly the same.
>
> If we use tar or invent a new format, we don't necessarily have to
> change build systems.  But we've either opened a new can of worms by
> adding to clang a file format that is rather more expressive than we
> want (tar is the obvious choice, but it's not a great fit; no random
> access, no custom metadata, lots of edge cases to handle as errors,
> etc), or we've made up a new file format with all the problems we've
> discussed.
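
To make the tar variant concrete, here is a minimal sketch using Python's standard `tarfile` module. The member names `a-host.o` and `a-device.o` follow the command line quoted above; `bundle` and `unbundle` are hypothetical helpers, not part of any proposed tool.

```python
import io
import tarfile


def bundle(members):
    """Pack {name: bytes} into an uncompressed tar archive held in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def unbundle(blob):
    """Unpack a tar archive into {name: bytes}.

    tar has no central index, so members are found by walking the
    archive from the start, one header at a time.
    """
    out = {}
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
        for info in tar:
            if info.isfile():
                out[info.name] = tar.extractfile(info).read()
    return out


if __name__ == "__main__":
    blob = bundle({"a-host.o": b"host code", "a-device.o": b"device code"})
    print(sorted(unbundle(blob)))
```

The sequential walk in `unbundle` is the "no random access" limitation mentioned above: locating one member means reading every header that precedes it.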

Many of the projects that will use this feature are very large, with highly non-trivial build systems. Requiring significant changes (beyond the normal changes to compiler paths and flags) in order to use OpenMP (including with accelerator support) should be avoided wherever possible. This is much more important than ensuring tools compatibility with other compilers (although accomplishing both goals simultaneously seems even better still). Based on my reading of this thread, it seems like we have several options (these come to mind):

 1. Use multiple object files. Many build-system changes can be avoided by using some scheme for guessing the name of the device object files from that of the host object file. This often won't work, however, because many build systems copy object files around, add them to static archives, etc. and the device object files would be missed in these operations.

 2. Use some kind of bundling format. tar, zip, ar, etc. seem like workable options. Any user who runs 'file' on them will easily guess how to extract the data. objdump, etc., however, won't know how to handle these directly (which can also have build-system implications, although more rarely than for (1)).

 3. Treat the input/output object file name as a directory, and store in that directory the host and device object files. This might be effectively transparent, but also suffers from potential build-system problems (rm -f won't work, for example).

 4. Store the device code in special sections of the host object file. This seems the most build-system friendly, although perhaps the most complicated to implement on our end. Also, as has been pointed out, it is the technique nvcc uses.
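
The failure mode of option (1) can be shown with a small sketch. It assumes a made-up naming convention in which the device object for `a.o` is `a-device.o`; `guess_device_object` and `demo` are hypothetical names used only for this illustration.

```python
import shutil
import tempfile
from pathlib import Path


def guess_device_object(host_obj):
    """Hypothetical convention: the device object for 'a.o' is 'a-device.o'."""
    return host_obj.with_name(host_obj.stem + "-device" + host_obj.suffix)


def demo():
    """Return (found_before_copy, found_after_copy) for the device object."""
    with tempfile.TemporaryDirectory() as tmp:
        build = Path(tmp) / "build"
        stage = Path(tmp) / "stage"
        build.mkdir()
        stage.mkdir()

        # The compiler writes both objects next to each other.
        host = build / "a.o"
        host.write_bytes(b"host")
        guess_device_object(host).write_bytes(b"device")
        found_before = guess_device_object(host).exists()

        # A build step that only knows about the host object copies it away.
        moved = Path(shutil.copy(host, stage / "a.o"))
        found_after = guess_device_object(moved).exists()
        return found_before, found_after


if __name__ == "__main__":
    print(demo())
```

Once the host object has been copied without its companion, the guessed device path no longer exists. With option (4) this cannot happen, because the device code travels inside the host object itself.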

All things considered, I think that I'd prefer (4). If we're picking an option to minimize build-system changes, which I fully support, picking the option with the smallest chance of incompatibilities seems optimal. There is also other (prior) art here, and we should find out how GCC is handling this in GCC 6 for OpenACC and/or OpenMP 4 (https://gcc.gnu.org/wiki/OpenACC). Also, we can check on PGI and/or Pathscale (for OpenACC, OpenHMPP, etc.), in addition to any relevant details of what nvcc does here.

Thanks again,
Hal
