[cfe-dev] [RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver

Mon Mar 7 18:38:09 PST 2016

The more I think about this, the more I'm convinced that having the
clang driver output a tar (or zip or whatever, we can bikeshed later)
makes a lot of sense, in terms of meeting all of our competing
requirements.

tar is trivially compatible with all existing tools.  It's also
trivial to edit.  Want to add or remove an object file from the
bundle?  Go for it.  Want to disassemble just one platform's code,
using a proprietary tool you don't control?  No problem.

tar also preserves the single-file-out behavior of the compiler, so it
should be compatible with existing build systems with minimal changes.

There's also a very simple model for explaining how tars would work
when linking with the clang driver:

  $ clang a.tar b.tar

is exactly equivalent to extracting a.tar and b.tar and then doing

  $ clang a.file1 a.file2 b.file1 b.file2

tar has the nice property that it has an unambiguous ordering.
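
To make that model concrete, here is a rough Python sketch of the
expansion step.  It is only an illustration: the helper names
make_tar and expand_tar_args are made up for this example, and
nothing here reflects how the clang driver is (or would be)
implemented.  It just shows that each tar argument can be replaced by
its members, in archive order, before the usual handling of inputs.

```python
import io
import tarfile


def make_tar(path, members):
    """Write an uncompressed tar at `path` holding (name, bytes) pairs, in order."""
    with tarfile.open(path, "w") as tar:
        for name, data in members:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


def expand_tar_args(args):
    """Hypothetical helper: replace each tar argument with its member names.

    Members keep the order in which they were written to the archive, and
    non-tar arguments pass through untouched.
    """
    expanded = []
    for arg in args:
        if arg.endswith(".tar") and tarfile.is_tarfile(arg):
            with tarfile.open(arg) as tar:
                expanded.extend(tar.getnames())
        else:
            expanded.append(arg)
    return expanded
```

With a.tar holding a.file1 and a.file2, and b.tar holding b.file1 and
b.file2, expand_tar_args(["a.tar", "b.tar"]) yields the four member
names in the same order as the second command line above.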

One final advantage of tar is that it's trivial to memory map portions
of the tarball, so extracting one arch's object file is a zero-cost
operation.  And tars can be written incrementally, so if you don't do
your different architectures' compilations in parallel, each stage of
compilation could just write to the one tarball, letting you avoid a
copy.
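
Both properties can be sketched in a few lines of Python using the
standard tarfile and mmap modules.  This assumes an uncompressed
tarball, and the names append_member and mapped_member are
hypothetical helpers for this example only.  The first appends one
stage's output to an existing tarball; the second exposes a single
member's bytes through a memory map without extracting anything.

```python
import io
import mmap
import tarfile
from contextlib import contextmanager


def append_member(tar_path, name, data):
    """Append one member to an uncompressed tar, creating the file if needed."""
    with tarfile.open(tar_path, "a") as tar:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))


@contextmanager
def mapped_member(tar_path, name):
    """Yield a read-only memoryview over one member's data, without copying it."""
    # "r:" refuses compressed archives, where member data is not stored verbatim.
    with tarfile.open(tar_path, "r:") as tar:
        info = tar.getmember(name)
    start, end = info.offset_data, info.offset_data + info.size
    with open(tar_path, "rb") as f:
        # Member data is 512-byte aligned, not page aligned, so map the whole
        # file and slice it rather than passing an offset to mmap.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            base = memoryview(mapped)
            view = base[start:end]
            try:
                yield view
            finally:
                # Views must be released before the mapping can be closed.
                view.release()
                base.release()
```

Each compilation stage would call append_member once, and a consumer
that only cares about one arch would open just that member with
mapped_member.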

If we instead bundle into a single object file, which seems to be the
other main proposal at the moment, we have to deal with a lot of
additional complexity:

* We'll have to come up with a scheme for all of the major object file
formats (ELF, Mach-O, PE/COFF on Windows)

* We'll basically be reinventing the wheel wrt a lot of the metadata
in an object file.  For example, ELF specifies the target ISA in its
header.  But our object files will contain multiple ISAs.  Similarly,
we'll have to come up with a debug info scheme, and come up with a way
to point between different debuginfo sections, in potentially
different (potentially proprietary!) formats, at different code for
different archs.  I'm no expert, but ELF really doesn't seem built for
this.  (Thus the existence of the FatELF project.)

* Unless we choose to do nested object files (which seems horrible),
our multiarch object file is not going to contain within it N valid
object files -- the data for each arch is going to be spread out, or
there are going to be some headers which can only appear once, or
whatever.  So you can't just mmap portions of our multiarch object
file to retrieve the bits you want, like you can with tar.

* We'll want everyone to agree to the specifics of this format.
That's a tall order no matter what we choose, but complexity will
stand in the way of getting something actually interoperable.
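
To illustrate the ELF header point from the list above: the header
carries exactly one e_machine field, so a single ELF file can only
declare one target ISA.  The Python sketch below reads that field
from the first 20 bytes of a file.  The function name elf_machine is
made up for this example; the field offsets and the two machine
constants come from the ELF specification.

```python
import struct

EM_X86_64 = 62     # machine values as assigned in the ELF specification
EM_AARCH64 = 183


def elf_machine(header):
    """Return the single e_machine value from the start of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    # e_ident[5] (EI_DATA) gives the byte order: 1 = little, 2 = big.
    endian = "<" if header[5] == 1 else ">"
    # e_type sits at offset 16 and e_machine at offset 18, both 16-bit.
    _e_type, e_machine = struct.unpack_from(endian + "HH", header, 16)
    return e_machine
```

There is no slot for a second value, which is why a multi-ISA object
file needs some convention layered on top of the format.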

In addition, this scheme doesn't play nicely with existing tools -- we
will likely need to patch binutils rather extensively in order to get
it to play sanely with this format.

I don't think we need to force anyone to use tar.  For the Intel phi,
putting everything into one object file probably makes sense, because
it's all the same ISA.  For NVPTX, we may want to continue having the
option of compiling into something which looks like nvcc's format.
(Although an extracted tarball would be compatible with all of
nvidia's tools, afaik.)  Maybe we'll want to support other existing
formats as well, for compatibility.  And we can trivially have a flag
that tells clang not to tar its output, if you're allergic.

-Justin

On Mon, Mar 7, 2016 at 5:11 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> ----- Original Message -----
>> From: "C Bergström via cfe-dev" <cfe-dev at lists.llvm.org>
>> To: "Justin Lebar" <jlebar at google.com>
>> Cc: "Alexey Bataev" <a.bataev at hotmail.com>, "C Bergström via cfe-dev" <cfe-dev at lists.llvm.org>, "Samuel F Antao"
>> <sfantao at us.ibm.com>, "John McCall" <rjmccall at gmail.com>
>> Sent: Monday, March 7, 2016 6:46:57 PM
>> Subject: Re: [cfe-dev] [RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver
>>
>> #1 OMP allows multiple code generation, but doesn't *require* it. It
>> wouldn't be invalid code if you only generated for a single target at
>> a time - which imho isn't that unreasonable. Why?! It's unlikely that
>> a user is going to try to build a single binary that runs across
>> different supercomputers. It's not like what ANL is getting will be a
>> mix (today/near future) of PHI+GPU. It will be a PHI shop.. ORNL is a
>> GPU shop. The only burden is the user building the source twice (not
>> that hard and what they do today anyway)
>
> I agree, but supercomputers are not the only relevant platforms. Lots of people have GPUs, and OpenMP offloading can also be used on various kinds of heterogeneous systems. I see no reason to design, at the driver level, for only a single target device type unless doing more is a significant implementation burden.
>
>  -Hal
>
>>
>> #2 This proposed tarball hack/wrapper thingie is just yuck design
>> imho. I think there are better and cleaner long-term solutions
>>
>> #3 re: "ARM ISA and Thumb via a runtime switch.  As a result, objdump
>> has a very difficult time figuring out how to disassemble code that
>> uses both ARM and Thumb."
>>
>> My proposed solution of prefixing/name mangling the symbol to include
>> "target" or optimization level solves this. It's almost exactly what
>> glibc does (I spent 15 minutes looking for this doc I've seen before,
>> but couldn't find it.. if really needed I'll go dig in the glibc
>> sources for examples - the "doc" I'm looking for could be on the
>> loader side though)
>>
>> In the meantime there's also this
>>
>> https://sourceware.org/glibc/wiki/libmvec
>> "For x86_64 vector functions names are created based on #2.6. Vector
>> Function Name Mangling from Vector ABI"
>>
>> https://sourceware.org/glibc/wiki/libmvec?action=AttachFile&do=view&target=VectorABI.txt
>>
>> Which explicitly handles this case and explicitly mentions OMP.
>> "Vector Function ABI provides ABI for vector functions generated by
>> compiler supporting SIMD constructs of OpenMP 4.0 [1]."
>>
>>
>> it may also be worthwhile looking at existing solutions more closely
>> https://gcc.gnu.org/wiki/FunctionSpecificOpt
>> https://gcc.gnu.org/onlinedocs/gcc-4.9.2/gcc/Function-Attributes.html
>>
>> "The target attribute is used to specify that a function is to be
>> compiled with different target options than specified on the command
>> line. This can be used for instance to have functions compiled with a
>> different ISA"
>> -------------
>> Side note on ARM - Their semi-unified ISA is actually the "right way"
>> to go. It's imho a good thing to have the vector or gpu instructions
>> unified as a direct extension to the scalar stuff. I won't go into
>> low-level details why, but in the end that design would win over one
>> where unified memory is possible, but separate load/store is required
>> for left side to talk to right side. (iow ARM thumb is imho not the
>> problem.. it's objdump - anyway objdump can't handle nvidia sass/ptx
>> blobs so it's probably more a "nice to have" instead of absolute
>> blocker)
>>
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory