[cfe-dev] [RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver

Mon Mar 7 16:46:57 PST 2016

#1 OpenMP allows code generation for multiple targets, but doesn't
*require* it. It wouldn't be invalid if you only generated code for a
single target at a time, which imho isn't that unreasonable. Why? It's
unlikely that a user is going to try to build a single binary that
runs across different supercomputers. It's not like what ANL is
getting will be a mix (today/near future) of PHI+GPU. It will be a PHI
shop. ORNL is a GPU shop. The only burden is the user building the
source twice (not that hard, and what they do today anyway).

#2 This proposed tarball hack/wrapper thingie is just yuck design
imho. I think there are better and cleaner long-term solutions.

#3 re: "ARM ISA and Thumb via a runtime switch.  As a result, objdump
has a very difficult time figuring out how to disassemble code that
uses both ARM and Thumb."

My proposed solution of prefixing/name mangling the symbol to include
the "target" or optimization level solves this. It's almost exactly
what glibc does (I spent 15 minutes looking for a doc I've seen
before, but couldn't find it. If really needed I'll go dig in the
glibc sources for examples; the "doc" I'm looking for could be on the
loader side, though).

In the meantime there's also this

https://sourceware.org/glibc/wiki/libmvec
"For x86_64 vector functions names are created based on #2.6. Vector
Function Name Mangling from Vector ABI"

https://sourceware.org/glibc/wiki/libmvec?action=AttachFile&do=view&target=VectorABI.txt

This explicitly handles this case and explicitly mentions OpenMP:
"Vector Function ABI provides ABI for vector functions generated by
compiler supporting SIMD constructs of OpenMP 4.0 [1]."
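
To make the mangling idea concrete, here is a rough sketch in C of how
such a name can be put together. It follows my reading of the x86_64
Vector ABI layout, "_ZGV" + ISA letter + mask letter + vector length +
one letter per parameter + "_" + scalar name. The helper name
vector_variant_name and the enum are made up for illustration; they
are not part of glibc or any compiler.

```c
#include <assert.h>
#include <stdio.h>

/* ISA letters as used by the x86_64 Vector ABI name mangling.
   The enum itself is hypothetical, only the letters come from the ABI. */
enum vec_isa { ISA_SSE = 'b', ISA_AVX = 'c', ISA_AVX2 = 'd', ISA_AVX512 = 'e' };

/* Hypothetical helper: writes "_ZGV<isa><mask><vlen><params>_<name>"
   into buf. 'params' holds one letter per argument, e.g. "v" for a
   single vector argument. Returns what snprintf returns. */
int vector_variant_name(char *buf, size_t len, enum vec_isa isa,
                        int masked, unsigned vlen,
                        const char *params, const char *scalar_name)
{
    assert(buf != NULL && params != NULL && scalar_name != NULL);
    return snprintf(buf, len, "_ZGV%c%c%u%s_%s",
                    (char)isa, masked ? 'M' : 'N', vlen,
                    params, scalar_name);
}
```

For example, an unmasked SSE variant of sin working on 2 doubles would
come out as _ZGVbN2v_sin. The same pattern could carry an offload
target instead of an ISA letter, which is all I'm proposing.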

It may also be worthwhile looking at existing solutions more closely:
https://gcc.gnu.org/wiki/FunctionSpecificOpt
https://gcc.gnu.org/onlinedocs/gcc-4.9.2/gcc/Function-Attributes.html

"The target attribute is used to specify that a function is to be
compiled with different target options than specified on the command
line. This can be used for instance to have functions compiled with a
different ISA"
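
A minimal sketch of what that looks like in practice, assuming GCC or
Clang on x86 (elsewhere it falls back to the plain version). The
function names sum_avx2, sum_generic and sum are made up for
illustration. Each variant gets its own symbol name and a small
dispatcher picks one at run time via __builtin_cpu_supports.

```c
#include <stddef.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_TARGET_ATTR 1
/* Compiled with AVX2 enabled, whatever -m flags were on the command line. */
__attribute__((target("avx2")))
static long sum_avx2(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}
#endif

/* Baseline variant, compiled with the command-line target options. */
static long sum_generic(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Hypothetical dispatcher: use the AVX2 variant only if the CPU has it. */
long sum(const int *a, size_t n)
{
#ifdef HAVE_X86_TARGET_ATTR
    if (__builtin_cpu_supports("avx2"))
        return sum_avx2(a, n);
#endif
    return sum_generic(a, n);
}
```

Both variants live in the same object file under different symbol
names, so tools only ever see one ISA per symbol.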
-------------
Side note on ARM: their semi-unified ISA is actually the "right way"
to go. It's imho a good thing to have the vector or GPU instructions
unified as a direct extension to the scalar ISA. I won't go into
low-level details why, but in the end that design would win over one
where unified memory is possible but separate loads/stores are
required for one side to talk to the other. (IOW ARM Thumb is imho not
the problem, objdump is. Anyway, objdump can't handle NVIDIA SASS/PTX
blobs, so it's probably more a "nice to have" than an absolute
blocker.)