[cfe-dev] [RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver

Mon Mar 7 12:53:49 PST 2016

> Requiring significant changes (beyond the normal changes to compiler paths and flags) in order to use OpenMP (including with accelerator support) should be avoided wherever possible.

One of the reasons I'm not convinced we should rule out creating
multiple object files is that if modifying your build system to
support this is hard, it's trivial to create a wrapper script to tar
and untar your object files:

clang-wrapper -c foo.cpp -fflags -o foo.tar
# Creates foo.tar containing foo.o, foo-sm_35.o, foo-compute_35.s.
clang-wrapper -link foo.tar bar.tar
# Untars foo.tar, bar.tar, and runs
#  clang foo.o foo-sm_35.o foo-compute_35.s ...

We're talking ~100 lines of Python here, which would represent a tiny
amount of complexity atop an already highly complex build system.
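
For concreteness, here is a rough sketch of the tar half of such a
wrapper.  The function names (bundle, unbundle, link_command) are made
up for illustration, and the actual compile and link invocations are
left out; this only shows the packing and unpacking around them.

```python
import os
import shutil
import tarfile


def bundle(object_paths, tar_path):
    """Pack the host and device outputs of one compilation into tar_path."""
    with tarfile.open(tar_path, "w") as tar:
        for path in object_paths:
            # Store only the base name so the archive unpacks flat.
            tar.add(path, arcname=os.path.basename(path))


def unbundle(tar_path, out_dir):
    """Unpack tar_path into out_dir and return the extracted file paths."""
    extracted = []
    with tarfile.open(tar_path, "r") as tar:
        for member in tar.getmembers():
            # Ignore anything that is not a plain file with a flat name.
            if not member.isfile() or os.path.basename(member.name) != member.name:
                continue
            dest = os.path.join(out_dir, member.name)
            with tar.extractfile(member) as src, open(dest, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(dest)
    return extracted


def link_command(tar_paths, work_dir, compiler="clang"):
    """Unpack every bundle and return the argv for the real link step."""
    inputs = []
    for tar_path in tar_paths:
        # One directory per bundle so equal member names cannot clash.
        out_dir = os.path.join(work_dir, os.path.basename(tar_path) + ".d")
        os.makedirs(out_dir, exist_ok=True)
        inputs += unbundle(tar_path, out_dir)
    return [compiler] + inputs
```

A real wrapper would also have to parse the command line, forward
flags, and clean up its temporary directories, which is where most of
the remaining lines would go.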

If for some reason using tar isn't an option, one could write a
wrapper which basically makes a tar out of the object file, shoving
all of the non-host code into special sections of the object file,
as you've suggested.  This shouldn't be substantially more complex
than creating a tar, and I think we agree that this would be very
unlikely to cause problems with a build system.
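
As a sketch of what that could look like, assuming GNU objcopy is
available and using its --add-section and --dump-section options: the
helpers below only build the command lines, they don't run anything.
The ".offload.<arch>" section naming and the function names are
invented for this example, not something nvcc or clang uses.

```python
def embed_command(host_obj, device_files, out_obj, objcopy="objcopy"):
    """Build an objcopy argv that copies host_obj to out_obj, adding one
    extra section per device file.

    device_files maps an architecture name (e.g. "sm_35") to a file path.
    """
    cmd = [objcopy]
    for arch, path in device_files.items():
        # Hypothetical naming scheme: one section per device architecture.
        cmd += ["--add-section", f".offload.{arch}={path}"]
    return cmd + [host_obj, out_obj]


def extract_command(bundled_obj, arch, out_path, objcopy="objcopy"):
    """Build an objcopy argv that writes one device section to out_path."""
    return [objcopy, "--dump-section", f".offload.{arch}={out_path}", bundled_obj]
```

The compile half of the wrapper would run the first command after the
real compilation; the link half would run the second once per
architecture before handing the pieces to the real link step.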

I'm not arguing here that such a wrapper is desirable, just that it's
possible and not particularly complex.  This, I think, expands the
universe of possibilities available for our consideration on the
compiler side.  I'd also like to have something which requires minimal
build system changes and is compatible with existing tools, even if my
priorities are inverted from yours.  :)

(FWIW I think the main arguments against such a wrapper are probably
its performance impact, and perhaps that if this is something everyone
is going to use, we should just build it in by default.)

I agree that the next step should be to look at prior art.  It seems
to me that we don't need to solve the general problem of multiarch
compilation here -- we just need a solution for the architectures we
care about now and in the near future.  We already have an NVPTX
solution that I think is acceptable to everyone?  So what other
architectures do we need to look at, and what do existing compilers
do?

> Internally we (pathscale) use unified objects with symbol name mangling for device sections.

One of the most common complaints I hear about ARM is that it can
switch between the full ARM ISA and Thumb via a runtime switch.  As a
result, objdump has a very difficult time figuring out how to
disassemble code that uses both ARM and Thumb.

This sounds like a path towards that Dark Side.  Not quite as bad, and
maybe not as bad as stuffing everything in a data section, but still.
:)

> The way llvm would handle sse4 symbol vs avx512 symbol overlaps with this quite a bit.

Eh, sse4 and avx512 instructions are unambiguous, so that seems
totally sensible.

-Justin

On Sun, Mar 6, 2016 at 8:56 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> ----- Original Message -----
>> From: "Justin Lebar via cfe-dev" <cfe-dev at lists.llvm.org>
>> To: "Samuel F Antao" <sfantao at us.ibm.com>
>> Cc: "Alexey Bataev" <a.bataev at hotmail.com>, "C Bergström via cfe-dev" <cfe-dev at lists.llvm.org>, "John McCall"
>> <rjmccall at gmail.com>
>> Sent: Saturday, March 5, 2016 11:18:54 AM
>> Subject: Re: [cfe-dev] [RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver
>>
>> > Ok, you could link each component, but then you couldn't do
>> > anything, because the device side only works if you have that
>> > specific host code, allocating the data and invoking the kernel.
>>
>> Sure, you'd have to do something after linking to group everything
>> together.  Like, more reasonably, you could link together all the
>> device object files, and then link together the host object files
>> plus the one device blob using a tool which understands this blob.
>>
>> Or you could just pass all of the object files to a tool which
>> understands the difference between host and device object files and
>> will DTRT.
>>
>> > B: nvcc -rdc=true a.o b.o -o a.out
>> > Wouldn't it be desirable to have clang support case B as well?
>>
>> Sure, yes.  It's maybe worth elaborating on how we support case A
>> today.  We compile the .cu file for device once for each device
>> architecture, generating N .s files and N corresponding .o files.
>> (The .s files are assembled by a black-box tool from nvidia.)  We
>> then feed both the .s and .o files to another tool from nvidia,
>> which makes one "fat binary".  We finally incorporate the fatbin
>> into the host object file while compiling.
>>
>> Which sounds a lot like what I was telling you I didn't want to do, I
>> know.  :)  But the reason I think it's different is that there exists
>> a widely-adopted one-object-file format for cuda/nvptx.  So if you do
>> the above in the right way, which we do, all of nvidia's binary tools
>> (objdump, etc) just work.  Moreover, there are no real alternative
>> tools to break by this scheme -- the ISA is proprietary, and nobody
>> has bothered to write such a tool, to my knowledge.  If they did, I
>> suspect they'd make it compatible with nvidia's (and thus our)
>> format.
>>
>> Since we already have this format and it's well-supported by tools
>> etc, we'd probably want to support in clang unbundling the CUDA code
>> at linktime, just like nvcc.
>>
>> Anyway, back to your question, where we're dealing with an ISA which
>> does not have a well-established bundling format.  In this case, I
>> don't think it would be unreasonable to support
>>
>>   clang a-host.o a-device.o b-host.o b-device.o -o a.out
>>
>> clang could presumably figure out the architecture of each file
>> either from its name, from some sort of -x params, or by inspecting
>> the file -- all three would have good precedent.
>>
>> The only issue is whether or not this should instead look like
>>
>>   clang a.tar b.tar -o a.out
>>
>> The functionality is exactly the same.
>>
>> If we use tar or invent a new format, we don't necessarily have to
>> change build systems.  But we've either opened a new can of worms by
>> adding a file format that is rather more expressive than we want into
>> clang (tar is the obvious choice, but it's not a great fit; no random
>> access, no custom metadata, lots of edge cases to handle as errors,
>> etc), or we've made up a new file format with all the problems we've
>> discussed.
>
> Many of the projects that will use this feature are very large, with highly non-trivial build systems. Requiring significant changes (beyond the normal changes to compiler paths and flags) in order to use OpenMP (including with accelerator support) should be avoided wherever possible. This is much more important than ensuring tools compatibility with other compilers (although accomplishing both goals simultaneously seems even better still). Based on my reading of this thread, it seems like we have several options (these come to mind):
>
>  1. Use multiple object files. Many build-system changes can be avoided by using some scheme for guessing the name of the device object files from that of the host object file. This often won't work, however, because many build systems copy object files around, add them to static archives, etc. and the device object files would be missed in these operations.
>
>  2. Use some kind of bundling format. tar, zip, ar, etc. seem like workable options. Any user who runs 'file' on them will easily guess how to extract the data. objdump, etc., however, won't know how to handle these directly (which can also have build-system implications, although more rare than for (1)).
>
>  3. Treat the input/output object file name as a directory, and store in that directory the host and device object files. This might be effectively transparent, but also suffers from potential build-system problems (rm -f won't work, for example).
>
>  4. Store the device code in special sections of the host object file. This seems the most build-system friendly, although perhaps the most complicated to implement on our end. Also, as has been pointed out, it is the technique nvcc uses.
>
> All things considered, I think that I'd prefer (4). If we're picking an option to minimize build-system changes, which I fully support, picking the option with the smallest chance of incompatibilities seems optimal. There is also other (prior) art here, and we should find out how GCC is handling this in GCC 6 for OpenACC and/or OpenMP 4 (https://gcc.gnu.org/wiki/OpenACC). Also, we can check on PGI and/or Pathscale (for OpenACC, OpenHMPP, etc.), in addition to any relevant details of what nvcc does here.
>
> Thanks again,
> Hal
>
>>
>> -Justin
>>
>> On Fri, Mar 4, 2016 at 5:06 PM, Samuel F Antao <sfantao at us.ibm.com>
>> wrote:
>> >
>> > 2016-03-04 19:42 GMT-05:00 Justin Lebar via cfe-dev
>> > <cfe-dev at lists.llvm.org>:
>> >>
>> >> > In your opinion, if we add support for a new programming model,
>> >> > say OpenMP, should we try to convert CudaAction into something
>> >> > more generic (say DeviceAction) or is adding actions for each
>> >> > programming model the way to go?
>> >>
>> >> Oh, I don't have a strong opinion.  It may or may not make sense
>> >> to combine them, depending on whether OpenMP needs different
>> >> arguments to its actions than CUDA needs.
>> >>
>> >> > How would generating two separate files help ld?
>> >>
>> >> Presumably you could still link all the host object files
>> >> together, and, if you have a linker for your device target, you
>> >> could also link those.
>> >
>> >
>> > Ok, you could link each component, but then you couldn't do
>> > anything, because the device side only works if you have that
>> > specific host code, allocating the data and invoking the kernel.
>> > Unless you compile CUDA code and then disregard the host code and
>> > use only the device code with some unrelated host object, which
>> > seems a rather twisted use case.
>> >
>> >>
>> >>
>> >> > So, if we are to use a wrapper around the driver you would be
>> >> > able to pack the outputs in whatever format. What about the
>> >> > inputs? We would need to add options to enable passing multiple
>> >> > inputs for the same compilation, right?
>> >>
>> >> I see inputs as a completely different question from outputs.
>> >>
>> >> In CUDA, a single input file contains both host and device code.
>> >> I presume the same is true for OpenMP?  If for some reason you
>> >> need to pass in multiple input files to a single compilation
>> >> (setting aside the question of whether or not this is a good
>> >> requirement to have -- it seems like a big departure from how C++
>> >> compilation normally works), you can just pass multiple inputs to
>> >> clang.  Certainly we shouldn't expect users to bundle up multiple
>> >> input files using some external tool just to pass them to the
>> >> driver?
>> >>
>> >> Maybe I'm missing something here again, sorry.
>> >>
>> >
>> > Yes, for OpenMP it is the same. The problem is not when the input
>> > is source, but when we do separate compilation. I know the current
>> > CUDA implementation in clang doesn't support it, but let's assume I
>> > would like to make something on top of the current implementation
>> > to make it work.
>> >
>> > I have a.cu and b.cu, both with a CUDA kernel. Now, b.cu has a
>> > device function that is also used in a.cu.
>> >
>> > If I use NVCC I could do:
>> >
>> > A: nvcc a.cu b.cu -rdc=true -o a.out
>> >
>> > but I also could do:
>> >
>> > B: nvcc a.cu -rdc=true -c
>> > B: nvcc b.cu -rdc=true -c
>> > B: nvcc -rdc=true a.o b.o -o a.out
>> >
>> > (nvcc incorporates device code in the *.o, then at link time it
>> > extracts it, links it, and embeds the result in the host)
>> >
>> > Wouldn't it be desirable to have clang support case B as well? I
>> > don't have statistics, but I suspect that most applications use B;
>> > I think it is not common for users to pass all the source files at
>> > once to the compiler. Maybe in CUDA you find several A's (kernels
>> > are explicitly outlined so users cared about organizing the code
>> > differently), but for OpenMP, B's are going to be the majority.
>> >
>> > Thanks,
>> > Samuel
>> >
>> >>
>> >>
>> >> On Fri, Mar 4, 2016 at 2:37 PM, Samuel F Antao
>> >> <sfantao at us.ibm.com> wrote:
>> >> > Hi Justin,
>> >> >
>> >> > 2016-03-04 17:20 GMT-05:00 Justin Lebar via cfe-dev
>> >> > <cfe-dev at lists.llvm.org>:
>> >> >>
>> >> >> > So, in your opinion, should we create an action for each
>> >> >> > programming model or should we have a generic one?
>> >> >>
>> >> >> We currently have generic Actions, like "CompileAction".  I
>> >> >> think those should stay?  BindArch and the like add a lot of
>> >> >> complexity, maybe there's a way to get rid of those, merging
>> >> >> their information into the other Actions.
>> >> >>
>> >> >> Does that answer your question?  I'm afraid I may be
>> >> >> misunderstanding.
>> >> >
>> >> >
>> >> > What I meant is:
>> >> >
>> >> > In your opinion, if we add support for a new programming model,
>> >> > say OpenMP, should we try to convert CudaAction into something
>> >> > more generic (say DeviceAction) or is adding actions for each
>> >> > programming model the way to go?
>> >> >
>> >> >>
>> >> >>
>> >> >> > I have some application that I've been compiling with clang,
>> >> >> > and I usually just run "make". Now I read somewhere that a new
>> >> >> > release of clang has support for CUDA and I happen to have a
>> >> >> > nice loop that I could implement with CUDA. So, I add a new
>> >> >> > file with the new implementation, then I run "make", it
>> >> >> > compiles but when I run it crashes. The reason it crashes is
>> >> >> > that I was using separate compilation and now I need to change
>> >> >> > all my makefile rules to forward a new kind of file, that I may
>> >> >> > not even know what it is.
>> >> >>
>> >> >> Again, I do not think that we should make up new file formats
>> >> >> and incorporate them into clang so that people can use new
>> >> >> compiler features without modifying their makefiles.
>> >> >>
>> >> >> I think it is far more important that low-level tools such as
>> >> >> ld and objdump continue to work on the files that the compiler
>> >> >> outputs.  That likely means we'll have to output N separate
>> >> >> files, one for the host and one for each device arch.
>> >> >>
>> >> >> But hey, this is just my opinion, and I'm a nobody here.  No
>> >> >> offense taken if the community decides otherwise.
>> >> >>
>> >> >
>> >> > Sure, I understand! :) I think you brought some valid points to
>> >> > the discussion given that you probably had to think carefully
>> >> > about these things in your CUDA work, so I really appreciate you
>> >> > taking some of your time to engage in this discussion.
>> >> >
>> >> > I agree with most of what you said regarding the file format. We
>> >> > should rely on existing formats if possible.
>> >> >
>> >> > So, for the example you just stated. How would generating two
>> >> > separate files help ld? Ld won't know how to combine them unless
>> >> > there is a driver that understands offloading that tells it how
>> >> > to do it.
>> >> >
>> >> > My goal here is not to say that doing it one way or the other is
>> >> > right or wrong, just trying to fully understand the basis for
>> >> > your opinion.
>> >> >
>> >> > Thanks again!
>> >> > Samuel
>> >> >
>> >> >>
>> >> >> On Fri, Mar 4, 2016 at 2:14 PM, Samuel F Antao
>> >> >> <sfantao at us.ibm.com>
>> >> >> wrote:
>> >> >> >
>> >> >> >
>> >> >> > 2016-03-04 14:40 GMT-05:00 Justin Lebar via cfe-dev
>> >> >> > <cfe-dev at lists.llvm.org>:
>> >> >> >>
>> >> >> >> > If, as you say, building the Action graph for CUDA and
>> >> >> >> > OpenMP is
>> >> >> >> > complicated, I think we should fix that.
>> >> >> >>
>> >> >> >> It occurs to me that perhaps all you want is to build up the
>> >> >> >> Action graph in a non-language-specific manner, and then pass
>> >> >> >> that to e.g. CUDA-specific code that will massage the Action
>> >> >> >> graph into what it wants.
>> >> >> >>
>> >> >> >> I don't know if that would be an improvement over the current
>> >> >> >> situation -- there are a lot of edge cases -- but it might.
>> >> >> >
>> >> >> >
>> >> >> > That's a possible approach. Could be a good way to organize
>> >> >> > it. However, if you have two different programming models
>> >> >> > those transformations would happen in a given sequence, so the
>> >> >> > one that comes last will have to be aware of the programming
>> >> >> > model that was used for the first transformation. This
>> >> >> > wouldn't be as clean as having the host actions (which are
>> >> >> > always the same for a given file and options) and having all
>> >> >> > the job generation orbit around that.
>> >> >> >
>> >> >> > Let me study the problem of doing this with actions and see
>> >> >> > all the possible implications.
>> >> >> >
>> >> >> >>
>> >> >> >>
>> >> >> >> On Fri, Mar 4, 2016 at 11:34 AM, Justin Lebar
>> >> >> >> <jlebar at google.com>
>> >> >> >> wrote:
>> >> >> >> >> This has two objectives. One is to avoid the creation of
>> >> >> >> >> actions that are programming model specific. The other is
>> >> >> >> >> to remove complexity from the action creation that would
>> >> >> >> >> have to mix phases and different programming models DAG
>> >> >> >> >> requirements
>> >> >> >> >
>> >> >> >> > As I understand this, we're saying that we'll build up an
>> >> >> >> > action graph, but it is sort of a lie, in that it does not
>> >> >> >> > encapsulate all of the logic we're interested in.  Then,
>> >> >> >> > when we convert the actions into jobs, we'll postprocess
>> >> >> >> > them using language-specific logic to make the jobs do what
>> >> >> >> > we want.
>> >> >> >> >
>> >> >> >> > I am not in favor of this approach, as I understand it.
>> >> >> >> > Although I acknowledge that it would simplify building the
>> >> >> >> > Action graph itself, it does so by moving this complexity
>> >> >> >> > into a "shadow Action graph" -- the DAG that *actually*
>> >> >> >> > describes what we're going to do (which may never be
>> >> >> >> > explicitly constructed, but still exists in our minds).  I
>> >> >> >> > don't think this is actually a simplification.
>> >> >> >> >
>> >> >> >> > If, as you say, building the Action graph for CUDA and
>> >> >> >> > OpenMP is complicated, I think we should fix that.  Then
>> >> >> >> > we'll be able to continue using our existing tools to e.g.
>> >> >> >> > inspect the Action graph generated by the driver.
>> >> >> >> >
>> >> >> >> >> I see the driver already as a wrapper, so I don't think
>> >> >> >> >> it is not
>> >> >> >> >> appropriate to use it.
>> >> >> >> >
>> >> >> >> > You and I, being compiler hackers, understand that the
>> >> >> >> > driver is a wrapper.  However, to a user, the driver is the
>> >> >> >> > compiler.  No build system invokes clang -cc1 directly.
>> >> >> >> >
>> >> >> >> >> However, I think the creation of the blob should be done
>> >> >> >> >> by an external tool, say, as if it were a linker.
>> >> >> >> >
>> >> >> >> > Sure, but this isn't the difference I was getting at.  What
>> >> >> >> > I was trying to say is that the creation of the blob should
>> >> >> >> > be done by a tool which is external to the compiler *from
>> >> >> >> > the perspective of the user*.  Meaning that, the driver
>> >> >> >> > should not invoke this tool.  If the user wants it, they can
>> >> >> >> > invoke it explicitly (as they might use tar to bundle their
>> >> >> >> > object files).
>> >> >> >> >
>> >> >> >> >> I'd put it in this way: a bundled file should work as a
>> >> >> >> >> normal host file, regardless of what device code it embeds.
>> >> >> >> >
>> >> >> >> > OK, but this still makes all existing tools useless if I
>> >> >> >> > want to inspect device code.  If you give me a .o file and
>> >> >> >> > tell me that it's device code, I can inspect it, disassemble
>> >> >> >> > it, or whatever using existing tools.  If it's a bundle in a
>> >> >> >> > file format we made up here on this list, there's very
>> >> >> >> > little chance existing tools are going to let me get the
>> >> >> >> > device code out in a sensible way.
>> >> >> >> >
>> >> >> >> > Again, I don't think that inventing file formats -- however
>> >> >> >> > simple -- is a business that we should be getting into.
>> >> >> >> >
>> >> >> >> >> Even for ELF, I agree putting the code in some section is
>> >> >> >> >> more elegant.  I'll investigate the possibilities to
>> >> >> >> >> implement that.
>> >> >> >> >
>> >> >> >> > Maybe, but unless there's a way to annotate that section and
>> >> >> >> > say "this section contains code for architecture foo", then
>> >> >> >> > objdump isn't going to work sensibly on that section, and I
>> >> >> >> > think that's basically game over.
>> >> >> >> >
>> >> >> >> >> On the other side, we have text files. My opinion is that
>> >> >> >> >> we should have something that is easy to read and edit.
>> >> >> >> >> What would a bundled text file look like in your opinion?
>> >> >> >> >
>> >> >> >> > Similarly, this will not interoperate with any existing
>> >> >> >> > tools, and I think that's job zero.
>> >> >> >> >
>> >> >> >> > On Fri, Mar 4, 2016 at 11:06 AM, Samuel F Antao
>> >> >> >> > <sfantao at us.ibm.com>
>> >> >> >> > wrote:
>> >> >> >> >> Hi Justin,
>> >> >> >> >>
>> >> >> >> >> It's great to have your feedback!
>> >> >> >> >>
>> >> >> >> >> 2016-03-03 17:09 GMT-05:00 Justin Lebar via cfe-dev
>> >> >> >> >> <cfe-dev at lists.llvm.org>:
>> >> >> >> >>>
>> >> >> >> >>> Hi, I'm one of the people working on CUDA in clang.
>> >> >> >> >>>
>> >> >> >> >>> In general I agree that the support for CUDA today is
>> >> >> >> >>> rather ad-hoc; it can likely be improved.  However, there
>> >> >> >> >>> are many points in this proposal that I do not
>> >> >> >> >>> understand.  Inasmuch as I think I understand it, I am
>> >> >> >> >>> concerned that it's adding new abstractions instead of
>> >> >> >> >>> fixing the existing ones, and that this will result in a
>> >> >> >> >>> lot of additional complexity.
>> >> >> >> >>>
>> >> >> >> >>> > a) Create toolchains for host and offload devices
>> >> >> >> >>> > before creating the actions.
>> >> >> >> >>> >
>> >> >> >> >>> > The driver has to detect the employed programming
>> >> >> >> >>> > models through the provided options (e.g. -fcuda or
>> >> >> >> >>> > -fopenmp) or file extensions. For each host and
>> >> >> >> >>> > offloading device and programming model, it should
>> >> >> >> >>> > create a toolchain.
>> >> >> >> >>>
>> >> >> >> >>> Seems sane to me.
>> >> >> >> >>>
>> >> >> >> >>> > b) Keep the generation of Actions independent of the
>> >> >> >> >>> > programming model.
>> >> >> >> >>> >
>> >> >> >> >>> > In my view, the Actions should only depend on the
>> >> >> >> >>> > compile phases requested by the user and the file
>> >> >> >> >>> > extensions of the input files. Only the way those
>> >> >> >> >>> > actions are interpreted to create jobs should be
>> >> >> >> >>> > dependent on the programming model.  This would avoid
>> >> >> >> >>> > complicating the actions creation with dependencies
>> >> >> >> >>> > that only make sense to some programming models, which
>> >> >> >> >>> > would make the implementation hard to scale when new
>> >> >> >> >>> > programming models are to be adopted.
>> >> >> >> >>>
>> >> >> >> >>> I don't quite understand what you're proposing here, or
>> >> >> >> >>> what you're trying to accomplish with this change.
>> >> >> >> >>>
>> >> >> >> >>> Perhaps it would help if you could give a concrete
>> >> >> >> >>> example of how this would change e.g. CUDA or Mac
>> >> >> >> >>> universal binary compilation?
>> >> >> >> >>>
>> >> >> >> >>> For example, in CUDA compilation, we have an action which
>> >> >> >> >>> says "compile everything below here as cuda arch sm_35".
>> >> >> >> >>> sm_35 comes from a command-line flag, so as I understand
>> >> >> >> >>> your proposal, this could not be in the action graph,
>> >> >> >> >>> because it doesn't come from the filename or the compile
>> >> >> >> >>> phases requested by the user.  So, how will we express
>> >> >> >> >>> this notion that some actions should be compiled for a
>> >> >> >> >>> particular arch?
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> This has two objectives. One is to avoid the creation of
>> >> >> >> >> actions that are programming model specific. The other is
>> >> >> >> >> to remove complexity from the action creation that would
>> >> >> >> >> have to mix phases and different programming models DAG
>> >> >> >> >> requirements - currently CUDA only requires one single
>> >> >> >> >> dependency but if you have more programming models with
>> >> >> >> >> different requirements and add separate compilation on top
>> >> >> >> >> of that, the action generation will become complex and hard
>> >> >> >> >> to scale. Just to clarify, I am not saying that creating
>> >> >> >> >> actions for each programming model won't work, I just think
>> >> >> >> >> that doing this differently will ensure that adding new
>> >> >> >> >> programming models will be less disruptive as the
>> >> >> >> >> programming model specifics will be contained in a single
>> >> >> >> >> place.
>> >> >> >> >>
>> >> >> >> >> The way I see it is that an action just packs some
>> >> >> >> >> information processed from a bunch of input info. However,
>> >> >> >> >> creating an action specific for a programming model does
>> >> >> >> >> not prevent you from having to have dedicated logic to deal
>> >> >> >> >> with it when the jobs are created. So, given that the input
>> >> >> >> >> info that results in an action is also available when the
>> >> >> >> >> jobs are created, what I propose is to do all the
>> >> >> >> >> programming model specifics in a single place. We already
>> >> >> >> >> have a cache of results in the jobs builder that could help
>> >> >> >> >> navigate the dependences and, even better, the queries this
>> >> >> >> >> cache can provide can be completely agnostic of the
>> >> >> >> >> programming model.
>> >> >> >> >>
>> >> >> >> >> Let me try to give you an example on how this proposal
>> >> >> >> >> would affect CUDA:
>> >> >> >> >>
>> >> >> >> >> - Let's assume that the actions are generated the same way
>> >> >> >> >> they are for the host. And that we already have in the
>> >> >> >> >> driver the host toolchain and also the nvptx toolchain,
>> >> >> >> >> each marked with a new toolchain kind "CUDA" (these
>> >> >> >> >> toolchains were inferred from the options used to invoke
>> >> >> >> >> the driver and/or file extensions).
>> >> >> >> >>
>> >> >> >> >> - The jobs start to be created for the host as usual.
>> >> >> >> >>
>> >> >> >> >> - Before any job is constructed there would be a
>> >> >> >> >> post-processing of the results, so that extra results
>> >> >> >> >> could be appended if required by the programming model.
>> >> >> >> >>
>> >> >> >> >> - This is what would happen in the post-processing
>> >> >> >> >> function:
>> >> >> >> >> {
>> >> >> >> >>   if (!isThisCUDAHostToolChain)
>> >> >> >> >>     return;
>> >> >> >> >>
>> >> >> >> >>   if (!ActionIsCompile)
>> >> >> >> >>     return;
>> >> >> >> >>
>> >> >> >> >>   if (InputActionDependence.type != TY_CUDA)
>> >> >> >> >>     return;
>> >> >> >> >>
>> >> >> >> >>   // Make checks currently in buildCudaActions()
>> >> >> >> >>
>> >> >> >> >>   DevTC = getDeviceToolChainOfKind(CUDA);
>> >> >> >> >>   Action *Asm = CachedResults().giveMeDependentAsmAction();
>> >> >> >> >>
>> >> >> >> >>   for (c : CUDAComputeCapabilities) {
>> >> >> >> >>     NewResult = BuildJobsForAction(DevTC, Asm);
>> >> >> >> >>     // Or maybe better
>> >> >> >> >>     NewResult = BuildJobsForAction(DevTC, LinkAction(Asm));
>> >> >> >> >>
>> >> >> >> >>     Results.push_back(NewResult);
>> >> >> >> >>   }
>> >> >> >> >> }
>> >> >> >> >>
>> >> >> >> >> CachedResults would offer some extra functionality that is
>> >> >> >> >> not programming model specific, and this would provide the
>> >> >> >> >> same functionality the CUDA action is providing. Adding a new
>> >> >> >> >> programming model would only require adding an instance of
>> >> >> >> >> this post-process (apart from the creation of the toolchains
>> >> >> >> >> that would occur before anything starts to be done).
>> >> >> >> >>
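To make the post-processing idea quoted above a bit more concrete, here
is a minimal, self-contained sketch.  None of these names exist in
clang: Job, JobContext, PostProcessHook and runPostProcessing are
hypothetical stand-ins for the driver's real types.  It only shows the
shape of the design, where each programming model contributes one hook
that may append device results to the host results.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins for driver concepts; not clang's real classes.
struct Job {
  std::string ToolChain; // e.g. "host" or "cuda-device"
  std::string Arch;      // bound arch, empty for the host
};

struct JobContext {
  bool IsCudaHostToolChain = false;
  bool IsCompileAction = false;
  bool InputIsCuda = false;
  std::vector<std::string> CudaArchs; // e.g. {"sm_35", "sm_50"}
};

// One hook per programming model: it inspects the context and may
// append extra results.
using PostProcessHook =
    std::function<void(const JobContext &, std::vector<Job> &)>;

// CUDA hook mirroring the early-return checks of the pseudo-code.
void cudaPostProcess(const JobContext &Ctx, std::vector<Job> &Results) {
  if (!Ctx.IsCudaHostToolChain || !Ctx.IsCompileAction || !Ctx.InputIsCuda)
    return;
  for (const std::string &Arch : Ctx.CudaArchs)
    Results.push_back(Job{"cuda-device", Arch});
}

// Creates the host job, then lets every registered hook post-process
// the result list.
std::vector<Job> runPostProcessing(const JobContext &Ctx,
                                   const std::vector<PostProcessHook> &Hooks) {
  std::vector<Job> Results{Job{"host", ""}};
  for (const PostProcessHook &Hook : Hooks) {
    assert(Hook && "hooks must be callable");
    Hook(Ctx, Results);
  }
  return Results;
}
```

Under these assumptions, supporting another programming model would
amount to writing one more function with the PostProcessHook signature
and adding it to the list passed to runPostProcessing.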
>> >> >> >> >> I agree these things are complicated to fully
>> >> >> >> >> understand/explain based on a summary in an email. I'll try
>> >> >> >> >> to come up with a proposal-patch early next week so that we
>> >> >> >> >> have something more concrete to discuss.
>> >> >> >> >>
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>> > c) Use unbundling and bundling tools agnostic of the
>> >> >> >> >>> > programming model.
>> >> >> >> >>> >
>> >> >> >> >>> > I propose a single change in the action creation, and that
>> >> >> >> >>> > is the creation of an "unbundling" and a "bundling" action
>> >> >> >> >>> > whose goal is to spare the user from having to deal with
>> >> >> >> >>> > multiple files generated from multiple toolchains (host
>> >> >> >> >>> > toolchain and offloading devices' toolchains) if he uses
>> >> >> >> >>> > separate compilation in his build system.
>> >> >> >> >>>
>> >> >> >> >>> I'm not sure I understand what "separate compilation" is
>> >> >> >> >>> here. Do you mean a compilation strategy which outputs
>> >> >> >> >>> logically separate machine code for each architecture, only
>> >> >> >> >>> to have this code combined at link time?  (In contrast to how
>> >> >> >> >>> we currently compile CUDA, where the device code for a file
>> >> >> >> >>> is integrated into the host code for that file at compile
>> >> >> >> >>> time?)
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> That's correct. With separate compilation I also mean the
>> >> >> >> >> ability to link device side code, using a device linker
>> >> >> >> >> (nvlink for CUDA).
>> >> >> >> >>
>> >> >> >> >>>
>> >> >> >> >>> If that's right, then what I understand you're proposing here
>> >> >> >> >>> is that, instead of outputting N different object files --
>> >> >> >> >>> one for the host, and N-1 for all our device architectures --
>> >> >> >> >>> we'd just output one blob which clang would understand how to
>> >> >> >> >>> handle.
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> Correct.
>> >> >> >> >>
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>> For my part, I am highly wary of introducing a new file
>> >> >> >> >>> format into clang's output.  Historically, clang (along with
>> >> >> >> >>> other compilers) does not output proprietary blobs.  Instead,
>> >> >> >> >>> we output object files in well-understood, interoperable
>> >> >> >> >>> formats, such as ELF.  This is beneficial because there are
>> >> >> >> >>> lots of existing tools which can handle these files.  It also
>> >> >> >> >>> allows e.g. code compiled with clang to be linked with g++.
>> >> >> >> >>>
>> >> >> >> >>> Build tools are universally awful, and I sympathize with the
>> >> >> >> >>> urge not to change them.  But I don't think this is a
>> >> >> >> >>> business we want the compiler to be in.  Instead, if a user
>> >> >> >> >>> wants this kind of "fat object file", they could obtain one
>> >> >> >> >>> by using a simple wrapper around clang.  If this wrapper's
>> >> >> >> >>> output format became widely-used, we could then consider
>> >> >> >> >>> supporting it directly within clang, but that's a proposition
>> >> >> >> >>> for many years in the future.
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> I see the driver already as a wrapper, so I think it is
>> >> >> >> >> appropriate to use it. However, I think the creation of the
>> >> >> >> >> blob should be done by an external tool, say, as if it were a
>> >> >> >> >> linker. I have an initial proposal in
>> >> >> >> >>
>> >> >> >> >> http://lists.llvm.org/pipermail/cfe-dev/2016-February/047548.html,
>> >> >> >> >> but based on your input and also Jonas's, I have to rethink a
>> >> >> >> >> few things.
>> >> >> >> >>
>> >> >> >> >> I agree when you say that you would like to have the blob
>> >> >> >> >> working well with other tools. Jonas also expressed this
>> >> >> >> >> concern in a previous email. I'd put it this way: a bundled
>> >> >> >> >> file should work as a normal host file, regardless of what
>> >> >> >> >> device code it embeds.
>> >> >> >> >>
>> >> >> >> >> For ELF files this works just fine:
>> >> >> >> >>
>> >> >> >> >> clang a.c -c -o a.o
>> >> >> >> >> echo "Some offloading bytes" >> a.o
>> >> >> >> >> clang a.o -o a.out
>> >> >> >> >> a.out
>> >> >> >> >>
>> >> >> >> >> However, for other binary formats we need to wrap it in a
>> >> >> >> >> different way. Even for ELF, I agree putting the code in some
>> >> >> >> >> section is more elegant. I'll investigate the possibilities
>> >> >> >> >> to implement that.
>> >> >> >> >>
>> >> >> >> >> On the other side, we have text files. My opinion is that we
>> >> >> >> >> should have something that is easy to read and edit. What
>> >> >> >> >> would a bundled text file look like in your opinion?
>> >> >> >> >>
>> >> >> >> >> Do you think having all the device code guarded as a comment
>> >> >> >> >> at the bottom is acceptable? That would work well as a host
>> >> >> >> >> file.
>> >> >> >> >>
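As a rough illustration of the "device code guarded as a comment at the
bottom" idea for text files, here is a small sketch.  Everything in it
is an assumption made for illustration: the marker strings are
invented, and it presumes that '#' starts a line comment in the host
file's syntax, which holds for some assemblers but not for all of them.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical marker lines; no such format is defined by clang.
// Markers start with "#@" while device lines start with "# ", so a
// device line can never be mistaken for a marker.
const std::string BeginMark = "#@offload-begin ";
const std::string EndMark = "#@offload-end";

// Appends the device text to the host text, each device line prefixed
// with "# " so a host tool sees it only as comments.
std::string bundleText(const std::string &Host, const std::string &Target,
                       const std::string &Device) {
  std::ostringstream Out;
  Out << Host;
  if (!Host.empty() && Host.back() != '\n')
    Out << '\n';
  Out << BeginMark << Target << '\n';
  std::istringstream In(Device);
  for (std::string Line; std::getline(In, Line);)
    Out << "# " << Line << '\n';
  Out << EndMark << '\n';
  return Out.str();
}

// Recovers the device text stored for Target, or "" if there is none.
std::string unbundleDevice(const std::string &Bundle,
                           const std::string &Target) {
  std::istringstream In(Bundle);
  std::string Result;
  bool Inside = false;
  for (std::string Line; std::getline(In, Line);) {
    if (Line == BeginMark + Target) {
      Inside = true;
    } else if (Line == EndMark) {
      Inside = false;
    } else if (Inside) {
      assert(Line.rfind("# ", 0) == 0 && "device lines stay commented");
      Result += Line.substr(2) + '\n';
    }
  }
  return Result;
}
```

The host part stays untouched at the top of the file, so the bundle
remains easy to read and edit by hand, and each device section can be
extracted again by its target name.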
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>> > d) Allow the target toolchain to request the host toolchain
>> >> >> >> >>> > to be used for a given action.
>> >> >> >> >>>
>> >> >> >> >>> Seems sane to me.
>> >> >> >> >>>
>> >> >> >> >>> > e) Use a job results cache to enable sharing results between
>> >> >> >> >>> > device and host toolchains.
>> >> >> >> >>>
>> >> >> >> >>> I don't understand why we need a cache for job results.  Why
>> >> >> >> >>> can we not set up the Action graph such that each node has
>> >> >> >> >>> the correct inputs?  (You've actually sketched exactly what I
>> >> >> >> >>> think the Action graph should look like, for CUDA and OpenMP
>> >> >> >> >>> compilations.)
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> I think what I explain above covers this one. If not, please
>> >> >> >> >> let me know. Just to summarize, I'm not saying expressing
>> >> >> >> >> things in Actions won't work, I just think that will be more
>> >> >> >> >> complex if we have multiple programming models (all
>> >> >> >> >> potentially used in the same compile) and separate
>> >> >> >> >> compilation in place. We already have a cache in the jobs
>> >> >> >> >> builder; I was just planning to leverage that.
>> >> >> >> >>
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>> > f) Intercept the jobs creation before the emission of the
>> >> >> >> >>> > command.
>> >> >> >> >>> >
>> >> >> >> >>> > In my view this is the only change required in the driver
>> >> >> >> >>> > (apart from the obvious toolchain changes) that would be
>> >> >> >> >>> > dependent on the programming model. A job result
>> >> >> >> >>> > post-processing function could check that there are
>> >> >> >> >>> > offloading toolchains to be used and spawn the jobs creation
>> >> >> >> >>> > for those toolchains, as well as append results from one
>> >> >> >> >>> > toolchain to the results of some other, according to the
>> >> >> >> >>> > needs of the programming model implementation.
>> >> >> >> >>>
>> >> >> >> >>> Again it's not clear to me why we cannot and should not
>> >> >> >> >>> represent this in the Action graph.  It's that graph that's
>> >> >> >> >>> supposed to tell us what we're going to do.
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> I guess I covered this above; if not, let me know.
>> >> >> >> >>
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>> > g) Reflect the offloading programming model in the naming of
>> >> >> >> >>> > the save-temps files.
>> >> >> >> >>>
>> >> >> >> >>> We already do this somewhat; e.g. for CUDA with save-temps,
>> >> >> >> >>> we'll output foo.s and foo-sm_35.s.  Extending this to be
>> >> >> >> >>> more robust (e.g. including the triple) seems fine.
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> Yes, programming model, host/device (in OpenMP the same
>> >> >> >> >> triple can be used for both host and device), and bound arch
>> >> >> >> >> will make sure we get unique names.
>> >> >> >> >>
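A small sketch of how such names could be composed follows.  The naming
scheme, the OffloadKind enum and saveTempsName are all hypothetical;
the thread only establishes that programming model, host/device role
and bound arch (possibly with the triple) together should give unique
names.

```cpp
#include <cassert>
#include <string>

// Hypothetical enum; the names are illustrative only.
enum class OffloadKind { None, CUDA, OpenMP };

// Builds a temp-file name of the (assumed) form
//   base[-model-(host|device)-triple[-arch]].ext
std::string saveTempsName(const std::string &Base, const std::string &Ext,
                          OffloadKind Kind, bool IsDevice,
                          const std::string &Triple,
                          const std::string &BoundArch) {
  assert(!Base.empty() && !Ext.empty() && "need a base name and extension");
  std::string Name = Base;
  if (Kind != OffloadKind::None) {
    Name += (Kind == OffloadKind::CUDA) ? "-cuda" : "-openmp";
    Name += IsDevice ? "-device-" : "-host-";
    Name += Triple;
    if (!BoundArch.empty())
      Name += "-" + BoundArch;
  }
  return Name + "." + Ext;
}
```

With this scheme an OpenMP host file and an OpenMP device file that
share a triple still differ in the host/device component, which is the
case raised in the quoted reply.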
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>> > h) Use special options -target-offload=<triple> to specify
>> >> >> >> >>> > offloading targets and delimit options meant for a toolchain.
>> >> >> >> >>>
>> >> >> >> >>> I think I agree that we should generalize the flags we're
>> >> >> >> >>> using.
>> >> >> >> >>>
>> >> >> >> >>> I'm not sold on the name or structure (I'm not aware of any
>> >> >> >> >>> other flags that affect *all* flags following them?), but we
>> >> >> >> >>> can bikeshed about that separately.
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> I guess we only have -Xblah and friends to change how the
>> >> >> >> >> next option is used. I agree, this issue is in many ways
>> >> >> >> >> orthogonal to everything else in this proposal; we can
>> >> >> >> >> address it separately.
>> >> >> >> >>
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>> > i) Use the offload kinds in the toolchain to drive the
>> >> >> >> >>> > commands generation by Tools.
>> >> >> >> >>>
>> >> >> >> >>> I'm not sure exactly what this means, but it doesn't
>> >> >> >> >>> sound
>> >> >> >> >>> particularly contentious.  :)
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> Sorry about that... My explanations get convoluted
>> >> >> >> >> sometimes...
>> >> >> >> >>
>> >> >> >> >> What I mean is that, instead of relying on a file input or
>> >> >> >> >> attributes of an action, a command can be generated by
>> >> >> >> >> looking at the offloading kind of the toolchain.
>> >> >> >> >>
>> >> >> >> >> E.g.
>> >> >> >> >>
>> >> >> >> >> isCuda = isToolChainKind(Toolchain::OFFLOAD_KINDS_CUDA);
>> >> >> >> >>
>> >> >> >> >> or
>> >> >> >> >>
>> >> >> >> >> if (isHostToolChainKind(Toolchain::OFFLOAD_KINDS_CUDA))
>> >> >> >> >>   AuxTriple =
>> >> >> >> >>       getDeviceToolChain(Toolchain::OFFLOAD_KINDS_CUDA);
>> >> >> >> >>
>> >> >> >> >> This would allow a programming model to tune things here and
>> >> >> >> >> there. Remember that the same toolchain can, in general, be
>> >> >> >> >> used by different programming models, and simultaneously by
>> >> >> >> >> host and devices. So being able to do things based on a kind
>> >> >> >> >> simplifies things a lot.
>> >> >> >> >>
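One way to picture a toolchain that serves several programming models
at once, as host for some and device for others, is a pair of bitmasks
of offload kinds.  This is only a sketch under that assumption:
ToolChainInfo and the OFK_* constants below are invented for
illustration and are not the names used in the quoted pseudo-code or in
clang.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical offload kinds, one bit each so they can be combined.
enum OffloadKind : std::uint8_t { OFK_None = 0, OFK_CUDA = 1, OFK_OpenMP = 2 };

// Records for which programming models a toolchain acts as host and
// for which it acts as device; both sets may be non-empty at once.
class ToolChainInfo {
  std::uint8_t HostKinds = 0;
  std::uint8_t DeviceKinds = 0;

public:
  void addHostKind(OffloadKind K) {
    assert(K != OFK_None && "expected a real offload kind");
    HostKinds = static_cast<std::uint8_t>(HostKinds | K);
  }
  void addDeviceKind(OffloadKind K) {
    assert(K != OFK_None && "expected a real offload kind");
    DeviceKinds = static_cast<std::uint8_t>(DeviceKinds | K);
  }
  bool isHostFor(OffloadKind K) const { return (HostKinds & K) != 0; }
  bool isDeviceFor(OffloadKind K) const { return (DeviceKinds & K) != 0; }
  bool isUsedBy(OffloadKind K) const { return isHostFor(K) || isDeviceFor(K); }
};
```

A Tool building a command line could then branch on something like
TC.isHostFor(OFK_CUDA) rather than on the input file type or on
attributes of the action.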
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>> > 3. We are willing to help with implementation of
>> >> >> >> >>> > CUDA-specific parts when they overlap with the common
>> >> >> >> >>> > infrastructure; though we expect that effort to be driven
>> >> >> >> >>> > also by other contributors specifically interested in CUDA
>> >> >> >> >>> > support that have the necessary know-how (both on CUDA
>> >> >> >> >>> > itself and how it is supported in Clang / LLVM).
>> >> >> >> >>>
>> >> >> >> >>> Given that this is work that doesn't really help CUDA (the
>> >> >> >> >>> driver works fine for us as-is), I am not sure we'll be able
>> >> >> >> >>> to devote significant resources to this project.  Of course
>> >> >> >> >>> we'll be available to assist with relevant code reviews and
>> >> >> >> >>> give advice.
>> >> >> >> >>>
>> >> >> >> >>> I think like any other change to clang, the responsibility
>> >> >> >> >>> will rest on the authors not to break existing functionality,
>> >> >> >> >>> at the very least inasmuch as is checked by existing unit
>> >> >> >> >>> tests.
>> >> >> >> >>>
>> >> >> >> >>
>> >> >> >> >> Sure, having your feedback/suggestions and help with code
>> >> >> >> >> review is all we ask for! We will try not to break anything
>> >> >> >> >> (and if for some reason we do, we will fix it right away).
>> >> >> >> >> Also, if we find opportunities to improve the CUDA support we
>> >> >> >> >> will be happy to contribute that as well.
>> >> >> >> >>
>> >> >> >> >> I hope I addressed the concerns you expressed initially. Let
>> >> >> >> >> me know any other thoughts you may have.
>> >> >> >> >>
>> >> >> >> >> Thanks again!
>> >> >> >> >> Samuel
>> >> >> >> >>
>> >> >> >> >>>
>> >> >> >> >>> Regards,
>> >> >> >> >>> -Justin
>> >> >> >> >>>
>> >> >> >> >>> On Thu, Mar 3, 2016 at 12:03 PM, Samuel F Antao via
>> >> >> >> >>> cfe-dev
>> >> >> >> >>> <cfe-dev at lists.llvm.org> wrote:
>> >> >> >> >>> > Hi Chris,
>> >> >> >> >>> >
>> >> >> >> >>> > I agree with Andrey when he says this should be a separate
>> >> >> >> >>> > discussion.
>> >> >> >> >>> >
>> >> >> >> >>> > I think that aiming at having a library that would support
>> >> >> >> >>> > any possible programming model would take a long time, as it
>> >> >> >> >>> > requires a lot of consensus, namely from those maintaining
>> >> >> >> >>> > programming models already in clang (e.g. CUDA). We should
>> >> >> >> >>> > try to have something incremental.
>> >> >> >> >>> >
>> >> >> >> >>> > I'm happy to discuss and know more about the design and code
>> >> >> >> >>> > you would like to contribute to this, but I think you should
>> >> >> >> >>> > post it in a different thread.
>> >> >> >> >>> >
>> >> >> >> >>> > Thanks,
>> >> >> >> >>> > Samuel
>> >> >> >> >>> >
>> >> >> >> >>> > 2016-03-03 11:20 GMT-05:00 C Bergström
>> >> >> >> >>> > <cfe-dev at lists.llvm.org>:
>> >> >> >> >>> >>
>> >> >> >> >>> >> On Thu, Mar 3, 2016 at 10:19 PM, Ronan Keryell
>> >> >> >> >>> >> <ronan at keryell.fr>
>> >> >> >> >>> >> wrote:
>> >> >> >> >>> >> >>>>>> On Thu, 3 Mar 2016 18:19:43 +0700, C Bergström
>> >> >> >> >>> >> >>>>>> via
>> >> >> >> >>> >> >>>>>> cfe-dev
>> >> >> >> >>> >> >>>>>> <cfe-dev at lists.llvm.org> said:
>> >> >> >> >>> >> >
>> >> >> >> >>> >> >     C> On Thu, Mar 3, 2016 at 5:50 PM, Ronan
>> >> >> >> >>> >> >     KERYELL via
>> >> >> >> >>> >> > cfe-dev
>> >> >> >> >>> >> >     C> <cfe-dev at lists.llvm.org> wrote:
>> >> >> >> >>> >> >
>> >> >> >> >>> >> >     >> Just to be sure to understand: you are thinking
>> >> >> >> >>> >> >     >> about being able to outline several "languages"
>> >> >> >> >>> >> >     >> at once, such as CUDA *and* OpenMP, right?
>> >> >> >> >>> >> >     >>
>> >> >> >> >>> >> >     >> I think it is required for serious applications.
>> >> >> >> >>> >> >     >> For example, in the HPC world, it is common to
>> >> >> >> >>> >> >     >> have hybrid multi-node heterogeneous applications
>> >> >> >> >>> >> >     >> that use MPI+OpenMP+OpenCL. Since MPI and OpenCL
>> >> >> >> >>> >> >     >> are just libraries, there is only OpenMP to
>> >> >> >> >>> >> >     >> off-load here. But if we move to OpenCL SYCL
>> >> >> >> >>> >> >     >> instead, with MPI+OpenMP+SYCL, then both OpenMP
>> >> >> >> >>> >> >     >> and SYCL have to be managed by the Clang
>> >> >> >> >>> >> >     >> off-loading infrastructure at the same time, and
>> >> >> >> >>> >> >     >> we have to be sure they combine gracefully...
>> >> >> >> >>> >> >     >>
>> >> >> >> >>> >> >     >> I think your second proposal about (un)bundling
>> >> >> >> >>> >> >     >> can already manage this.
>> >> >> >> >>> >> >     >>
>> >> >> >> >>> >> >     >> Otherwise, what about the code outlining itself
>> >> >> >> >>> >> >     >> used in the off-loading process? The code
>> >> >> >> >>> >> >     >> generation itself requires outlining the kernel
>> >> >> >> >>> >> >     >> code to some external functions to be compiled by
>> >> >> >> >>> >> >     >> the kernel compiler. Do you think it is up to the
>> >> >> >> >>> >> >     >> programmer to re-use the recipes used by OpenMP
>> >> >> >> >>> >> >     >> and CUDA for example, or would it be interesting
>> >> >> >> >>> >> >     >> to have a third proposal to abstract the outliner
>> >> >> >> >>> >> >     >> further, so it is configurable to handle OpenMP,
>> >> >> >> >>> >> >     >> CUDA, SYCL... globally?
>> >> >> >> >>> >> >
>> >> >> >> >>> >> >     C> Some very good points above and back to my
>> >> >> >> >>> >> >     C> broken record..
>> >> >> >> >>> >> >
>> >> >> >> >>> >> >     C> If all offloading is done in a single unified
>> >> >> >> >>> >> >     C> library -
>> >> >> >> >>> >> >     C> a. Lowering in LLVM is greatly simplified since
>> >> >> >> >>> >> >     C> there's ***1*** offload API to be supported. A
>> >> >> >> >>> >> >     C> region that's outlined for SYCL, CUDA or
>> >> >> >> >>> >> >     C> something else is essentially the same thing. (I
>> >> >> >> >>> >> >     C> do realize that some transformation may be highly
>> >> >> >> >>> >> >     C> target specific, but to me that's more target hw
>> >> >> >> >>> >> >     C> driven than programming model driven)
>> >> >> >> >>> >> >
>> >> >> >> >>> >> >     C> b. Mixing CUDA/OMP/ACC/Foo in theory may "just
>> >> >> >> >>> >> >     C> work" since the same runtime will handle them
>> >> >> >> >>> >> >     C> all. (With the limitation that if you want CUDA
>> >> >> >> >>> >> >     C> to *talk to* OMP or something else there needs to
>> >> >> >> >>> >> >     C> be some glue.  I'm merely saying that 1
>> >> >> >> >>> >> >     C> application could use multiple models in a way
>> >> >> >> >>> >> >     C> that won't conflict)
>> >> >> >> >>> >> >
>> >> >> >> >>> >> >     C> c. The driver doesn't need to figure out "do I
>> >> >> >> >>> >> >     C> link against some or a multitude of
>> >> >> >> >>> >> >     C> combining/conflicting libcuda, libomp,
>> >> >> >> >>> >> >     C> libsomething" - it's liboffload - done
>> >> >> >> >>> >> >
>> >> >> >> >>> >> > Yes, a unified target library would help.
>> >> >> >> >>> >> >
>> >> >> >> >>> >> >     C> The driver proposal and the liboffload proposal
>> >> >> >> >>> >> >     C> should imnsho be tightly coupled and work
>> >> >> >> >>> >> >     C> together as *1*. The goals are significantly
>> >> >> >> >>> >> >     C> overlapping and relevant. If you get the
>> >> >> >> >>> >> >     C> liboffload OMP people to make that more agnostic
>> >> >> >> >>> >> >     C> - I think it simplifies the driver work.
>> >> >> >> >>> >> >
>> >> >> >> >>> >> > So basically it is about introducing a fourth
>> >> >> >> >>> >> > unification: liboffload.
>> >> >> >> >>> >> >
>> >> >> >> >>> >> > A great unification sounds great.
>> >> >> >> >>> >> > My only concern is that if we tie everything together, it
>> >> >> >> >>> >> > would increase the entry cost: all the different
>> >> >> >> >>> >> > components should be ready in lock-step.
>> >> >> >> >>> >> > If there is already a runtime available, it would be
>> >> >> >> >>> >> > easier to start with and develop the other part in the
>> >> >> >> >>> >> > meantime.
>> >> >> >> >>> >> > So from a pragmatic agile point-of-view, I would prefer
>> >> >> >> >>> >> > not to impose a strong unification.
>> >> >> >> >>> >>
>> >> >> >> >>> >> I think I may not be explaining clearly - let me elaborate
>> >> >> >> >>> >> by example a bit below
>> >> >> >> >>> >>
>> >> >> >> >>> >> > In the proposal of Samuel, all the parts seem
>> >> >> >> >>> >> > independent.
>> >> >> >> >>> >> >
>> >> >> >> >>> >> >     C>   ------ More specific to this proposal -
>> >> >> >> >>> >> >     C> device linker vs host linker. What do you do for
>> >> >> >> >>> >> >     C> IPA/LTO or whole program optimizations? (Outside
>> >> >> >> >>> >> >     C> the scope of this project.. ?)
>> >> >> >> >>> >> >
>> >> >> >> >>> >> > Ouch. I did not think about it. It sounds like
>> >> >> >> >>> >> > science-fiction for now. :-) Probably outside the scope
>> >> >> >> >>> >> > of this project..
>> >> >> >> >>> >>
>> >> >> >> >>> >> It should certainly not be science fiction or an
>> >> >> >> >>> >> after-thought. I won't go into shameless self promotion,
>> >> >> >> >>> >> but there are certainly useful things you can do when you
>> >> >> >> >>> >> have a "whole device kernel" perspective.
>> >> >> >> >>> >>
>> >> >> >> >>> >> To digress into the liboffload component of this (sorry)
>> >> >> >> >>> >> what we have today is basically liboffload/src/all source
>> >> >> >> >>> >> files mucked together
>> >> >> >> >>> >>
>> >> >> >> >>> >> What I'm proposing would look more like this
>> >> >> >> >>> >>
>> >> >> >> >>> >> liboffload/src/common_middle_layer_glue
>> >> >> >> >>> >>     # to start this may be "best effort"
>> >> >> >> >>> >> liboffload/src/omp
>> >> >> >> >>> >>     # This code should exist today, but ideally should
>> >> >> >> >>> >>     # build on top of the middle layer
>> >> >> >> >>> >> liboffload/src/ptx
>> >> >> >> >>> >>     # this may exist today - not sure
>> >> >> >> >>> >> liboffload/src/amd_gpu
>> >> >> >> >>> >>     # probably doesn't exist, but wouldn't/shouldn't block
>> >> >> >> >>> >>     # anything
>> >> >> >> >>> >> liboffload/src/phi
>> >> >> >> >>> >>     # may exist in some form
>> >> >> >> >>> >> liboffload/src/cuda
>> >> >> >> >>> >>     # may exist in some form outside of the OMP work
>> >> >> >> >>> >>
>> >> >> >> >>> >> The end result would be liboffload.
>> >> >> >> >>> >>
>> >> >> >> >>> >> Above and below the common middle layer API are
>> >> >> >> >>> >> programming model or hardware specific. To add a new hw
>> >> >> >> >>> >> backend you just implement the things the middle layer
>> >> >> >> >>> >> needs. To add a new programming model you build on top of
>> >> >> >> >>> >> the common layer. I'm not trying to force anyone/everyone
>> >> >> >> >>> >> to switch to this now - I'm hoping that by being a squeaky
>> >> >> >> >>> >> wheel this isolation of design and layers is there from
>> >> >> >> >>> >> the start - even if not perfect. I think it's sloppy to
>> >> >> >> >>> >> not consider this actually. LLVM's code generation is
>> >> >> >> >>> >> clean and has a nice separation per target (for the most
>> >> >> >> >>> >> part) - why should the offload library have bad design
>> >> >> >> >>> >> which just needs to be refactored later? I've seen others
>> >> >> >> >>> >> in the community beat up Intel to force them to have
>> >> >> >> >>> >> higher quality code before inclusion... some of this may
>> >> >> >> >>> >> actually be just minor refactoring to come close to the
>> >> >> >> >>> >> target. (No pun intended)
>> >> >> >> >>> >> -------------
>> >> >> >> >>> >> If others become open to this design - I'm happy to
>> >> >> >> >>> >> contribute more tangible details on the actual middle API.
>> >> >> >> >>> >>
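The layering described above (hardware backends below a common middle
layer, programming models on top) might look roughly like the sketch
below.  The actual middle API is not given anywhere in this thread, so
every name here (DeviceBackend, MiddleLayer, CountingBackend,
ompTargetRegion and the kernel name prefix) is invented purely to show
the separation of layers.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical contract a hardware backend would implement.
struct DeviceBackend {
  virtual ~DeviceBackend() = default;
  // Runs a named kernel and returns 0 on success.
  virtual int launch(const std::string &Kernel) = 0;
};

// Common middle layer: owns the backends and knows nothing about
// OpenMP, CUDA or any other programming model.
class MiddleLayer {
  std::map<std::string, std::unique_ptr<DeviceBackend>> Backends;

public:
  void addBackend(const std::string &Name, std::unique_ptr<DeviceBackend> B) {
    assert(B && "backend must not be null");
    Backends[Name] = std::move(B);
  }
  // Returns -1 when no backend with that name is registered.
  int launch(const std::string &Name, const std::string &Kernel) {
    auto It = Backends.find(Name);
    return It == Backends.end() ? -1 : It->second->launch(Kernel);
  }
};

// Stand-in backend that only counts launches; no real device involved.
struct CountingBackend : DeviceBackend {
  int Launches = 0;
  int launch(const std::string &) override {
    ++Launches;
    return 0;
  }
};

// A programming-model layer built purely on top of the middle layer.
int ompTargetRegion(MiddleLayer &ML, const std::string &Device,
                    const std::string &Region) {
  return ML.launch(Device, "__omp_offload_" + Region);
}
```

In this picture a new hardware backend is one more DeviceBackend
subclass, and a new programming model is one more set of functions
written against MiddleLayer, neither of which needs to know about the
other.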
>> >> >> >> >>> >> The objects which the driver has to deal with may and
>> >> >> >> >>> >> probably do overlap to some extent with the objects
>> >> >> >> >>> >> liboffload has to load or deal with. Is there an API the
>> >> >> >> >>> >> driver can hook into to magically handle that, or is it
>> >> >> >> >>> >> all per-device and one-off?
>> >> >> >> >>> >> _______________________________________________
>> >> >> >> >>> >> cfe-dev mailing list
>> >> >> >> >>> >> cfe-dev at lists.llvm.org
>> >> >> >> >>> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory