[cfe-dev] [RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver

Fri Mar 4 14:20:52 PST 2016

> So, in your opinion, should we create an action for each programming model or
> should we have a generic one?

We currently have generic Actions, like "CompileAction".  I think those should
stay.  BindArch and the like add a lot of complexity; maybe there's a way to
get rid of them by merging their information into the other Actions.
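
A rough sketch of what merging that information could look like follows.  It
is not clang's actual Action hierarchy: Action, makeAction, describe, and the
convention that an empty string means "no arch bound" are all hypothetical
names and choices made up for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a driver Action (not clang's classes).
struct Action {
  std::string Kind;       // e.g. "compile", "assemble", "link"
  std::string BoundArch;  // e.g. "sm_35"; empty means "no arch bound" (assumption)
  std::vector<std::shared_ptr<Action>> Inputs;
};

// Build a node that inherits its arch from its first input unless one is
// given explicitly, so no separate BindArch-style wrapper node is needed.
std::shared_ptr<Action> makeAction(std::string Kind,
                                   std::vector<std::shared_ptr<Action>> Inputs,
                                   std::string Arch = "") {
  assert(!Kind.empty() && "an action needs a kind");
  if (Arch.empty() && !Inputs.empty())
    Arch = Inputs.front()->BoundArch;
  return std::make_shared<Action>(
      Action{std::move(Kind), std::move(Arch), std::move(Inputs)});
}

// Render one node roughly the way a graph dump might, e.g. "compile (sm_35)".
std::string describe(const Action &A) {
  return A.BoundArch.empty() ? A.Kind : A.Kind + " (" + A.BoundArch + ")";
}
```

With this shape, a compile node built on top of an input bound to sm_35 picks
up the arch on its own, and a host-only node simply carries no arch.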

Does that answer your question?  I'm afraid I may be misunderstanding.

> I have some application that I've been compiling with clang, and I usually
> just run "make". Now I read somewhere that a new release of clang has
> support for CUDA and I happen to have a nice loop that I could implement with
> CUDA. So, I add a new file with the new implementation, then I run "make"; it
> compiles, but when I run it, it crashes. The reason it crashes is that I was
> using separate compilation and now I need to change all my makefile rules to
> forward a new kind of file, and I may not even know what that file is.

Again, I do not think that we should make up new file formats and incorporate
them into clang so that people can use new compiler features without modifying
their makefiles.

I think it is far more important that low-level tools such as ld and objdump
continue to work on the files that the compiler outputs.  That likely means
we'll have to output N separate files, one for the host and one for each device
arch.
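
As a sketch of the "N separate files" idea: the helper below produces one
object-file name for the host plus one per device arch.  The function name
outputNames and the "-<arch>" suffix are assumptions for illustration only,
loosely modeled on how -save-temps already names foo.s and foo-sm_35.s for
CUDA; nothing here is an existing driver API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: one output name for the host, then one per device arch.
std::vector<std::string> outputNames(const std::string &Stem,
                                     const std::vector<std::string> &DeviceArchs) {
  assert(!Stem.empty() && "need a base name for the outputs");
  std::vector<std::string> Names;
  Names.push_back(Stem + ".o");  // host object
  for (const std::string &Arch : DeviceArchs)
    Names.push_back(Stem + "-" + Arch + ".o");  // one object per device arch
  return Names;
}
```

Each resulting file would be an ordinary object file for a single
architecture, so tools such as ld and objdump need no knowledge of offloading.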

But hey, this is just my opinion, and I'm a nobody here.  No offense taken if
the community decides otherwise.

On Fri, Mar 4, 2016 at 2:14 PM, Samuel F Antao <sfantao at us.ibm.com> wrote:
>
>
> 2016-03-04 14:40 GMT-05:00 Justin Lebar via cfe-dev
> <cfe-dev at lists.llvm.org>:
>>
>> > If, as you say, building the Action graph for CUDA and OpenMP is
>> > complicated, I think we should fix that.
>>
>> It occurs to me that perhaps all you want is to build up the Action
>> graph in a non-language-specific manner, and then pass that to e.g.
>> CUDA-specific code that will massage the Action graph into what it
>> wants.
>>
>> I don't know if that would be an improvement over the current
>> situation -- there are a lot of edge cases -- but it might.
>
>
> That's a possible approach. It could be a good way to organize it. However, if
> you have two different programming models, those transformations would happen
> in a given sequence, so the one that comes last would have to be aware of the
> programming model that was used for the first transformation. This wouldn't
> be as clean as having the host actions (which are always the same for a
> given file and options) and having all the job generation orbit around
> that.
>
> Let me study the problem of doing this with actions and see all the possible
> implications.
>
>>
>>
>> On Fri, Mar 4, 2016 at 11:34 AM, Justin Lebar <jlebar at google.com> wrote:
>> >> This has two objectives. One is to avoid the creation of actions that
>> >> are programming model specific. The other is to remove complexity from the
>> >> action creation that would have to mix phases and the DAG requirements of
>> >> different programming models
>> >
>> > As I understand this, we're saying that we'll build up an action
>> > graph, but it is sort of a lie, in that it does not encapsulate all of
>> > the logic we're interested in.  Then, when we convert the actions into
>> > jobs, we'll postprocess them using language-specific logic to make the
>> > jobs do what we want.
>> >
>> > I am not in favor of this approach, as I understand it.  Although I
>> > acknowledge that it would simplify building the Action graph itself,
>> > it does so by moving this complexity into a "shadow Action graph" --
>> > the DAG that *actually* describes what we're going to do (which may
>> > never be explicitly constructed, but still exists in our minds).  I
>> > don't think this is actually a simplification.
>> >
>> > If, as you say, building the Action graph for CUDA and OpenMP is
>> > complicated, I think we should fix that.  Then we'll be able to
>> > continue using our existing tools to e.g. inspect the Action graph
>> > generated by the driver.
>> >
>> >> I see the driver already as a wrapper, so I don't think it is
>> >> inappropriate to use it.
>> >
>> > You and I, being compiler hackers, understand that the driver is a
>> > wrapper.  However, to a user, the driver is the compiler.  No build
>> > system invokes clang -cc1 directly.
>> >
>> >> However, I think the creation of the blob should be done by an external
>> >> tool, say, as if it were a linker.
>> >
>> > Sure, but this isn't the difference I was getting at.  What I was
>> > trying to say is that the creation of the blob should be done by a
>> > tool which is external to the compiler *from the perspective of the
>> > user*.  Meaning that, the driver should not invoke this tool.  If the
>> > user wants it, they can invoke it explicitly (as they might use tar to
>> > bundle their object files).
>> >
>> >> I'd put it this way: a bundled file should work as a normal host
>> >> file, regardless of what device code it embeds.
>> >
>> > OK, but this still makes all existing tools useless if I want to
>> > inspect device code.  If you give me a .o file and tell me that it's
>> > device code, I can inspect it, disassemble it, or whatever using
>> > existing tools.  If it's a bundle in a file format we made up here on
>> > this list, there's very little chance existing tools are going to let
>> > me get the device code out in a sensible way.
>> >
>> > Again, I don't think that inventing file formats -- however simple --
>> > is a business that we should be getting into.
>> >
>> >> Even for ELF, I agree putting the code in some section is more elegant.
>> >> I'll investigate the possibilities to implement that.
>> >
>> > Maybe, but unless there's a way to annotate that section and say "this
>> > section contains code for architecture foo", then objdump isn't going
>> > to work sensibly on that section, and I think that's basically game
>> > over.
>> >
>> >> On the other side, we have text files. My opinion is that we should have
>> >> something that is easy to read and edit. What would a bundled text file
>> >> look like in your opinion?
>> >
>> > Similarly, this will not interoperate with any existing tools, and I
>> > think that's job zero.
>> >
>> > On Fri, Mar 4, 2016 at 11:06 AM, Samuel F Antao <sfantao at us.ibm.com>
>> > wrote:
>> >> Hi Justin,
>> >>
>> >> It's great to have your feedback!
>> >>
>> >> 2016-03-03 17:09 GMT-05:00 Justin Lebar via cfe-dev
>> >> <cfe-dev at lists.llvm.org>:
>> >>>
>> >>> Hi, I'm one of the people working on CUDA in clang.
>> >>>
>> >>> In general I agree that the support for CUDA today is rather ad-hoc; it
>> >>> can likely be improved.  However, there are many points in this proposal
>> >>> that I do not understand.  Inasmuch as I think I understand it, I am
>> >>> concerned that it's adding new abstractions instead of fixing the
>> >>> existing ones, and that this will result in a lot of additional
>> >>> complexity.
>> >>>
>> >>> > a) Create toolchains for host and offload devices before creating the
>> >>> > actions.
>> >>> >
>> >>> > The driver has to detect the employed programming models through the
>> >>> > provided options (e.g. -fcuda or -fopenmp) or file extensions. For each
>> >>> > host and offloading device and programming model, it should create a
>> >>> > toolchain.
>> >>>
>> >>> Seems sane to me.
>> >>>
>> >>> > b) Keep the generation of Actions independent of the program model.
>> >>> >
>> >>> > In my view, the Actions should only depend on the compile phases
>> >>> > requested by the user and the file extensions of the input files. Only
>> >>> > the way those actions are interpreted to create jobs should be
>> >>> > dependent on the programming model.  This would avoid complicating the
>> >>> > actions creation with dependencies that only make sense to some
>> >>> > programming models, which would make the implementation hard to scale
>> >>> > when new programming models are to be adopted.
>> >>>
>> >>> I don't quite understand what you're proposing here, or what you're
>> >>> trying to accomplish with this change.
>> >>>
>> >>> Perhaps it would help if you could give a concrete example of how this
>> >>> would change e.g. CUDA or Mac universal binary compilation?
>> >>>
>> >>> For example, in CUDA compilation, we have an action which says "compile
>> >>> everything below here as cuda arch sm_35".  sm_35 comes from a
>> >>> command-line flag, so as I understand your proposal, this could not be
>> >>> in the action graph, because it doesn't come from the filename or the
>> >>> compile phases requested by the user.  So, how will we express this
>> >>> notion that some actions should be compiled for a particular arch?
>> >>
>> >>
>> >> This has two objectives. One is to avoid the creation of actions that are
>> >> programming model specific. The other is to remove complexity from the
>> >> action creation that would have to mix phases and the DAG requirements of
>> >> different programming models - currently CUDA only requires one single
>> >> dependency, but if you have more programming models with different
>> >> requirements and add separate compilation on top of that, the action
>> >> generation will become complex and hard to scale. Just to clarify, I am
>> >> not saying that creating actions for each programming model won't work, I
>> >> just think that doing this differently will ensure that adding new
>> >> programming models will be less disruptive, as the programming model
>> >> specifics will be contained in a single place.
>> >>
>> >> The way I see it is that an action just packs some information processed
>> >> from a bunch of input info. However, creating an action specific for a
>> >> programming model does not prevent you from having to have dedicated
>> >> logic to deal with it when the jobs are created. So, given that the input
>> >> info that results in an action is also available when the jobs are
>> >> created, what I propose is to do all the programming model specifics in a
>> >> single place. We already have a cache of results in the jobs builder that
>> >> could help navigate the dependences and, even better, the queries this
>> >> cache can provide can be completely agnostic of the programming model.
>> >>
>> >> Let me try to give you an example of how this proposal would affect CUDA:
>> >>
>> >> - Let's assume that the actions are generated the same way they are for
>> >> the host, and that we already have in the driver the host toolchain and
>> >> also the nvptx toolchain, each marked with a new toolchain kind "CUDA"
>> >> (these toolchains were inferred from the options used to invoke the
>> >> driver and/or file extensions).
>> >>
>> >> - The jobs start to be created for the host as usual.
>> >>
>> >> - Before any job is constructed there would be a post-processing of the
>> >> results, so that extra results could be appended if required by the
>> >> programming model.
>> >>
>> >> - This is what would happen in the post-processing function:
>> >> {
>> >>   if (!isThisCUDAHostToolChain)
>> >>     return;
>> >>
>> >>   if (!ActionIsCompile)
>> >>     return;
>> >>
>> >>   if (InputActionDependence.type != TY_CUDA)
>> >>     return;
>> >>
>> >>   // Make the checks currently in buildCudaActions()
>> >>
>> >>   DevTC = getDeviceToolChainOfKind(CUDA);
>> >>   Action *Asm = CachedResults().giveMeDependentAsmAction();
>> >>
>> >>   for (c : CUDAComputeCapabilities) {
>> >>     NewResult = BuildJobsForAction(DevTC, Asm);
>> >>     // Or maybe better
>> >>     NewResult = BuildJobsForAction(DevTC, LinkAction(Asm));
>> >>
>> >>     Results.push_back(NewResult);
>> >>   }
>> >> }
>> >>
>> >> CachedResults would offer some extra functionality that is not
>> >> programming model specific, and this would provide the same functionality
>> >> the CUDA action is providing. Adding a new programming model would only
>> >> require adding an instance of this post-process (apart from the creation
>> >> of the toolchains that would occur before anything starts to be done).
>> >>
>> >> I agree these things are complicated to fully understand/explain based on
>> >> a summary in an email. I'll try to come up with a proposal-patch early
>> >> next week so that we have something more concrete to discuss.
>> >>
>> >>>
>> >>>
>> >>> > c) Use unbundling and bundling tools agnostic of the programming model.
>> >>> >
>> >>> > I propose a single change in the action creation and that is the
>> >>> > creation of an "unbundling" and a "bundling" action whose goal is to
>> >>> > spare the user from having to deal with multiple files generated from
>> >>> > multiple toolchains (host toolchain and offloading devices' toolchains)
>> >>> > if he uses separate compilation in his build system.
>> >>>
>> >>> I'm not sure I understand what "separate compilation" is here.  Do you
>> >>> mean a compilation strategy which outputs logically separate machine
>> >>> code for each architecture, only to have this code combined at link
>> >>> time?  (In contrast to how we currently compile CUDA, where the device
>> >>> code for a file is integrated into the host code for that file at
>> >>> compile time?)
>> >>
>> >>
>> >> That's correct. With separate compilation I also mean the ability to link
>> >> device side code, using a device linker (nvlink for CUDA).
>> >>
>> >>>
>> >>> If that's right, then what I understand you're proposing here is that,
>> >>> instead of outputting N different object files -- one for the host, and
>> >>> N-1 for all our device architectures -- we'd just output one blob which
>> >>> clang would understand how to handle.
>> >>
>> >>
>> >> Correct.
>> >>
>> >>>
>> >>>
>> >>> For my part, I am highly wary of introducing a new file format into
>> >>> clang's output.  Historically, clang (along with other compilers) does
>> >>> not output proprietary blobs.  Instead, we output object files in
>> >>> well-understood, interoperable formats, such as ELF.  This is beneficial
>> >>> because there are lots of existing tools which can handle these files.
>> >>> It also allows e.g. code compiled with clang to be linked with g++.
>> >>>
>> >>> Build tools are universally awful, and I sympathize with the urge not to
>> >>> change them.  But I don't think this is a business we want the compiler
>> >>> to be in.  Instead, if a user wants this kind of "fat object file", they
>> >>> could obtain one by using a simple wrapper around clang.  If this
>> >>> wrapper's output format became widely-used, we could then consider
>> >>> supporting it directly within clang, but that's a proposition for many
>> >>> years in the future.
>> >>
>> >>
>> >> I see the driver already as a wrapper, so I don't think it is
>> >> inappropriate to use it. However, I think the creation of the blob should
>> >> be done by an external tool, say, as if it were a linker. I have an
>> >> initial proposal in
>> >> http://lists.llvm.org/pipermail/cfe-dev/2016-February/047548.html, but
>> >> based on your input and also Jonas's, I have to rethink a few things.
>> >>
>> >> I agree when you say that you would like to have the blob working well
>> >> with other tools. Jonas in some previous email also expressed this
>> >> concern. I'd put it this way: a bundled file should work as a normal host
>> >> file, regardless of what device code it embeds.
>> >>
>> >> For ELF files this works just fine:
>> >>
>> >> clang a.c -c -o a.o
>> >> echo "Some offloading bytes" >> a.o
>> >> clang a.o -o a.out
>> >> a.out
>> >>
>> >> However, for other binary formats, we need to wrap it in a different way.
>> >> Even for ELF, I agree putting the code in some section is more elegant.
>> >> I'll investigate the possibilities to implement that.
>> >>
>> >> On the other side, we have text files. My opinion is that we should have
>> >> something that is easy to read and edit. What would a bundled text file
>> >> look like in your opinion?
>> >>
>> >> Do you think having all the device code guarded as a comment at the
>> >> bottom is acceptable? That would work well as a host file.
>> >>
>> >>>
>> >>>
>> >>> > d) Allow the target toolchain to request the host toolchain to be used
>> >>> > for a given action.
>> >>>
>> >>> Seems sane to me.
>> >>>
>> >>> > e) Use a job results cache to enable sharing results between device
>> >>> > and host toolchains.
>> >>>
>> >>> I don't understand why we need a cache for job results.  Why can we not
>> >>> set up the Action graph such that each node has the correct inputs?
>> >>> (You've actually sketched exactly what I think the Action graph should
>> >>> look like, for CUDA and OpenMP compilations.)
>> >>
>> >>
>> >> I think what I explain above covers this one. If not, please let me know.
>> >> Just to summarize, I'm not saying expressing things in Actions won't
>> >> work, I just think that will be more complex if we have multiple
>> >> programming models (all potentially used in the same compile) and
>> >> separate compilation in place. We already have a cache in the jobs
>> >> builder, I was just planning to leverage that.
>> >>
>> >>>
>> >>>
>> >>> > f) Intercept the jobs creation before the emission of the command.
>> >>> >
>> >>> > In my view this is the only change required in the driver (apart from
>> >>> > the obvious toolchain changes) that would be dependent on the
>> >>> > programming model. A job result post-processing function could check
>> >>> > that there are offloading toolchains to be used and spawn the jobs
>> >>> > creation for those toolchains as well as append results from one
>> >>> > toolchain to the results of some other according to the programming
>> >>> > model implementation needs.
>> >>>
>> >>> Again it's not clear to me why we cannot and should not represent this
>> >>> in the Action graph.  It's that graph that's supposed to tell us what
>> >>> we're going to do.
>> >>
>> >>
>> >> I guess I covered this above; if not, let me know.
>> >>
>> >>>
>> >>>
>> >>> > g) Reflect the offloading programming model in the naming of the
>> >>> > save-temps files.
>> >>>
>> >>> We already do this somewhat; e.g. for CUDA with save-temps, we'll output
>> >>> foo.s and foo-sm_35.s.  Extending this to be more robust (e.g. including
>> >>> the triple) seems fine.
>> >>
>> >>
>> >> Yes, programming model, host/device (in openmp same triple can be used
>> >> for both host and device), and bound arch will make sure we get unique
>> >> names.
>> >>
>> >>>
>> >>>
>> >>> > h) Use special options -target-offload=<triple> to specify offloading
>> >>> > targets and delimit options meant for a toolchain.
>> >>>
>> >>> I think I agree that we should generalize the flags we're using.
>> >>>
>> >>> I'm not sold on the name or structure (I'm not aware of any other flags
>> >>> that affect *all* flags following them?), but we can bikeshed about that
>> >>> separately.
>> >>
>> >>
>> >> I guess we only have -Xblah and friends to change how the next option is
>> >> used. I agree, this issue is in many ways orthogonal to everything else
>> >> in this proposal; we can address it separately.
>> >>
>> >>>
>> >>>
>> >>> > i) Use the offload kinds in the toolchain to drive the commands
>> >>> > generation by Tools.
>> >>>
>> >>> I'm not sure exactly what this means, but it doesn't sound
>> >>> particularly contentious.  :)
>> >>
>> >>
>> >> Sorry about that... My explanations get convoluted sometimes...
>> >>
>> >> What I mean is that, instead of relying on a file input, or attributes of
>> >> an action, a command can be generated by looking at the offloading kind
>> >> of the toolchain.
>> >>
>> >> E.g.
>> >>
>> >> isCuda = isToolChainKind(Toolchain::OFFLOAD_KINDS_CUDA);
>> >>
>> >> or
>> >>
>> >> if (isHostToolChainKind(Toolchain::OFFLOAD_KINDS_CUDA))
>> >>   AuxTriple = getDeviceToolChain(Toolchain::OFFLOAD_KINDS_CUDA);
>> >>
>> >> This would allow a programming model to tune things here and there.
>> >> Remember that the same toolchain can, in general, be used by different
>> >> programming models, and simultaneously by host and devices. So being able
>> >> to do things based on a kind simplifies things a lot.
>> >>
>> >>>
>> >>>
>> >>> > 3. We are willing to help with implementation of CUDA-specific parts
>> >>> > when they overlap with the common infrastructure; though we expect that
>> >>> > effort to be driven also by other contributors specifically interested
>> >>> > in CUDA support that have the necessary know-how (both on CUDA itself
>> >>> > and how it is supported in Clang / LLVM).
>> >>>
>> >>> Given that this is work that doesn't really help CUDA (the driver works
>> >>> fine for us as-is), I am not sure we'll be able to devote significant
>> >>> resources to this project.  Of course we'll be available to assist with
>> >>> relevant code reviews and give advice.
>> >>>
>> >>> I think like any other change to clang, the responsibility will rest on
>> >>> the authors not to break existing functionality, at the very least
>> >>> inasmuch as is checked by existing unit tests.
>> >>>
>> >>
>> >> Sure, having your feedback/suggestions and help with code review is all
>> >> we ask for! We will try not to break anything (and if for some reason we
>> >> do, we will fix it right away). Also, if we find opportunities to improve
>> >> the CUDA support we will be happy to contribute that as well.
>> >>
>> >> I hope I addressed the concerns you expressed initially. Let me know any
>> >> other thoughts you may have.
>> >>
>> >> Thanks again!
>> >> Samuel
>> >>
>> >>>
>> >>> Regards,
>> >>> -Justin
>> >>>
>> >>> On Thu, Mar 3, 2016 at 12:03 PM, Samuel F Antao via cfe-dev
>> >>> <cfe-dev at lists.llvm.org> wrote:
>> >>> > Hi Chris,
>> >>> >
>> >>> > I agree with Andrey when he says this should be a separate discussion.
>> >>> >
>> >>> > I think that aiming at having a library that would support any possible
>> >>> > programming model would take a long time, as it requires a lot of
>> >>> > consensus, namely from whoever is maintaining programming models
>> >>> > already in clang (e.g. CUDA). We should try to have something
>> >>> > incremental.
>> >>> >
>> >>> > I'm happy to discuss and know more about the design and code you would
>> >>> > like to contribute to this, but I think you should post it in a
>> >>> > different thread.
>> >>> >
>> >>> > Thanks,
>> >>> > Samuel
>> >>> >
>> >>> > 2016-03-03 11:20 GMT-05:00 C Bergström <cfe-dev at lists.llvm.org>:
>> >>> >>
>> >>> >> On Thu, Mar 3, 2016 at 10:19 PM, Ronan Keryell <ronan at keryell.fr>
>> >>> >> wrote:
>> >>> >> >>>>>> On Thu, 3 Mar 2016 18:19:43 +0700, C Bergström via cfe-dev
>> >>> >> >>>>>> <cfe-dev at lists.llvm.org> said:
>> >>> >> >
>> >>> >> >     C> On Thu, Mar 3, 2016 at 5:50 PM, Ronan KERYELL via cfe-dev
>> >>> >> >     C> <cfe-dev at lists.llvm.org> wrote:
>> >>> >> >
>> >>> >> >     >> Just to be sure to understand: you are thinking about being
>> >>> >> >     >> able to outline several "languages" at once, such as CUDA
>> >>> >> >     >> *and* OpenMP, right ?
>> >>> >> >     >>
>> >>> >> >     >> I think it is required for serious applications. For
>> >>> >> >     >> example, in the HPC world, it is common to have hybrid
>> >>> >> >     >> multi-node heterogeneous applications that use
>> >>> >> >     >> MPI+OpenMP+OpenCL for example. Since MPI and OpenCL are just
>> >>> >> >     >> libraries, there is only OpenMP to off-load here. But if we
>> >>> >> >     >> move to OpenCL SYCL instead with MPI+OpenMP+SYCL then both
>> >>> >> >     >> OpenMP and SYCL have to be managed by the Clang off-loading
>> >>> >> >     >> infrastructure at the same time and be sure they combine
>> >>> >> >     >> gracefully...
>> >>> >> >     >>
>> >>> >> >     >> I think your second proposal about (un)bundling can already
>> >>> >> >     >> manage this.
>> >>> >> >     >>
>> >>> >> >     >> Otherwise, what about the code outlining itself used in the
>> >>> >> >     >> off-loading process? The code generation itself requires to
>> >>> >> >     >> outline the kernel code to some external functions to be
>> >>> >> >     >> compiled by the kernel compiler. Do you think it is up to
>> >>> >> >     >> the programmer to re-use the recipes used by OpenMP and CUDA
>> >>> >> >     >> for example or it would be interesting to have a third
>> >>> >> >     >> proposal to abstract more the outliner to be configurable to
>> >>> >> >     >> handle globally OpenMP, CUDA, SYCL...?
>> >>> >> >
>> >>> >> >     C> Some very good points above and back to my broken record..
>> >>> >> >
>> >>> >> >     C> If all offloading is done in a single unified library -
>> >>> >> >     C> a. Lowering in LLVM is greatly simplified since there's
>> >>> >> >     C> ***1*** offload API to be supported. A region that's
>> >>> >> >     C> outlined for SYCL, CUDA or something else is essentially the
>> >>> >> >     C> same thing. (I do realize that some transformation may be
>> >>> >> >     C> highly target specific, but to me that's more target hw
>> >>> >> >     C> driven than programming model driven)
>> >>> >> >
>> >>> >> >     C> b. Mixing CUDA/OMP/ACC/Foo in theory may "just work" since
>> >>> >> >     C> the same runtime will handle them all. (With the limitation
>> >>> >> >     C> that if you want CUDA to *talk to* OMP or something else
>> >>> >> >     C> there needs to be some glue.  I'm merely saying that 1
>> >>> >> >     C> application with multiple models in a way that won't
>> >>> >> >     C> conflict)
>> >>> >> >
>> >>> >> >     C> c. The driver doesn't need to figure out do I link against
>> >>> >> >     C> some or a multitude of combining/conflicting libcuda,
>> >>> >> >     C> libomp, libsomething - it's liboffload - done
>> >>> >> >
>> >>> >> > Yes, a unified target library would help.
>> >>> >> >
>> >>> >> >     C> The driver proposal and the liboffload proposal should
>> >>> >> >     C> imnsho be tightly coupled and work together as *1*. The
>> >>> >> >     C> goals are significantly overlapping and relevant. If you get
>> >>> >> >     C> the liboffload OMP people to make that more agnostic - I
>> >>> >> >     C> think it simplifies the driver work.
>> >>> >> >
>> >>> >> > So basically it is about introducing a fourth unification:
>> >>> >> > liboffload.
>> >>> >> >
>> >>> >> > A great unification sounds great.
>> >>> >> > My only concern is that if we tie everything together, it would
>> >>> >> > increase the entry cost: all the different components should be
>> >>> >> > ready in lock-step. If there is already a runtime available, it
>> >>> >> > would be easier to start with and develop the other part in the
>> >>> >> > meantime. So from a pragmatic agile point-of-view, I would prefer
>> >>> >> > not to impose a strong unification.
>> >>> >>
>> >>> >> I think I may not be explaining clearly - let me elaborate by example
>> >>> >> a bit below
>> >>> >>
>> >>> >> > In the proposal of Samuel, all the parts seem independent.
>> >>> >> >
>> >>> >> >     C>   ------ More specific to this proposal - device
>> >>> >> >     C> linker vs host linker. What do you do for IPA/LTO or whole
>> >>> >> >     C> program optimizations? (Outside the scope of this
>> >>> >> >     C> project.. ?)
>> >>> >> >
>> >>> >> > Ouch. I did not think about it. It sounds like science-fiction for
>> >>> >> > now. :-) Probably outside the scope of this project..
>> >>> >>
>> >>> >> It should certainly not be science fiction or an after-thought. I
>> >>> >> won't go into shameless self promotion, but there are certainly
>> >>> >> useful things you can do when you have a "whole device kernel"
>> >>> >> perspective.
>> >>> >>
>> >>> >> To digress into the liboffload component of this (sorry)
>> >>> >> what we have today is basically liboffload/src/all source files
>> >>> >> mucked together
>> >>> >>
>> >>> >> What I'm proposing would look more like this
>> >>> >>
>> >>> >> liboffload/src/common_middle_layer_glue # to start this may be "best effort"
>> >>> >> liboffload/src/omp # This code should exist today, but ideally should build on top of the middle layer
>> >>> >> liboffload/src/ptx # this may exist today - not sure
>> >>> >> liboffload/src/amd_gpu # probably doesn't exist, but wouldn't/shouldn't block anything
>> >>> >> liboffload/src/phi # may exist in some form
>> >>> >> liboffload/src/cuda # may exist in some form outside of the OMP work
>> >>> >>
>> >>> >> The end result would be liboffload.
>> >>> >>
>> >>> >> Above and below the common middle layer API are programming model or
>> >>> >> hardware specific. To add a new hw backend you just implement the
>> >>> >> things the middle layer needs. To add a new programming model you
>> >>> >> build on top of the common layer. I'm not trying to force
>> >>> >> anyone/everyone to switch to this now - I'm hoping that by being a
>> >>> >> squeaky wheel this isolation of design and layers is there from the
>> >>> >> start - even if not perfect. I think it's sloppy to not consider this
>> >>> >> actually. LLVM's code generation is clean and has a nice separation
>> >>> >> per target (for the most part) - why should the offload library have
>> >>> >> bad design which just needs to be refactored later? I've seen others
>> >>> >> in the community beat up Intel to force them to have higher quality
>> >>> >> code before inclusion... some of this may actually be just minor
>> >>> >> refactoring to come close to the target. (No pun intended)
>> >>> >> -------------
>> >>> >> If others become open to this design - I'm happy to contribute more
>> >>> >> tangible details on the actual middle API.
>> >>> >>
>> >>> >> The objects which the driver has to deal with may and probably do
>> >>> >> overlap to some extent with the objects the liboffload has to load or
>> >>> >> deal with. Is there an API the driver can hook into to magically
>> >>> >> handle that or is it all per-device and 1-off?
>> >>> >> _______________________________________________
>> >>> >> cfe-dev mailing list
>> >>> >> cfe-dev at lists.llvm.org
>> >>> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>> >>> >
>> >>> >
>> >>> >
>> >>> >
>> >>
>> >>
>
>