[cfe-dev] [RFC][OpenMP][CUDA] Unified Offloading Support in Clang Driver

Eric Christopher via cfe-dev cfe-dev at lists.llvm.org
Fri Mar 4 14:26:27 PST 2016


On Fri, Mar 4, 2016 at 2:21 PM Justin Lebar via cfe-dev <
cfe-dev at lists.llvm.org> wrote:

> > So, in your opinion, should we create an action for each programming
> > model, or should we have a generic one?
>
> We currently have generic Actions, like "CompileAction".  I think those
> should stay?  BindArch and the like add a lot of complexity; maybe there's
> a way to get rid of those, merging their information into the other
> Actions.
>
> Does that answer your question?  I'm afraid I may be misunderstanding.
>
> > I have some application that I've been compiling with clang, and I
> > usually just run "make". Now I read somewhere that a new release of
> > clang has support for CUDA, and I happen to have a nice loop that I
> > could implement with CUDA. So, I add a new file with the new
> > implementation and run "make"; it compiles, but when I run it, it
> > crashes. The reason it crashes is that I was using separate compilation
> > and now I need to change all my makefile rules to forward a new kind of
> > file whose format I may not even know.
>
> Again, I do not think that we should make up new file formats and
> incorporate them into clang so that people can use new compiler features
> without modifying their makefiles.
>
> I think it is far more important that low-level tools such as ld and
> objdump continue to work on the files that the compiler outputs.  That
> likely means we'll have to output N separate files, one for the host and
> one for each device arch.
>
> But hey, this is just my opinion, and I'm a nobody here.  No offense taken
> if the community decides otherwise.
>

I haven't disagreed with anything you've said yet :)

-eric


>
> On Fri, Mar 4, 2016 at 2:14 PM, Samuel F Antao <sfantao at us.ibm.com> wrote:
> >
> >
> > 2016-03-04 14:40 GMT-05:00 Justin Lebar via cfe-dev
> > <cfe-dev at lists.llvm.org>:
> >>
> >> > If, as you say, building the Action graph for CUDA and OpenMP is
> >> > complicated, I think we should fix that.
> >>
> >> It occurs to me that perhaps all you want is to build up the Action
> >> graph in a non-language-specific manner, and then pass that to e.g.
> >> CUDA-specific code that will massage the Action graph into what it
> >> wants.
> >>
> >> I don't know if that would be an improvement over the current
> >> situation -- there are a lot of edge cases -- but it might.
> >
> >
> > That's a possible approach. Could be a good way to organize it. However,
> > if you have two different programming models, those transformations would
> > happen in a given sequence, so the one that comes last will have to be
> > aware of the programming model that was used for the first transformation.
> > This wouldn't be as clean as having the host actions (which are always the
> > same for a given file and options) and having all the job generation orbit
> > around them.
> >
> > Let me study the problem of doing this with actions and see all the
> > possible implications.
> >
> >>
> >>
> >> On Fri, Mar 4, 2016 at 11:34 AM, Justin Lebar <jlebar at google.com>
> wrote:
> >> >> This has two objectives. One is to avoid the creation of actions
> >> >> that are programming-model specific. The other is to remove
> >> >> complexity from the action creation that would otherwise have to mix
> >> >> phases and the DAG requirements of different programming models.
> >> >
> >> > As I understand this, we're saying that we'll build up an action
> >> > graph, but it is sort of a lie, in that it does not encapsulate all of
> >> > the logic we're interested in.  Then, when we convert the actions into
> >> > jobs, we'll postprocess them using language-specific logic to make the
> >> > jobs do what we want.
> >> >
> >> > I am not in favor of this approach, as I understand it.  Although I
> >> > acknowledge that it would simplify building the Action graph itself,
> >> > it does so by moving this complexity into a "shadow Action graph" --
> >> > the DAG that *actually* describes what we're going to do (which may
> >> > never be explicitly constructed, but still exists in our minds).  I
> >> > don't think this is actually a simplification.
> >> >
> >> > If, as you say, building the Action graph for CUDA and OpenMP is
> >> > complicated, I think we should fix that.  Then we'll be able to
> >> > continue using our existing tools to e.g. inspect the Action graph
> >> > generated by the driver.
> >> >
> >> >> I see the driver already as a wrapper, so I don't think it is
> >> >> inappropriate to use it.
> >> >
> >> > You and I, being compiler hackers, understand that the driver is a
> >> > wrapper.  However, to a user, the driver is the compiler.  No build
> >> > system invokes clang -cc1 directly.
> >> >
> >> >> However, I think the creation of the blob should be done by an
> >> >> external tool, say, as if it were a linker.
> >> >
> >> > Sure, but this isn't the difference I was getting at.  What I was
> >> > trying to say is that the creation of the blob should be done by a
> >> > tool which is external to the compiler *from the perspective of the
> >> > user*.  Meaning that, the driver should not invoke this tool.  If the
> >> > user wants it, they can invoke it explicitly (as they might use tar to
> >> > bundle their object files).
> >> >
> >> >> I'd put it this way: a bundled file should work as a normal host
> >> >> file, regardless of what device code it embeds.
> >> >
> >> > OK, but this still makes all existing tools useless if I want to
> >> > inspect device code.  If you give me a .o file and tell me that it's
> >> > device code, I can inspect it, disassemble it, or whatever using
> >> > existing tools.  If it's a bundle in a file format we made up here on
> >> > this list, there's very little chance existing tools are going to let
> >> > me get the device code out in a sensible way.
> >> >
> >> > Again, I don't think that inventing file formats -- however simple --
> >> > is a business that we should be getting into.
> >> >
> >> >> Even for ELF, I agree that putting the code in some section is more
> >> >> elegant. I'll investigate the possibilities to implement that.
> >> >
> >> > Maybe, but unless there's a way to annotate that section and say "this
> >> > section contains code for architecture foo", then objdump isn't going
> >> > to work sensibly on that section, and I think that's basically game
> >> > over.
> >> >
> >> >> On the other side, we have text files. My opinion is that we should
> >> >> have something that is easy to read and edit. What would a bundled
> >> >> text file look like, in your opinion?
> >> >
> >> > Similarly, this will not interoperate with any existing tools, and I
> >> > think that's job zero.
> >> >
> >> > On Fri, Mar 4, 2016 at 11:06 AM, Samuel F Antao <sfantao at us.ibm.com>
> >> > wrote:
> >> >> Hi Justin,
> >> >>
> >> >> It's great to have your feedback!
> >> >>
> >> >> 2016-03-03 17:09 GMT-05:00 Justin Lebar via cfe-dev
> >> >> <cfe-dev at lists.llvm.org>:
> >> >>>
> >> >>> Hi, I'm one of the people working on CUDA in clang.
> >> >>>
> >> >>> In general I agree that the support for CUDA today is rather
> >> >>> ad-hoc; it can likely be improved.  However, there are many points
> >> >>> in this proposal that I do not understand.  Inasmuch as I think I
> >> >>> understand it, I am concerned that it's adding new abstractions
> >> >>> instead of fixing the existing ones, and that this will result in a
> >> >>> lot of additional complexity.
> >> >>>
> >> >>> > a) Create toolchains for host and offload devices before creating
> >> >>> > the
> >> >>> > actions.
> >> >>> >
> >> >>> > The driver has to detect the employed programming models through
> the
> >> >>> > provided
> >> >>> > options (e.g. -fcuda or -fopenmp) or file extensions. For each
> host
> >> >>> > and
> >> >>> > offloading device and programming model, it should create a
> >> >>> > toolchain.
> >> >>>
> >> >>> Seems sane to me.
> >> >>>
> >> >>> > b) Keep the generation of Actions independent of the program
> model.
> >> >>> >
> >> >>> > In my view, the Actions should only depend on the compile phases
> >> >>> > requested by
> >> >>> > the user and the file extensions of the input files. Only the way
> >> >>> > those
> >> >>> > actions are interpreted to create jobs should be dependent on the
> >> >>> > programming
> >> >>> > model.  This would avoid complicating the actions creation with
> >> >>> > dependencies
> >> >>> > that only make sense to some programming models, which would make
> >> >>> > the
> >> >>> > implementation hard to scale when new programming models are to be
> >> >>> > adopted.
> >> >>>
> >> >>> I don't quite understand what you're proposing here, or what you're
> >> >>> trying
> >> >>> to
> >> >>> accomplish with this change.
> >> >>>
> >> >>> Perhaps it would help if you could give a concrete example of how
> this
> >> >>> would
> >> >>> change e.g. CUDA or Mac universal binary compilation?
> >> >>>
> >> >>> For example, in CUDA compilation, we have an action which says
> >> >>> "compile
> >> >>> everything below here as cuda arch sm_35".  sm_35 comes from a
> >> >>> command-line
> >> >>> flag, so as I understand your proposal, this could not be in the
> >> >>> action
> >> >>> graph,
> >> >>> because it doesn't come from the filename or the compile phases
> >> >>> requested
> >> >>> by
> >> >>> the user.  So, how will we express this notion that some actions
> >> >>> should be
> >> >>> compiled for a particular arch?
> >> >>
> >> >>
> >> >> This has two objectives. One is to avoid the creation of actions
> >> >> that are programming-model specific. The other is to remove
> >> >> complexity from the action creation that would otherwise have to mix
> >> >> phases and the DAG requirements of different programming models -
> >> >> currently CUDA only requires a single dependency, but if you have
> >> >> more programming models with different requirements and add separate
> >> >> compilation on top of that, the action generation will become complex
> >> >> and hard to scale. Just to clarify, I am not saying that creating
> >> >> actions for each programming model won't work; I just think that
> >> >> doing it differently will ensure that adding new programming models
> >> >> is less disruptive, as the programming-model specifics will be
> >> >> contained in a single place.
> >> >>
> >> >> The way I see it, an action just packs some information processed
> >> >> from a bunch of input info. However, creating an action specific to a
> >> >> programming model does not save you from having dedicated logic to
> >> >> deal with it when the jobs are created. So, given that the input info
> >> >> that results in an action is also available when the jobs are
> >> >> created, what I propose is to do all the programming-model specifics
> >> >> in a single place. We already have a cache of results in the jobs
> >> >> builder that could help navigate the dependencies and, even better,
> >> >> the queries this cache can provide can be completely agnostic of the
> >> >> programming model.
> >> >>
> >> >> Let me try to give you an example of how this proposal would affect
> >> >> CUDA:
> >> >>
> >> >> - Let's assume that the actions are generated the same way they are
> >> >> for the host, and that the driver already has the host toolchain and
> >> >> the nvptx toolchain, each marked with a new toolchain kind "CUDA"
> >> >> (these toolchains would be inferred from the options used to invoke
> >> >> the driver and/or the file extensions).
> >> >>
> >> >> - The jobs start to be created for the host as usual.
> >> >>
> >> >> - Before any job is constructed, there would be a post-processing of
> >> >> the results, so that extra results could be appended if required by
> >> >> the programming model.
> >> >>
> >> >> - This is what would happen in the post-processing function:
> >> >> {
> >> >>   if (!isThisCUDAHostToolChain)
> >> >>     return;
> >> >>
> >> >>   if (!ActionIsCompile)
> >> >>     return;
> >> >>
> >> >>   if (InputActionDependence.type != TY_CUDA)
> >> >>     return;
> >> >>
> >> >>   // Make the checks currently done in buildCudaActions().
> >> >>
> >> >>   DevTC = getDeviceToolChainOfKind(CUDA);
> >> >>   Action *Asm = CachedResults.giveMeDependentAsmAction();
> >> >>
> >> >>   for (c : CUDAComputeCapabilities) {
> >> >>     NewResult = BuildJobsForAction(DevTC, Asm);
> >> >>     // Or maybe better:
> >> >>     NewResult = BuildJobsForAction(DevTC, LinkAction(Asm));
> >> >>
> >> >>     Results.push_back(NewResult);
> >> >>   }
> >> >> }
> >> >>
> >> >> CachedResults would offer some extra functionality that is not
> >> >> programming-model specific, and this would provide the same
> >> >> functionality the CUDA action is providing. Adding a new programming
> >> >> model would only require adding an instance of this post-processing
> >> >> (apart from the creation of the toolchains, which would occur before
> >> >> anything else is done).
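> >> >>
> >> >> Purely for illustration (names hypothetical, mirroring the sketch
> >> >> above), an analogous OpenMP post-processing could be as simple as:
> >> >>
> >> >> {
> >> >>   if (!isThisOpenMPHostToolChain)
> >> >>     return;
> >> >>
> >> >>   if (!ActionIsLink)
> >> >>     return;
> >> >>
> >> >>   // One device link job per OpenMP device toolchain; its result is
> >> >>   // appended to the host link's results.
> >> >>   for (DevTC : getDeviceToolChainsOfKind(OpenMP)) {
> >> >>     NewResult = BuildJobsForAction(DevTC, LinkAction(DeviceObjects));
> >> >>     Results.push_back(NewResult);
> >> >>   }
> >> >> }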
> >> >>
> >> >> I agree these things are complicated to fully understand/explain
> >> >> based on a summary in an email. I'll try to come up with a proposal
> >> >> patch early next week so that we have something more concrete to
> >> >> discuss.
> >> >>
> >> >>>
> >> >>>
> >> >>> > c) Use unbundling and bundling tools agnostic of the programming
> >> >>> > model.
> >> >>> >
> >> >>> > I propose a single change in the action creation, and that is the
> >> >>> > creation of an "unbundling" and "bundling" action whose goal is to
> >> >>> > spare the user from having to deal with multiple files generated
> >> >>> > from multiple toolchains (the host toolchain and the offloading
> >> >>> > devices' toolchains) if separate compilation is used in the build
> >> >>> > system.
> >> >>>
> >> >>> I'm not sure I understand what "separate compilation" is here.  Do
> you
> >> >>> mean, a
> >> >>> compilation strategy which outputs logically separate machine code
> for
> >> >>> each
> >> >>> architecture, only to have this code combined at link time?  (In
> >> >>> contrast
> >> >>> to
> >> >>> how we currently compile CUDA, where the device code for a file is
> >> >>> integrated
> >> >>> into the host code for that file at compile time?)
> >> >>
> >> >>
> >> >> That's correct. By separate compilation I also mean the ability to
> >> >> link device-side code using a device linker (nvlink for CUDA).
> >> >>
> >> >>>
> >> >>> If that's right, then what I understand you're proposing here is
> that,
> >> >>> instead
> >> >>> of outputting N different object files -- one for the host, and N-1
> >> >>> for
> >> >>> all our
> >> >>> device architectures -- we'd just output one blob which clang would
> >> >>> understand
> >> >>> how to handle.
> >> >>
> >> >>
> >> >> Correct.
> >> >>
> >> >>>
> >> >>>
> >> >>> For my part, I am highly wary of introducing a new file format into
> >> >>> clang's
> >> >>> output.  Historically, clang (along with other compilers) does not
> >> >>> output
> >> >>> proprietary blobs.  Instead, we output object files in
> >> >>> well-understood,
> >> >>> interoperable formats, such as ELF.  This is beneficial because
> there
> >> >>> are
> >> >>> lots
> >> >>> of existing tools which can handle these files.  It also allows e.g.
> >> >>> code
> >> >>> compiled with clang to be linked with g++.
> >> >>>
> >> >>> Build tools are universally awful, and I sympathize with the urge
> not
> >> >>> to
> >> >>> change
> >> >>> them.  But I don't think this is a business we want the compiler to
> be
> >> >>> in.
> >> >>> Instead, if a user wants this kind of "fat object file", they could
> >> >>> obtain
> >> >>> one
> >> >>> by using a simple wrapper around clang.  If this wrapper's output
> >> >>> format
> >> >>> became
> >> >>> widely-used, we could then consider supporting it directly within
> >> >>> clang,
> >> >>> but
> >> >>> that's a proposition for many years in the future.
> >> >>
> >> >>
> >> >> I see the driver already as a wrapper, so I don't think it is
> >> >> inappropriate to use it. However, I think the creation of the blob
> >> >> should be done by an external tool, say, as if it were a linker. I
> >> >> have an initial proposal in
> >> >> http://lists.llvm.org/pipermail/cfe-dev/2016-February/047548.html,
> >> >> but based on your input and also Jonas's, I have to rethink a few
> >> >> things.
> >> >>
> >> >> I agree when you say that you would like to have the blob work well
> >> >> with other tools. Jonas also expressed this concern in a previous
> >> >> email. I'd put it this way: a bundled file should work as a normal
> >> >> host file, regardless of what device code it embeds.
> >> >>
> >> >> For ELF files this works just fine:
> >> >>
> >> >> clang a.c -c -o a.o
> >> >> echo "Some offloading bytes" >> a.o
> >> >> clang a.o -o a.out
> >> >> a.out
> >> >>
> >> >> However, for other binary formats we would need to wrap things in a
> >> >> different way. Even for ELF, I agree that putting the code in some
> >> >> section is more elegant. I'll investigate the possibilities to
> >> >> implement that.
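> >> >>
> >> >> Just to illustrate the section idea (the section name below is made
> >> >> up, not a format I'm proposing), plain objcopy can already embed and
> >> >> extract such a payload:
> >> >>
> >> >> # embed device code into a named section of the host object
> >> >> objcopy --add-section .offload.nvptx64=device.cubin a.o
> >> >> # and pull it back out
> >> >> objcopy -O binary --only-section=.offload.nvptx64 a.o device.cubin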
> >> >>
> >> >> On the other side, we have text files. My opinion is that we should
> >> >> have something that is easy to read and edit. What would a bundled
> >> >> text file look like, in your opinion?
> >> >>
> >> >> Do you think having all the device code guarded as a comment at the
> >> >> bottom is acceptable? That would work well as a host file.
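> >> >>
> >> >> E.g. (marker name invented just for illustration): keep the host
> >> >> assembly first, unchanged, and append each device blob after a line
> >> >> like "# __OFFLOAD_BUNDLE__ nvptx64-nvidia-cuda", with every device
> >> >> line prefixed by the host comment character, so host assemblers and
> >> >> other tools simply ignore it.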
> >> >>
> >> >>>
> >> >>>
> >> >>> > d) Allow the target toolchain to request the host toolchain to be
> >> >>> > used
> >> >>> > for a given action.
> >> >>>
> >> >>> Seems sane to me.
> >> >>>
> >> >>> > e)  Use a job results cache to enable sharing results between
> device
> >> >>> > and
> >> >>> > host toolchains.
> >> >>>
> >> >>> I don't understand why we need a cache for job results.  Why can we
> >> >>> not
> >> >>> set up
> >> >>> the Action graph such that each node has the correct inputs?
> (You've
> >> >>> actually
> >> >>> sketched exactly what I think the Action graph should look like, for
> >> >>> CUDA
> >> >>> and
> >> >>> OpenMP compilations.)
> >> >>
> >> >>
> >> >> I think what I explained above covers this one; if not, please let
> >> >> me know. Just to summarize, I'm not saying that expressing things in
> >> >> Actions won't work, I just think it will be more complex if we have
> >> >> multiple programming models (all potentially used in the same
> >> >> compile) and separate compilation in place. We already have a cache
> >> >> in the jobs builder; I was just planning to leverage that.
> >> >>
> >> >>>
> >> >>>
> >> >>> > f) Intercept the jobs creation before the emission of the command.
> >> >>> >
> >> >>> > In my view this is the only change required in the driver (apart
> >> >>> > from the obvious toolchain changes) that would be dependent on the
> >> >>> > programming model. A job-result post-processing function could
> >> >>> > check that there are offloading toolchains to be used and spawn
> >> >>> > the job creation for those toolchains, as well as append results
> >> >>> > from one toolchain to the results of another, according to the
> >> >>> > programming model's implementation needs.
> >> >>>
> >> >>> Again it's not clear to me why we cannot and should not represent
> this
> >> >>> in
> >> >>> the
> >> >>> Action graph.  It's that graph that's supposed to tell us what we're
> >> >>> going
> >> >>> to
> >> >>> do.
> >> >>
> >> >>
> >> >> I guess I covered this above; if not, let me know.
> >> >>
> >> >>>
> >> >>>
> >> >>> > g) Reflect the offloading programming model in the naming of the
> >> >>> > save-temps files.
> >> >>>
> >> >>> We already do this somewhat; e.g. for CUDA with save-temps, we'll
> >> >>> output
> >> >>> foo.s
> >> >>> and foo-sm_35.s.  Extending this to be more robust (e.g. including
> the
> >> >>> triple)
> >> >>> seems fine.
> >> >>
> >> >>
> >> >> Yes, the programming model, host/device (in OpenMP the same triple
> >> >> can be used for both host and device), and bound arch will make sure
> >> >> we get unique names.
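> >> >>
> >> >> For instance (naming purely illustrative, not a final scheme), a
> >> >> -save-temps run could produce foo.s for the host,
> >> >> foo-cuda-nvptx64-nvidia-cuda-sm_35.s for a CUDA device, and
> >> >> foo-openmp-powerpc64le-ibm-linux-gnu.s for an OpenMP device that
> >> >> shares the host triple.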
> >> >>
> >> >>>
> >> >>>
> >> >>> > h) Use special options -target-offload=<triple> to specify
> >> >>> > offloading
> >> >>> > targets and delimit options meant for a toolchain.
> >> >>>
> >> >>> I think I agree that we should generalize the flags we're using.
> >> >>>
> >> >>> I'm not sold on the name or structure (I'm not aware of any other
> >> >>> flags
> >> >>> that
> >> >>> affect *all* flags following them?), but we can bikeshed about that
> >> >>> separately.
> >> >>
> >> >>
> >> >> I guess we only have -Xblah and friends to change how the next
> >> >> option is used. I agree, this issue is in many ways orthogonal to
> >> >> everything else in this proposal; we can address it separately.
> >> >>
> >> >>>
> >> >>>
> >> >>> > i) Use the offload kinds in the toolchain to drive the commands
> >> >>> > generation by Tools.
> >> >>>
> >> >>> I'm not sure exactly what this means, but it doesn't sound
> >> >>> particularly contentious.  :)
> >> >>
> >> >>
> >> >> Sorry about that... My explanations get convoluted sometimes...
> >> >>
> >> >> What I mean is that, instead of relying on a file input or on
> >> >> attributes of an action, a command can be generated by looking at the
> >> >> offloading kind of the toolchain.
> >> >>
> >> >> E.g.
> >> >>
> >> >> isCuda = isToolChainKind(ToolChain::OFFLOAD_KINDS_CUDA);
> >> >>
> >> >> or
> >> >>
> >> >> if (isHostToolChainKind(ToolChain::OFFLOAD_KINDS_CUDA))
> >> >>   AuxTriple = getDeviceToolChain(ToolChain::OFFLOAD_KINDS_CUDA);
> >> >>
> >> >> This would allow a programming model to tune things here and there.
> >> >> Remember that the same toolchain can, in general, be used by
> >> >> different programming models, and simultaneously by the host and by
> >> >> devices. So being able to do things based on a kind simplifies things
> >> >> a lot.
> >> >>
> >> >>>
> >> >>>
> >> >>> > 3. We are willing to help with implementation of CUDA-specific
> parts
> >> >>> > when
> >> >>> > they overlap with the common infrastructure; though we expect that
> >> >>> > effort to
> >> >>> > be driven also by other contributors specifically interested in
> CUDA
> >> >>> > support
> >> >>> > that have the necessary know-how (both on CUDA itself and how it
> is
> >> >>> > supported
> >> >>> > in Clang / LLVM).
> >> >>>
> >> >>> Given that this is work that doesn't really help CUDA (the driver
> >> >>> works fine for us as-is), I am not sure we'll be able to devote
> >> >>> significant resources to this project.  Of course we'll be available
> >> >>> to assist with relevant code reviews and give advice.
> >> >>>
> >> >>> I think like any other change to clang, the responsibility will rest
> >> >>> on
> >> >>> the
> >> >>> authors not to break existing functionality, at the very least
> >> >>> inasmuch as
> >> >>> is
> >> >>> checked by existing unit tests.
> >> >>>
> >> >>
> >> >> Sure, having your feedback/suggestions and help with code review is
> >> >> all we ask for! We will try not to break anything (and if for some
> >> >> reason we do, we will fix it right away). Also, if we find
> >> >> opportunities to improve the CUDA support, we will be happy to
> >> >> contribute those as well.
> >> >>
> >> >> I hope I addressed the concerns you expressed initially. Let me know
> >> >> any
> >> >> other thoughts you may have.
> >> >>
> >> >> Thanks again!
> >> >> Samuel
> >> >>
> >> >>>
> >> >>> Regards,
> >> >>> -Justin
> >> >>>
> >> >>> On Thu, Mar 3, 2016 at 12:03 PM, Samuel F Antao via cfe-dev
> >> >>> <cfe-dev at lists.llvm.org> wrote:
> >> >>> > Hi Chris,
> >> >>> >
> >> >>> > I agree with Andrey when he says this should be a separate
> >> >>> > discussion.
> >> >>> >
> >> >>> > I think that aiming at a library that would support any possible
> >> >>> > programming model would take a long time, as it requires a lot of
> >> >>> > consensus, namely from those already maintaining programming
> >> >>> > models in clang (e.g. CUDA). We should try to have something
> >> >>> > incremental.
> >> >>> >
> >> >>> > I'm happy to discuss and learn more about the design and code you
> >> >>> > would like to contribute to this, but I think you should post it
> >> >>> > in a different thread.
> >> >>> >
> >> >>> > Thanks,
> >> >>> > Samuel
> >> >>> >
> >> >>> > 2016-03-03 11:20 GMT-05:00 C Bergström <cfe-dev at lists.llvm.org>:
> >> >>> >>
> >> >>> >> On Thu, Mar 3, 2016 at 10:19 PM, Ronan Keryell <ronan at keryell.fr
> >
> >> >>> >> wrote:
> >> >>> >> >>>>>> On Thu, 3 Mar 2016 18:19:43 +0700, C Bergström via cfe-dev
> >> >>> >> >>>>>> <cfe-dev at lists.llvm.org> said:
> >> >>> >> >
> >> >>> >> >     C> On Thu, Mar 3, 2016 at 5:50 PM, Ronan KERYELL via
> cfe-dev
> >> >>> >> >     C> <cfe-dev at lists.llvm.org> wrote:
> >> >>> >> >
> >> >>> >> >     >> Just to be sure I understand: you are thinking about
> >> >>> >> >     >> being able to outline several "languages" at once, such
> >> >>> >> >     >> as CUDA *and* OpenMP, right?
> >> >>> >> >     >>
> >> >>> >> >     >> I think it is required for serious applications. In the
> >> >>> >> >     >> HPC world, for example, it is common to have hybrid
> >> >>> >> >     >> multi-node heterogeneous applications that use
> >> >>> >> >     >> MPI+OpenMP+OpenCL. Since MPI and OpenCL are just
> >> >>> >> >     >> libraries, there is only OpenMP to off-load here. But if
> >> >>> >> >     >> we move to OpenCL SYCL instead, with MPI+OpenMP+SYCL,
> >> >>> >> >     >> then both OpenMP and SYCL have to be managed by the Clang
> >> >>> >> >     >> off-loading infrastructure at the same time, and we have
> >> >>> >> >     >> to be sure they combine gracefully...
> >> >>> >> >     >>
> >> >>> >> >     >> I think your second proposal about (un)bundling can
> >> >>> >> >     >> already manage this.
> >> >>> >> >     >>
> >> >>> >> >     >> Otherwise, what about the code outlining itself used in
> >> >>> >> >     >> the off-loading process? The code generation requires
> >> >>> >> >     >> outlining the kernel code into some external functions to
> >> >>> >> >     >> be compiled by the kernel compiler. Do you think it is up
> >> >>> >> >     >> to the programmer to re-use the recipes used by OpenMP
> >> >>> >> >     >> and CUDA, for example, or would it be interesting to have
> >> >>> >> >     >> a third proposal that abstracts the outliner so it can be
> >> >>> >> >     >> configured to handle OpenMP, CUDA, SYCL... globally?
> >> >>> >> >
> >> >>> >> >     C> Some very good points above and back to my broken
> record..
> >> >>> >> >
> >> >>> >> >     C> If all offloading is done in a single unified library -
> >> >>> >> >     C> a. Lowering in LLVM is greatly simplified since there's
> >> >>> >> >     C> ***1*** offload API to be supported. A region that's
> >> >>> >> >     C> outlined for SYCL, CUDA or something else is essentially
> >> >>> >> >     C> the same thing. (I do realize that some transformations
> >> >>> >> >     C> may be highly target specific, but to me that's more
> >> >>> >> >     C> target-hw driven than programming-model driven.)
> >> >>> >> >
> >> >>> >> >     C> b. Mixing CUDA/OMP/ACC/Foo in theory may "just work"
> >> >>> >> >     C> since the same runtime will handle them all. (With the
> >> >>> >> >     C> limitation that if you want CUDA to *talk to* OMP or
> >> >>> >> >     C> something else there needs to be some glue.  I'm merely
> >> >>> >> >     C> saying that one application can use multiple models in a
> >> >>> >> >     C> way that won't conflict.)
> >> >>> >> >
> >> >>> >> >     C> c. The driver doesn't need to figure out whether to link
> >> >>> >> >     C> against one or a multitude of combining/conflicting
> >> >>> >> >     C> libcuda, libomp, libsomething - it's liboffload - done
> >> >>> >> >
> >> >>> >> > Yes, a unified target library would help.
> >> >>> >> >
> >> >>> >> >     C> The driver proposal and the liboffload proposal should
> >> >>> >> >     C> imnsho be tightly coupled and work together as *1*. The
> >> >>> >> >     C> goals are significantly overlapping and relevant. If you
> >> >>> >> >     C> get the liboffload OMP people to make that more agnostic -
> >> >>> >> >     C> I think it simplifies the driver work.
> >> >>> >> >
> >> >>> >> > So basically it is about introducing a fourth unification:
> >> >>> >> > liboffload.
> >> >>> >> >
> >> >>> >> > A great unification sounds great.
> >> >>> >> > My only concern is that if we tie everything together, it would
> >> >>> >> > increase the entry cost: all the different components would have
> >> >>> >> > to be ready in lock-step. If there is already a runtime
> >> >>> >> > available, it would be easier to start with it and develop the
> >> >>> >> > other part in the meantime. So from a pragmatic, agile
> >> >>> >> > point-of-view, I would prefer not to impose a strong unification.
> >> >>> >>
> >> >>> >> I think I may not be explaining this clearly - let me elaborate
> >> >>> >> by example a bit below.
> >> >>> >>
> >> >>> >> > In the proposal of Samuel, all the parts seem independent.
> >> >>> >> >
> >> >>> >> >     C>   ------ More specific to this proposal - device
> >> >>> >> >     C> linker vs host linker. What do you do for IPA/LTO or
> whole
> >> >>> >> >     C> program optimizations? (Outside the scope of this
> >> >>> >> > project.. ?)
> >> >>> >> >
> >> >>> >> > Ouch. I did not think about it. It sounds like science-fiction
> >> >>> >> > for
> >> >>> >> > now. :-) Probably outside the scope of this project..
> >> >>> >>
> >> >>> >> It should certainly not be science fiction or an after-thought. I
> >> >>> >> won't go into shameless self promotion, but there are certainly
> >> >>> >> useful
> >> >>> >> things you can do when you have a "whole device kernel"
> >> >>> >> perspective.
> >> >>> >>
> >> >>> >> To digress into the liboffload component of this (sorry):
> >> >>> >> what we have today is basically liboffload/src/ with all source
> >> >>> >> files mucked together.
> >> >>> >>
> >> >>> >> What I'm proposing would look more like this:
> >> >>> >>
> >> >>> >> liboffload/src/common_middle_layer_glue  # to start, this may be
> >> >>> >>                                          # "best effort"
> >> >>> >> liboffload/src/omp      # This code should exist today, but ideally
> >> >>> >>                         # should build on top of the middle layer
> >> >>> >> liboffload/src/ptx      # this may exist today - not sure
> >> >>> >> liboffload/src/amd_gpu  # probably doesn't exist, but
> >> >>> >>                         # wouldn't/shouldn't block anything
> >> >>> >> liboffload/src/phi      # may exist in some form
> >> >>> >> liboffload/src/cuda     # may exist in some form outside of the
> >> >>> >>                         # OMP work
> >> >>> >>
> >> >>> >> The end result would be liboffload.
> >> >>> >>
> >> >>> >> The layers above and below the common middle-layer API are
> >> >>> >> programming-model or hardware specific. To add a new hw backend
> >> >>> >> you just implement the things the middle layer needs. To add a
> >> >>> >> new programming model you build on top of the common layer. I'm
> >> >>> >> not trying to force anyone/everyone to switch to this now - I'm
> >> >>> >> hoping that by being a squeaky wheel this isolation of design and
> >> >>> >> layers is there from the start - even if not perfect. I actually
> >> >>> >> think it's sloppy not to consider this. LLVM's code generation is
> >> >>> >> clean and has a nice separation per target (for the most part) -
> >> >>> >> why should the offload library have a bad design which just needs
> >> >>> >> to be refactored later? I've seen others in the community beat up
> >> >>> >> Intel to force them to have higher quality code before
> >> >>> >> inclusion... some of this may actually be just minor refactoring
> >> >>> >> to come close to the target. (No pun intended)
> >> >>> >> -------------
> >> >>> >> If others become open to this design - I'm happy to contribute
> >> >>> >> more tangible details on the actual middle API.
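> >> >>> >>
> >> >>> >> Roughly - names invented here just to make the layering concrete,
> >> >>> >> not an existing liboffload API - the backend-facing half of such a
> >> >>> >> middle layer could be as small as:
> >> >>> >>
> >> >>> >> #include <stddef.h>
> >> >>> >>
> >> >>> >> struct offload_device;   // opaque per-device handle
> >> >>> >>
> >> >>> >> // Implemented once per hardware backend (ptx, amd_gpu, phi, ...).
> >> >>> >> struct offload_backend_ops {
> >> >>> >>   int   (*init)(offload_device *dev);
> >> >>> >>   void *(*alloc)(offload_device *dev, size_t bytes);
> >> >>> >>   int   (*copy_to)(offload_device *dev, void *dst, const void *src,
> >> >>> >>                    size_t bytes);
> >> >>> >>   int   (*launch)(offload_device *dev, const char *kernel,
> >> >>> >>                   void **args, size_t nargs);
> >> >>> >> };
> >> >>> >>
> >> >>> >> // Registered by each backend; programming models (omp, cuda, ...)
> >> >>> >> // sit on top and talk only to this interface.
> >> >>> >> int offload_register_backend(const char *name,
> >> >>> >>                              const offload_backend_ops *ops);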
> >> >>> >>
> >> >>> >> The objects which the driver has to deal with may, and probably
> >> >>> >> do, overlap to some extent with the objects liboffload has to
> >> >>> >> load or deal with. Is there an API the driver can hook into to
> >> >>> >> magically handle that, or is it all per-device and one-off?
> >> >>> >
> >> >>> >
> >> >>> >
> >> >>> >
> >> >>
> >> >>
> >
> >
>

