<div dir="ltr"><br><br><div class="gmail_quote"><div dir="ltr">On Fri, Mar 4, 2016 at 2:21 PM Justin Lebar via cfe-dev <<a href="mailto:cfe-dev@lists.llvm.org">cfe-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">> So, in your opinion, should we create an action for each programing model or<br>

> should we have a generic one?<br>

<br>

We currently have generic Actions, like "CompileAction".  I think those should<br>

stay?  BindArch and the like add a lot of complexity, maybe there's a way to<br>

get rid of those, merging their information into the other Actions.<br>

<br>

Does that answer your question?  I'm afraid I may be misunderstanding.<br>

<br>

> I have some application that I've been compiling with clang, and I usually<br>

> just run "make". Now I read somewhere that a new release of clang has<br>

> support for CUDA and I happen to have a nice loop that I could implement with<br>

> CUDA. So, I add a new file with the new implementation, then I run "make", it<br>

> compiles but when I run it crashes. The reason it crashes is that I was using<br>

> separate compilation and now I need to change all my makefile rules to<br>

> forward a new kind of file, that I may not even know what it is.<br>

<br>

Again, I do not think that we should make up new file formats and incorporate<br>

them into clang so that people can use new compiler features without modifying<br>

their makefiles.<br>

<br>

I think it is far more important that low-level tools such as ld and objdump<br>

continue to work on the files that the compiler outputs.  That likely means<br>

we'll have to output N separate files, one for the host and one for each device<br>

arch.<br>

<br>

But hey, this is just my opinion, and I'm a nobody here.  No offense taken if<br>

the community decides otherwise.<br></blockquote><div><br></div><div>I haven't disagreed with anything you've said yet :)</div><div><br></div><div>-eric</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

On Fri, Mar 4, 2016 at 2:14 PM, Samuel F Antao <<a href="mailto:sfantao@us.ibm.com" target="_blank">sfantao@us.ibm.com</a>> wrote:<br>

><br>

><br>

> 2016-03-04 14:40 GMT-05:00 Justin Lebar via cfe-dev<br>

> <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>>:<br>

>><br>

>> > If, as you say, building the Action graph for CUDA and OpenMP is<br>

>> > complicated, I think we should fix that.<br>

>><br>

>> It occurs to me that perhaps all you want is to build up the Action<br>

>> graph in a non-language-specific manner, and then pass that to e.g.<br>

>> CUDA-specific code that will massage the Action graph into what it<br>

>> wants.<br>

>><br>

>> I don't know if that would be an improvement over the current<br>

>> situation -- there are a lot of edge cases -- but it might.<br>

><br>

><br>

> That's a possible approach. Could be a good way to organize it. However, if<br>

> you have two different programming models those transformations would happen<br>

> in a given sequence, so the one that comes last will have to be aware of the<br>

> programming model that was used for the first transformation. This wouldn't<br>

> be as clean as having the host actions (which are always the same for a<br>

> given file and options) and have all the job generation to orbit around<br>

> that.<br>

><br>

> Let me study the problem of doing this with actions and see all the possible<br>

> implications.<br>

><br>

>><br>

>><br>

>> On Fri, Mar 4, 2016 at 11:34 AM, Justin Lebar <<a href="mailto:jlebar@google.com" target="_blank">jlebar@google.com</a>> wrote:<br>

>> >> This has two objectives. One is to avoid the creation of actions that<br>

>> >> are programming model specific. The other is to remove complexity from the<br>

>> >> action creation that would have to mix phases and different programming<br>

>> >> models' DAG requirements<br>

>> ><br>

>> > As I understand this, we're saying that we'll build up an action<br>

>> > graph, but it is sort of a lie, in that it does not encapsulate all of<br>

>> > the logic we're interested in.  Then, when we convert the actions into<br>

>> > jobs, we'll postprocess them using language-specific logic to make the<br>

>> > jobs do what we want.<br>

>> ><br>

>> > I am not in favor of this approach, as I understand it.  Although I<br>

>> > acknowledge that it would simplify building the Action graph itself,<br>

>> > it does so by moving this complexity into a "shadow Action graph" --<br>

>> > the DAG that *actually* describes what we're going to do (which may<br>

>> > never be explicitly constructed, but still exists in our minds).  I<br>

>> > don't think this is actually a simplification.<br>

>> ><br>

>> > If, as you say, building the Action graph for CUDA and OpenMP is<br>

>> > complicated, I think we should fix that.  Then we'll be able to<br>

>> > continue using our existing tools to e.g. inspect the Action graph<br>

>> > generated by the driver.<br>

>> ><br>

>> >> I see the driver already as a wrapper, so I don't think it is<br>

>> >> inappropriate to use it.<br>

>> ><br>

>> > You and I, being compiler hackers, understand that the driver is a<br>

>> > wrapper.  However, to a user, the driver is the compiler.  No build<br>

>> > system invokes clang -cc1 directly.<br>

>> ><br>

>> >> However, I think the creation of the blob should be done by an external<br>

>> >> tool, say, as if it were a linker.<br>

>> ><br>

>> > Sure, but this isn't the difference I was getting at.  What I was<br>

>> > trying to say is that the creation of the blob should be done by a<br>

>> > tool which is external to the compiler *from the perspective of the<br>

>> > user*.  Meaning that, the driver should not invoke this tool.  If the<br>

>> > user wants it, they can invoke it explicitly (as they might use tar to<br>

>> > bundle their object files).<br>

>> ><br>

>> >> I'd put it this way: a bundled file should work as a normal host<br>

>> >> file, regardless of what device code it embeds.<br>

>> ><br>

>> > OK, but this still makes all existing tools useless if I want to<br>

>> > inspect device code.  If you give me a .o file and tell me that it's<br>

>> > device code, I can inspect it, disassemble it, or whatever using<br>

>> > existing tools.  If it's a bundle in a file format we made up here on<br>

>> > this list, there's very little chance existing tools are going to let<br>

>> > me get the device code out in a sensible way.<br>

>> ><br>

>> > Again, I don't think that inventing file formats -- however simple --<br>

>> > is a business that we should be getting into.<br>

>> ><br>

>> >> Even for ELF, I agree putting the code in some section is more elegant.<br>

>> >> I'll investigate the possibilities to implement that.<br>

>> ><br>

>> > Maybe, but unless there's a way to annotate that section and say "this<br>

>> > section contains code for architecture foo", then objdump isn't going<br>

>> > to work sensibly on that section, and I think that's basically game<br>

>> > over.<br>

>> ><br>

>> >> On the other side, we have text files. My opinion is that we should have<br>

>> >> something that is easy to read and edit. What would a bundled text file look<br>

>> >> like in your opinion?<br>

>> ><br>

>> > Similarly, this will not interoperate with any existing tools, and I<br>

>> > think that's job zero.<br>

>> ><br>

>> > On Fri, Mar 4, 2016 at 11:06 AM, Samuel F Antao <<a href="mailto:sfantao@us.ibm.com" target="_blank">sfantao@us.ibm.com</a>><br>

>> > wrote:<br>

>> >> Hi Justin,<br>

>> >><br>

>> >> It's great to have your feedback!<br>

>> >><br>

>> >> 2016-03-03 17:09 GMT-05:00 Justin Lebar via cfe-dev<br>

>> >> <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>>:<br>

>> >>><br>

>> >>> Hi, I'm one of the people working on CUDA in clang.<br>

>> >>><br>

>> >>> In general I agree that the support for CUDA today is rather ad-hoc; it can<br>

>> >>> likely be improved.  However, there are many points in this proposal that I do<br>

>> >>> not understand.  Inasmuch as I think I understand it, I am concerned that it's<br>

>> >>> adding new abstractions instead of fixing the existing ones, and that this<br>

>> >>> will result in a lot of additional complexity.<br>

>> >>><br>

>> >>> > a) Create toolchains for host and offload devices before creating<br>

>> >>> > the<br>

>> >>> > actions.<br>

>> >>> ><br>

>> >>> > The driver has to detect the employed programming models through the<br>

>> >>> > provided<br>

>> >>> > options (e.g. -fcuda or -fopenmp) or file extensions. For each host<br>

>> >>> > and<br>

>> >>> > offloading device and programming model, it should create a<br>

>> >>> > toolchain.<br>

>> >>><br>

>> >>> Seems sane to me.<br>

>> >>><br>

>> >>> > b) Keep the generation of Actions independent of the programming model.<br>

>> >>> ><br>

>> >>> > In my view, the Actions should only depend on the compile phases<br>

>> >>> > requested by<br>

>> >>> > the user and the file extensions of the input files. Only the way<br>

>> >>> > those<br>

>> >>> > actions are interpreted to create jobs should be dependent on the<br>

>> >>> > programming<br>

>> >>> > model.  This would avoid complicating the actions creation with<br>

>> >>> > dependencies<br>

>> >>> > that only make sense to some programming models, which would make<br>

>> >>> > the<br>

>> >>> > implementation hard to scale when new programming models are to be<br>

>> >>> > adopted.<br>

>> >>><br>

>> >>> I don't quite understand what you're proposing here, or what you're<br>

>> >>> trying<br>

>> >>> to<br>

>> >>> accomplish with this change.<br>

>> >>><br>

>> >>> Perhaps it would help if you could give a concrete example of how this<br>

>> >>> would<br>

>> >>> change e.g. CUDA or Mac universal binary compilation?<br>

>> >>><br>

>> >>> For example, in CUDA compilation, we have an action which says<br>

>> >>> "compile<br>

>> >>> everything below here as cuda arch sm_35".  sm_35 comes from a<br>

>> >>> command-line<br>

>> >>> flag, so as I understand your proposal, this could not be in the<br>

>> >>> action<br>

>> >>> graph,<br>

>> >>> because it doesn't come from the filename or the compile phases<br>

>> >>> requested<br>

>> >>> by<br>

>> >>> the user.  So, how will we express this notion that some actions<br>

>> >>> should be<br>

>> >>> compiled for a particular arch?<br>

>> >><br>

>> >><br>

>> >> This has two objectives. One is to avoid the creation of actions that<br>

>> >> are<br>

>> >> programming model specific. The other is to remove complexity from the<br>

>> >> action creation that would have to mix phases and different programming<br>

>> >> models' DAG requirements - currently CUDA only requires one single<br>

>> >> dependency<br>

>> >> but if you have more programming models with different requirements and<br>

>> >> add<br>

>> >> separate compilation on top of that, the action generation will become<br>

>> >> complex and hard to scale. Just to clarify, I am not saying that<br>

>> >> creating<br>

>> >> actions for each programming model won't work, I just think that doing<br>

>> >> this<br>

>> >> differently will ensure that adding new programming models will be less<br>

>> >> disruptive as the programming model specifics will be contained in a<br>

>> >> single<br>

>> >> place.<br>

>> >><br>

>> >> The way I see it is that an action just packs some information<br>

>> >> processed<br>

>> >> from a bunch of input info. However, creating an action specific for a<br>

>> >> programming model does not prevent you from having to have dedicated<br>

>> >> logic<br>

>> >> to deal with it when the jobs are created. So, given that the input<br>

>> >> info<br>

>> >> that results in an action is also available when the jobs are created,<br>

>> >> what<br>

>> >> I propose is to do all the programming model specifics in a single<br>

>> >> place. We<br>

>> >> already have a cache of results in the jobs builder that could help<br>

>> >> navigate<br>

>> >> the dependences and, even better, the queries this cache can provide<br>

>> >> can be<br>

>> >> completely agnostic of the programming model.<br>

>> >><br>

>> >> Let me try to give you an example on how this proposal would affect<br>

>> >> CUDA:<br>

>> >><br>

>> >> - Let's assume that the actions are generated the same way they are for<br>

>> >> the<br>

>> >> host. And that we already have in the driver the host toolchain and<br>

>> >> also the<br>

>> >> nvptx toolchain, each marked with a new toolchain kind "CUDA" (these<br>

>> >> toolchains were inferred from the options used to invoke the driver<br>

>> >> and/or<br>

>> >> file extensions).<br>

>> >><br>

>> >> - The jobs start to be created for the host as usual.<br>

>> >><br>

>> >> - Before any job is constructed there would be a post-processing of<br>

>> >> the<br>

>> >> results, so that extra results could be appended if required by the<br>

>> >> programming model.<br>

>> >><br>

>> >> - This is what would happen in the post-processing function:<br>

>> >> {<br>

>> >>   if (!isThisCUDAHostToolChain)<br>

>> >>     return;<br>

>> >><br>

>> >>   if (!ActionIsCompile)<br>

>> >>     return;<br>

>> >><br>

>> >>   if (InputActionDependence.type != TY_CUDA)<br>

>> >>     return;<br>

>> >><br>

>> >>   //Make checks currently in buildCudaActions()<br>

>> >><br>

>> >>   DevTC = getDeviceToolChainOfKind(CUDA);<br>

>> >>   Action *Asm = CachedResults().giveMeDependentAsmAction();<br>

>> >><br>

>> >>   for (auto c : CUDAComputeCapabilities) {<br>

>> >>     NewResult = BuildJobsForAction(DevTC, Asm);<br>

>> >>     // Or maybe better<br>

>> >>     NewResult = BuildJobsForAction(DevTC, LinkAction(Asm));<br>

>> >><br>

>> >>     Results.push_back(NewResult);<br>

>> >>   }<br>

>> >> }<br>

>> >><br>

>> >> CachedResults would offer some extra functionality that is not<br>

>> >> programming<br>

>> >> model specific, and this would provide the same functionality the CUDA<br>

>> >> action is providing. Adding a new programming model would only require<br>

>> >> adding an instance of this post-process ( apart from the creation of<br>

>> >> the<br>

>> >> toolchains that would occur before anything starts to be done).<br>
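<br>
A minimal sketch of the post-processing step described above, written against toy stand-in types so it compiles on its own. It is not the clang driver API: ToyAction, ToyResult, OffloadKind and postProcessForCuda are hypothetical names. The only assumption carried over from the pseudocode is that a host CUDA compile action gains one device result per compute capability.<br>

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for driver concepts; none of these are real clang types.
enum class OffloadKind { None, CUDA };
enum class ActionKind { Preprocess, Compile, Assemble, Link };
enum class InputType { C, CUDA };

struct ToyAction {
  ActionKind Kind;   // compile phase this action represents
  InputType Input;   // type of the input feeding the action
};

struct ToyResult {
  std::string ToolChain;  // toolchain that produced the result
  std::string Arch;       // bound arch; empty for the host
};

// Appends one device result per compute capability when the host toolchain
// is marked for CUDA and the action is a compile of a CUDA input.
// In every other case Results is left untouched.
void postProcessForCuda(OffloadKind HostKind, const ToyAction &A,
                        const std::vector<std::string> &ComputeCapabilities,
                        std::vector<ToyResult> &Results) {
  if (HostKind != OffloadKind::CUDA)
    return;
  if (A.Kind != ActionKind::Compile)
    return;
  if (A.Input != InputType::CUDA)
    return;

  for (const std::string &CC : ComputeCapabilities)
    Results.push_back(ToyResult{"nvptx", CC});
}
```

With this shape, a second programming model would be another function with the same signature, which is the containment property argued for above. Whether that is simpler than expressing the same dependencies in the Action graph is the open question in this thread.<br>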

>> >><br>

>> >> I agree these things are complicated to fully understand/explain based<br>

>> >> on a<br>

>> >> summary in an email. I'll try to come up with a proposal-patch early<br>

>> >> next<br>

>> >> week so that we have something more concrete to discuss.<br>

>> >><br>

>> >>><br>

>> >>><br>

>> >>> > c) Use unbundling and bundling tools agnostic of the programming<br>

>> >>> > model.<br>

>> >>> ><br>

>> >>> > I propose a single change in the action creation and that is the<br>

>> >>> > creation of<br>

>> >>> > an "unbundling" and "bundling" action whose goal is to prevent the<br>

>> >>> > user<br>

>> >>> > to<br>

>> >>> > have to deal with multiple files generated from multiple toolchains<br>

>> >>> > (host<br>

>> >>> > toolchain and offloading devices’ toolchains) if he uses separate<br>

>> >>> > compilation<br>

>> >>> > in his build system.<br>

>> >>><br>

>> >>> I'm not sure I understand what "separate compilation" is here.  Do you<br>

>> >>> mean, a<br>

>> >>> compilation strategy which outputs logically separate machine code for<br>

>> >>> each<br>

>> >>> architecture, only to have this code combined at link time?  (In<br>

>> >>> contrast<br>

>> >>> to<br>

>> >>> how we currently compile CUDA, where the device code for a file is<br>

>> >>> integrated<br>

>> >>> into the host code for that file at compile time?)<br>

>> >><br>

>> >><br>

>> >> That's correct. With separate compilation I also mean the ability to<br>

>> >> link<br>

>> >> device side code, using a device linker (nvlink for CUDA).<br>

>> >><br>

>> >>><br>

>> >>> If that's right, then what I understand you're proposing here is that,<br>

>> >>> instead<br>

>> >>> of outputting N different object files -- one for the host, and N-1<br>

>> >>> for<br>

>> >>> all our<br>

>> >>> device architectures -- we'd just output one blob which clang would<br>

>> >>> understand<br>

>> >>> how to handle.<br>

>> >><br>

>> >><br>

>> >> Correct.<br>

>> >><br>

>> >>><br>

>> >>><br>

>> >>> For my part, I am highly wary of introducing a new file format into<br>

>> >>> clang's<br>

>> >>> output.  Historically, clang (along with other compilers) does not<br>

>> >>> output<br>

>> >>> proprietary blobs.  Instead, we output object files in<br>

>> >>> well-understood,<br>

>> >>> interoperable formats, such as ELF.  This is beneficial because there<br>

>> >>> are<br>

>> >>> lots<br>

>> >>> of existing tools which can handle these files.  It also allows e.g.<br>

>> >>> code<br>

>> >>> compiled with clang to be linked with g++.<br>

>> >>><br>

>> >>> Build tools are universally awful, and I sympathize with the urge not<br>

>> >>> to<br>

>> >>> change<br>

>> >>> them.  But I don't think this is a business we want the compiler to be<br>

>> >>> in.<br>

>> >>> Instead, if a user wants this kind of "fat object file", they could<br>

>> >>> obtain<br>

>> >>> one<br>

>> >>> by using a simple wrapper around clang.  If this wrapper's output<br>

>> >>> format<br>

>> >>> became<br>

>> >>> widely-used, we could then consider supporting it directly within<br>

>> >>> clang,<br>

>> >>> but<br>

>> >>> that's a proposition for many years in the future.<br>

>> >><br>

>> >><br>

>> >> I see the driver already as a wrapper, so I don't think it is<br>

>> >> inappropriate to use it. However, I think the creation of the blob should<br>

>> >> be<br>

>> >> done by an external tool, say, as if it were a linker. I have an initial<br>

>> >> proposal in<br>

>> >> <a href="http://lists.llvm.org/pipermail/cfe-dev/2016-February/047548.html" rel="noreferrer" target="_blank">http://lists.llvm.org/pipermail/cfe-dev/2016-February/047548.html</a>, but<br>

>> >> based<br>

>> >> on your input and also Jonas's, I have to rethink a few things.<br>

>> >><br>

>> >> I agree when you say that you would like to have the blob working well<br>

>> >> with<br>

>> >> other tools. Jonas in some previous email also expressed this concern.<br>

>> >> I'd<br>

>> >> put it this way: a bundled file should work as a normal host file,<br>

>> >> regardless of what device code it embeds.<br>

>> >><br>

>> >> For ELF files this works just fine:<br>

>> >><br>

>> >> clang a.c -c -o a.o<br>

>> >> echo "Some offloading bytes" >> a.o<br>

>> >> clang a.o -o a.out<br>

>> >> a.out<br>

>> >><br>

>> >> However, for other binary formats, we need to wrap it in a different way. Even<br>

>> >> for<br>

>> >> ELF, I agree putting the code in some section is more elegant. I'll<br>

>> >> investigate the possibilities to implement that.<br>

>> >><br>

>> >> On the other side, we have text files. My opinion is that we should have<br>

>> >> something that is easy to read and edit. What would a bundled text file<br>

>> >> look<br>

>> >> like in your opinion?<br>

>> >><br>

>> >> Do you think having all the device code guarded as a comment at the<br>

>> >> bottom is<br>

>> >> acceptable? That would work well as a host file.<br>

>> >><br>

>> >>><br>

>> >>><br>

>> >>> > d) Allow the target toolchain to request the host toolchain to be<br>

>> >>> > used<br>

>> >>> > for a given action.<br>

>> >>><br>

>> >>> Seems sane to me.<br>

>> >>><br>

>> >>> > e)  Use a job results cache to enable sharing results between device<br>

>> >>> > and<br>

>> >>> > host toolchains.<br>

>> >>><br>

>> >>> I don't understand why we need a cache for job results.  Why can we<br>

>> >>> not<br>

>> >>> set up<br>

>> >>> the Action graph such that each node has the correct inputs?  (You've<br>

>> >>> actually<br>

>> >>> sketched exactly what I think the Action graph should look like, for<br>

>> >>> CUDA<br>

>> >>> and<br>

>> >>> OpenMP compilations.)<br>

>> >><br>

>> >><br>

>> >> I think what I explain above covers this one. If not, please let me<br>

>> >> know.<br>

>> >> Just to summarize, I'm not saying expressing things in Actions won't<br>

>> >> work, I<br>

>> >> just think that will be more complex if we have multiple programming<br>

>> >> models<br>

>> >> (all potentially used in the same compile) and separate compilation in<br>

>> >> place. We already have a cache in the jobs builder, I was just planning<br>

>> >> to<br>

>> >> leverage that.<br>

>> >><br>

>> >>><br>

>> >>><br>

>> >>> > f) Intercept the jobs creation before the emission of the command.<br>

>> >>> ><br>

>> >>> > In my view this is the only change required in the driver (apart<br>

>> >>> > from<br>

>> >>> > the<br>

>> >>> > obvious toolchain changes) that would be dependent on the<br>

>> >>> > programming<br>

>> >>> > model.<br>

>> >>> > A job result post-processing function could check that there are<br>

>> >>> > offloading<br>

>> >>> > toolchains to be used and spawn the jobs creation for those<br>

>> >>> > toolchains<br>

>> >>> > as<br>

>> >>> > well as append results from one toolchain to the results of some<br>

>> >>> > other<br>

>> >>> > accordingly to the programming model implementation needs.<br>

>> >>><br>

>> >>> Again it's not clear to me why we cannot and should not represent this<br>

>> >>> in<br>

>> >>> the<br>

>> >>> Action graph.  It's that graph that's supposed to tell us what we're<br>

>> >>> going<br>

>> >>> to<br>

>> >>> do.<br>

>> >><br>

>> >><br>

>> >> I guess I covered this above; if not, let me know.<br>

>> >><br>

>> >>><br>

>> >>><br>

>> >>> > g) Reflect the offloading programming model in the naming of the<br>

>> >>> > save-temps files.<br>

>> >>><br>

>> >>> We already do this somewhat; e.g. for CUDA with save-temps, we'll<br>

>> >>> output<br>

>> >>> foo.s<br>

>> >>> and foo-sm_35.s.  Extending this to be more robust (e.g. including the<br>

>> >>> triple)<br>

>> >>> seems fine.<br>

>> >><br>

>> >><br>

>> >> Yes, programming model, host/device (in openmp same triple can be used<br>

>> >> for<br>

>> >> both host and device), and bound arch will make sure we get unique<br>

>> >> names.<br>
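<br>
To illustrate why those three components are enough to keep save-temps names unique, here is a small sketch. The naming scheme and the helper tempFileName are hypothetical, not what clang emits today (which, per the message above, is along the lines of foo.s and foo-sm_35.s), and the triples in the usage note are only examples.<br>

```cpp
#include <string>

// Hypothetical naming helper: builds
//   <stem>[-<model>-(host|device)][-<triple>][-<bound arch>].<ext>
// Empty components are skipped, so a plain host compile keeps its usual name.
std::string tempFileName(const std::string &Stem, const std::string &Model,
                         bool IsDevice, const std::string &Triple,
                         const std::string &BoundArch,
                         const std::string &Ext) {
  std::string Name = Stem;
  if (!Model.empty()) {
    Name += "-" + Model;
    // Host and device may share a triple (e.g. OpenMP), so the role is
    // recorded explicitly instead of being inferred from the triple.
    Name += IsDevice ? "-device" : "-host";
  }
  if (!Triple.empty())
    Name += "-" + Triple;
  if (!BoundArch.empty())
    Name += "-" + BoundArch;
  return Name + "." + Ext;
}
```

For an OpenMP compile where host and device use the same triple, the host and device calls differ only in the IsDevice argument, yet still yield distinct file names. For CUDA, the bound arch (for example sm_35) distinguishes the per-architecture device outputs.<br>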

>> >><br>

>> >>><br>

>> >>><br>

>> >>> > h) Use special options -target-offload=<triple> to specify<br>

>> >>> > offloading<br>

>> >>> > targets and delimit options meant for a toolchain.<br>

>> >>><br>

>> >>> I think I agree that we should generalize the flags we're using.<br>

>> >>><br>

>> >>> I'm not sold on the name or structure (I'm not aware of any other<br>

>> >>> flags<br>

>> >>> that<br>

>> >>> affect *all* flags following them?), but we can bikeshed about that<br>

>> >>> separately.<br>

>> >><br>

>> >><br>

>> >> I guess we only have -Xblah and friends to change how the next option<br>

>> >> is<br>

>> >> used. I agree, this issue is in many ways orthogonal to everything<br>

>> >> else<br>

>> >> in this proposal, we can address it separately.<br>

>> >><br>

>> >>><br>

>> >>><br>

>> >>> > i) Use the offload kinds in the toolchain to drive the commands<br>

>> >>> > generation by Tools.<br>

>> >>><br>

>> >>> I'm not sure exactly what this means, but it doesn't sound<br>

>> >>> particularly contentious.  :)<br>

>> >><br>

>> >><br>

>> >> Sorry about that... My explanations get convoluted sometimes...<br>

>> >><br>

>> >> What I mean is that, instead of relying on a file input, or attributes<br>

>> >> of an<br>

>> >> action, a command can be generated by looking at the offloading kind of<br>

>> >> the<br>

>> >> toolchain.<br>

>> >><br>

>> >> E.g.<br>

>> >><br>

>> >> isCuda = isToolChainKind(Toolchain::OFFLOAD_KINDS_CUDA);<br>

>> >><br>

>> >> or<br>

>> >><br>

>> >> if (isHostToolChainKind(Toolchain::OFFLOAD_KINDS_CUDA))<br>

>> >>   AuxTriple = getDeviceToolChain(Toolchain::OFFLOAD_KINDS_CUDA);<br>

>> >><br>

>> >> This would allow a programming model to tune things here and there.<br>

>> >> Remember,<br>

>> >> that the same toolchain can, in general, be used by different<br>

>> >> programming<br>

>> >> models, and simultaneously by host and devices. So being able to do<br>

>> >> things<br>

>> >> based on a kind simplifies things a lot.<br>
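<br>
The kind-based queries above can be sketched as a bitmask carried by each toolchain. This is only an illustration under assumed names: ToyToolChain, the OFK_* values and auxTripleFor are hypothetical and do not correspond to existing clang driver classes.<br>

```cpp
#include <string>

// Hypothetical offload kinds, one bit each so a toolchain can serve several.
enum OffloadKind : unsigned {
  OFK_None = 0,
  OFK_CUDA = 1u << 0,
  OFK_OpenMP = 1u << 1
};

struct ToyToolChain {
  std::string Triple;
  unsigned HostKinds;    // programming models this toolchain hosts
  unsigned DeviceKinds;  // programming models this toolchain is a device for

  bool isHostFor(OffloadKind K) const { return (HostKinds & K) != 0; }
  bool isDeviceFor(OffloadKind K) const { return (DeviceKinds & K) != 0; }
};

// Returns the device triple a host-side tool should be told about for
// programming model K, or an empty string when TC is not a host for K
// or Device is not a device for K.
std::string auxTripleFor(const ToyToolChain &TC, OffloadKind K,
                         const ToyToolChain &Device) {
  if (TC.isHostFor(K) && Device.isDeviceFor(K))
    return Device.Triple;
  return "";
}
```

Because the kinds are independent bits, one toolchain can be the CUDA host and, at the same time, both host and device for OpenMP, which is the situation described in the paragraph above.<br>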

>> >><br>

>> >>><br>

>> >>><br>

>> >>> > 3. We are willing to help with implementation of CUDA-specific parts<br>

>> >>> > when<br>

>> >>> > they overlap with the common infrastructure; though we expect that<br>

>> >>> > effort to<br>

>> >>> > be driven also by other contributors specifically interested in CUDA<br>

>> >>> > support<br>

>> >>> > that have the necessary know-how (both on CUDA itself and how it is<br>

>> >>> > supported<br>

>> >>> > in Clang / LLVM).<br>

>> >>><br>

>> >>> Given that this is work that doesn't really help CUDA (the driver works fine<br>

>> >>> for us as-is), I am not sure we'll be able to devote significant resources to<br>

>> >>> this project.  Of course we'll be available to assist with relevant code<br>

>> >>> reviews and give advice.<br>

>> >>><br>

>> >>> I think like any other change to clang, the responsibility will rest<br>

>> >>> on<br>

>> >>> the<br>

>> >>> authors not to break existing functionality, at the very least<br>

>> >>> inasmuch as<br>

>> >>> is<br>

>> >>> checked by existing unit tests.<br>

>> >>><br>

>> >><br>

>> >> Sure, having your feedback/suggestions and help with code review is all<br>

>> >> we<br>

>> >> ask for! We will try not to break anything (and if for some reason we<br>

>> >> do<br>

>> >> we will fix it right away). Also, if we find opportunities to improve the<br>

>> >> CUDA<br>

>> >> support we will be happy to contribute that as well.<br>

>> >><br>

>> >> I hope I addressed the concerns you expressed initially. Let me know<br>

>> >> any<br>

>> >> other thoughts you may have.<br>

>> >><br>

>> >> Thanks again!<br>

>> >> Samuel<br>

>> >><br>

>> >>><br>

>> >>> Regards,<br>

>> >>> -Justin<br>

>> >>><br>

>> >>> On Thu, Mar 3, 2016 at 12:03 PM, Samuel F Antao via cfe-dev<br>

>> >>> <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>> wrote:<br>

>> >>> > Hi Chris,<br>

>> >>> ><br>

>> >>> > I agree with Andrey when he says this should be a separate<br>

>> >>> > discussion.<br>

>> >>> ><br>

>> >>> > I think that aiming at having a library that would support any<br>

>> >>> > possible<br>

>> >>> > programming model would take a long time, as it requires a lot of<br>

>> >>> > consensus<br>

>> >>> > namely from whoever is maintaining programming models already in clang<br>

>> >>> > (e.g.<br>

>> >>> > CUDA). We should try to have something incremental.<br>

>> >>> ><br>

>> >>> > I'm happy to discuss and know more about the design and code you<br>

>> >>> > would<br>

>> >>> > like<br>

>> >>> > to contribute to this, but I think you should post it in a different<br>

>> >>> > thread.<br>

>> >>> ><br>

>> >>> > Thanks,<br>

>> >>> > Samuel<br>

>> >>> ><br>

>> >>> > 2016-03-03 11:20 GMT-05:00 C Bergström <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>>:<br>

>> >>> >><br>

>> >>> >> On Thu, Mar 3, 2016 at 10:19 PM, Ronan Keryell <<a href="mailto:ronan@keryell.fr" target="_blank">ronan@keryell.fr</a>><br>

>> >>> >> wrote:<br>

>> >>> >> >>>>>> On Thu, 3 Mar 2016 18:19:43 +0700, C Bergström via cfe-dev<br>

>> >>> >> >>>>>> <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>> said:<br>

>> >>> >> ><br>

>> >>> >> >     C> On Thu, Mar 3, 2016 at 5:50 PM, Ronan KERYELL via cfe-dev<br>

>> >>> >> >     C> <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>> wrote:<br>

>> >>> >> ><br>

>> >>> >> >     >> Just to be sure to understand: you are thinking about<br>

>> >>> >> > being<br>

>> >>> >> > able<br>

>> >>> >> >     >> to outline several "languages" at once, such as CUDA *and*<br>

>> >>> >> >     >> OpenMP, right ?<br>

>> >>> >> >     >><br>

>> >>> >> >     >> I think it is required for serious applications. For<br>

>> >>> >> > example,<br>

>> >>> >> > in<br>

>> >>> >> >     >> the HPC world, it is common to have hybrid multi-node<br>

>> >>> >> >     >> heterogeneous applications that use MPI+OpenMP+OpenCL for<br>

>> >>> >> >     >> example. Since MPI and OpenCL are just libraries, there is<br>

>> >>> >> > only<br>

>> >>> >> >     >> OpenMP to off-load here. But if we move to OpenCL SYCL<br>

>> >>> >> > instead<br>

>> >>> >> >     >> with MPI+OpenMP+SYCL then both OpenMP and SYCL have to be<br>

>> >>> >> > managed<br>

>> >>> >> >     >> by the Clang off-loading infrastructure at the same time<br>

>> >>> >> > and<br>

>> >>> >> > be<br>

>> >>> >> >     >> sure they combine gracefully...<br>

>> >>> >> >     >><br>

>> >>> >> >     >> I think your second proposal about (un)bundling can<br>

>> >>> >> > already<br>

>> >>> >> >     >> manage this.<br>

>> >>> >> >     >><br>

>> >>> >> >     >> Otherwise, what about the code outlining itself used in<br>

>> >>> >> > the<br>

>> >>> >> >     >> off-loading process? The code generation itself requires<br>

>> >>> >> > to<br>

>> >>> >> >     >> outline the kernel code to some external functions to be<br>

>> >>> >> > compiled<br>

>> >>> >> >     >> by the kernel compiler. Do you think it is up to the<br>

>> >>> >> > programmer<br>

>> >>> >> >     >> to re-use the recipes used by OpenMP and CUDA for example<br>

>> >>> >> > or<br>

>> >>> >> > it<br>

>> >>> >> >     >> would be interesting to have a third proposal to abstract<br>

>> >>> >> > more<br>

>> >>> >> >     >> the outliner to be configurable to handle globally OpenMP,<br>

>> >>> >> > CUDA,<br>

>> >>> >> >     >> SYCL...?<br>

>> >>> >> ><br>

>> >>> >> >     C> Some very good points above and back to my broken record..<br>

>> >>> >> ><br>

>> >>> >> >     C> If all offloading is done in a single unified library -<br>

>> >>> >> >     C> a. Lowering in LLVM is greatly simplified since there's<br>

>> >>> >> > ***1***<br>

>> >>> >> >     C> offload API to be supported A region that's outlined for<br>

>> >>> >> > SYCL,<br>

>> >>> >> >     C> CUDA or something else is essentially the same thing. (I<br>

>> >>> >> > do<br>

>> >>> >> >     C> realize that some transformation may be highly target<br>

>> >>> >> > specific,<br>

>> >>> >> >     C> but to me that's more target hw driven than programming<br>

>> >>> >> > model<br>

>> >>> >> >     C> driven)<br>

>> >>> >> ><br>

>> >>> >> >     C> b. Mixing CUDA/OMP/ACC/Foo in theory may "just work" since<br>

>> >>> >> > the<br>

>> >>> >> >     C> same runtime will handle them all. (With the limitation<br>

>> >>> >> > that<br>

>> >>> >> > if<br>

>> >>> >> >     C> you want CUDA to *talk to* OMP or something else there<br>

>> >>> >> > needs<br>

>> >>> >> > to<br>

>> >>> >> >     C> be some glue.  I'm merely saying that 1 application with<br>

>> >>> >> > multiple<br>

>> >>> >> >     C> models in a way that won't conflict)<br>

>> >>> >> ><br>

>> >>> >> >     C> c. The driver doesn't need to figure out do I link against<br>

>> >>> >> > some<br>

>> >>> >> >     C> or a multitude of combining/conflicting libcuda, libomp,<br>

>> >>> >> >     C> libsomething - it's liboffload - done<br>

>> >>> >> ><br>

>> >>> >> > Yes, a unified target library would help.<br>

>> >>> >> ><br>

>> >>> >> >     C> The driver proposal and the liboffload proposal should<br>

>> >>> >> > imnsho<br>

>> >>> >> > be<br>

>> >>> >> >     C> tightly coupled and work together as *1*. The goals are<br>

>> >>> >> >     C> significantly overlapping and relevant. If you get the<br>

>> >>> >> > liboffload<br>

>> >>> >> >     C> OMP people to make that more agnostic - I think it<br>

>> >>> >> > simplifies<br>

>> >>> >> > the<br>

>> >>> >> >     C> driver work.<br>

>> >>> >> ><br>

>> >>> >> > So basically it is about introducing a fourth unification:<br>

>> >>> >> > liboffload.<br>

>> >>> >> ><br>

>> >>> >> > A great unification sounds great.<br>

>> >>> >> > My only concern is that if we tie everything together, it would<br>

>> >>> >> > increase<br>

>> >>> >> > the entry cost: all the different components should be ready in<br>

>> >>> >> > lock-step.<br>

>> >>> >> > If there is already a runtime available, it would be easier to<br>

>> >>> >> > start<br>

>> >>> >> > with and develop the other part in the meantime.<br>

>> >>> >> > So from a pragmatic agile point-of-view, I would prefer not to<br>

>> >>> >> > impose<br>

>> >>> >> > a<br>

>> >>> >> > strong unification.<br>

>> >>> >><br>

>> >>> >> I think I may not be explaining clearly - let me elaborate by example<br>

>> >>> >> a<br>

>> >>> >> bit<br>

>> >>> >> below<br>

>> >>> >><br>

>> >>> >> > In the proposal of Samuel, all the parts seem independent.<br>

>> >>> >> ><br>

>> >>> >> >     C>   ------ More specific to this proposal - device<br>

>> >>> >> >     C> linker vs host linker. What do you do for IPA/LTO or whole<br>

>> >>> >> >     C> program optimizations? (Outside the scope of this<br>

>> >>> >> > project.. ?)<br>

>> >>> >> ><br>

>> >>> >> > Ouch. I did not think about it. It sounds like science-fiction<br>

>> >>> >> > for<br>

>> >>> >> > now. :-) Probably outside the scope of this project..<br>

>> >>> >><br>

>> >>> >> It should certainly not be science fiction or an after-thought. I<br>

>> >>> >> won't go into shameless self promotion, but there are certainly<br>

>> >>> >> useful<br>

>> >>> >> things you can do when you have a "whole device kernel"<br>

>> >>> >> perspective.<br>

>> >>> >><br>

>> >>> >> To digress into the liboffload component of this (sorry)<br>

>> >>> >> what we have today is basically liboffload/src/all source files<br>

>> >>> >> mucked<br>

>> >>> >> together<br>

>> >>> >><br>

>> >>> >> What I'm proposing would look more like this<br>

>> >>> >><br>

>> >>> >> liboffload/src/common_middle_layer_glue # to start this may be<br>

>> >>> >> "best<br>

>> >>> >> effort"<br>

>> >>> >> liboffload/src/omp # This code should exist today, but ideally<br>

>> >>> >> should<br>

>> >>> >> build on top of the middle layer<br>

>> >>> >> liboffload/src/ptx # this may exist today - not sure<br>

>> >>> >> liboffload/src/amd_gpu # probably doesn't exist, but<br>

>> >>> >> wouldn't/shouldn't block anything<br>

>> >>> >> liboffload/src/phi # may exist in some form<br>

>> >>> >> liboffload/src/cuda # may exist in some form outside of the OMP<br>

>> >>> >> work<br>

>> >>> >><br>

>> >>> >> The end result would be liboffload.<br>

>> >>> >><br>

>> >>> >> Code above the common middle layer API is programming-model specific, and<br>

>> >>> >> code below it is hardware specific. To add a new hw backend you just implement the<br>

>> >>> >> things the middle layer needs. To add a new programming model you<br>

>> >>> >> build on top of the common layer. I'm not trying to force<br>

>> >>> >> anyone/everyone to switch to this now - I'm hoping that by being a<br>

>> >>> >> squeaky wheel this isolation of design and layers is there from the<br>

>> >>> >> start - even if not perfect. I think it's sloppy to not consider<br>

>> >>> >> this<br>

>> >>> >> actually. LLVM's code generation is clean and has a nice separation<br>

>> >>> >> per target (for the most part) - why should the offload library<br>

>> >>> >> have<br>

>> >>> >> bad design which just needs to be refactored later? I've seen<br>

>> >>> >> others<br>

>> >>> >> in the community beat up Intel to force them to have higher quality<br>

>> >>> >> code before inclusion... some of this may actually be just minor<br>

>> >>> >> refactoring to come close to the target. (No pun intended)<br>

>> >>> >> -------------<br>

>> >>> >> If others become open to this design - I'm happy to contribute more<br>

>> >>> >> tangible details on the actual middle API.<br>

>> >>> >><br>

>> >>> >> The objects which the driver has to deal with may and probably do<br>

>> >>> >> overlap to some extent with the objects the liboffload has to load<br>

>> >>> >> or<br>

>> >>> >> deal with. Is there an API the driver can hook into to magically<br>

>> >>> >> handle that, or is it all per-device and one-off?<br>

>> >>> >> _______________________________________________<br>

>> >>> >> cfe-dev mailing list<br>

>> >>> >> <a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a><br>

>> >>> >> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev</a><br>


</blockquote></div></div>