<div dir="ltr">Hi Justin, Eric,<div><br></div><div>Thanks again for your time to discuss this.</div><div><br></div><div>So, if we are to use a wrapper around the driver you would be able to pack the outputs in whatever format. What about the inputs? We would need to add options to enable passing multiple inputs for the same compilation, right? Also, processing those inputs would require replicating a lot of what the driver already does in terms of the checks on the inputs, don't you think? </div><div><br></div><div>Thanks again,</div><div>Samuel</div></div><div class="gmail_extra"><br><div class="gmail_quote">2016-03-04 17:26 GMT-05:00 Eric Christopher via cfe-dev <span dir="ltr"><<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>></span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><br><br><div class="gmail_quote"><span class=""><div dir="ltr">On Fri, Mar 4, 2016 at 2:21 PM Justin Lebar via cfe-dev <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">> So, in your opinion, should we create an action for each programming model or<br>
> should we have a generic one?<br>
<br>
We currently have generic Actions, like "CompileAction". I think those should<br>
stay? BindArch and the like add a lot of complexity, maybe there's a way to<br>
get rid of those, merging their information into the other Actions.<br>
<br>
Does that answer your question? I'm afraid I may be misunderstanding.<br>
<br>
> I have some application that I've been compiling with clang, and I usually<br>
> just run "make". Now I read somewhere that a new release of clang has<br>
> support for CUDA and I happen to have a nice loop that I could implement with<br>
> CUDA. So, I add a new file with the new implementation, then I run "make", it<br>
> compiles but when I run it crashes. The reason it crashes is that I was using<br>
> separate compilation and now I need to change all my makefile rules to<br>
> forward a new kind of file, that I may not even know what it is.<br>
<br>
Again, I do not think that we should make up new file formats and incorporate<br>
them into clang so that people can use new compiler features without modifying<br>
their makefiles.<br>
<br>
I think it is far more important that low-level tools such as ld and objdump<br>
continue to work on the files that the compiler outputs. That likely means<br>
we'll have to output N separate files, one for the host and one for each device<br>
arch.<br>
<br>
But hey, this is just my opinion, and I'm a nobody here. No offense taken if<br>
the community decides otherwise.<br></blockquote><div><br></div></span><div>I haven't disagreed with anything you've said yet :)</div><div><br></div><div>-eric</div><div><div class="h5"><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
On Fri, Mar 4, 2016 at 2:14 PM, Samuel F Antao <<a href="mailto:sfantao@us.ibm.com" target="_blank">sfantao@us.ibm.com</a>> wrote:<br>
><br>
><br>
> 2016-03-04 14:40 GMT-05:00 Justin Lebar via cfe-dev<br>
> <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>>:<br>
>><br>
>> > If, as you say, building the Action graph for CUDA and OpenMP is<br>
>> > complicated, I think we should fix that.<br>
>><br>
>> It occurs to me that perhaps all you want is to build up the Action<br>
>> graph in a non-language-specific manner, and then pass that to e.g.<br>
>> CUDA-specific code that will massage the Action graph into what it<br>
>> wants.<br>
>><br>
>> I don't know if that would be an improvement over the current<br>
>> situation -- there are a lot of edge cases -- but it might.<br>
><br>
><br>
> That's a possible approach. Could be a good way to organize it. However, if<br>
> you have two different programming models those transformations would happen<br>
> in a given sequence, so the one that comes last will have to be aware of the<br>
> programming model that was used for the first transformation. This wouldn't<br>
> be as clean as having the host actions (which are always the same for a<br>
> given file and options) and have all the job generation to orbit around<br>
> that.<br>
><br>
> Let me study the problem of doing this with actions and see all the possible<br>
> implications.<br>
><br>
>><br>
>><br>
>> On Fri, Mar 4, 2016 at 11:34 AM, Justin Lebar <<a href="mailto:jlebar@google.com" target="_blank">jlebar@google.com</a>> wrote:<br>
>> >> This has two objectives. One is to avoid the creation of actions that<br>
>> >> are programming model specific. The other is to remove complexity from the<br>
>> >> action creation that would have to mix phases and different programming<br>
>> >> models DAG requirements<br>
>> ><br>
>> > As I understand this, we're saying that we'll build up an action<br>
>> > graph, but it is sort of a lie, in that it does not encapsulate all of<br>
>> > the logic we're interested in. Then, when we convert the actions into<br>
>> > jobs, we'll postprocess them using language-specific logic to make the<br>
>> > jobs do what we want.<br>
>> ><br>
>> > I am not in favor of this approach, as I understand it. Although I<br>
>> > acknowledge that it would simplify building the Action graph itself,<br>
>> > it does so by moving this complexity into a "shadow Action graph" --<br>
>> > the DAG that *actually* describes what we're going to do (which may<br>
>> > never be explicitly constructed, but still exists in our minds). I<br>
>> > don't think this is actually a simplification.<br>
>> ><br>
>> > If, as you say, building the Action graph for CUDA and OpenMP is<br>
>> > complicated, I think we should fix that. Then we'll be able to<br>
>> > continue using our existing tools to e.g. inspect the Action graph<br>
>> > generated by the driver.<br>
>> ><br>
>> >> I see the driver already as a wrapper, so I don't think it is<br>
>> >> inappropriate to use it.<br>
>> ><br>
>> > You and I, being compiler hackers, understand that the driver is a<br>
>> > wrapper. However, to a user, the driver is the compiler. No build<br>
>> > system invokes clang -cc1 directly.<br>
>> ><br>
>> >> However, I think the creation of the blob should be done by an external<br>
>> >> tool, say, as if it were a linker.<br>
>> ><br>
>> > Sure, but this isn't the difference I was getting at. What I was<br>
>> > trying to say is that the creation of the blob should be done by a<br>
>> > tool which is external to the compiler *from the perspective of the<br>
>> > user*. Meaning that, the driver should not invoke this tool. If the<br>
>> > user wants it, they can invoke it explicitly (as they might use tar to<br>
>> > bundle their object files).<br>
>> ><br>
>> >> I'd put it in this way: a bundled file should work as a normal host<br>
>> >> file, regardless of what device code it embeds.<br>
>> ><br>
>> > OK, but this still makes all existing tools useless if I want to<br>
>> > inspect device code. If you give me a .o file and tell me that it's<br>
>> > device code, I can inspect it, disassemble it, or whatever using<br>
>> > existing tools. If it's a bundle in a file format we made up here on<br>
>> > this list, there's very little chance existing tools are going to let<br>
>> > me get the device code out in a sensible way.<br>
>> ><br>
>> > Again, I don't think that inventing file formats -- however simple --<br>
>> > is a business that we should be getting into.<br>
>> ><br>
>> >> Even for ELF, I agree putting the code in some section is more elegant.<br>
>> >> I'll investigate the possibilities to implement that.<br>
>> ><br>
>> > Maybe, but unless there's a way to annotate that section and say "this<br>
>> > section contains code for architecture foo", then objdump isn't going<br>
>> > to work sensibly on that section, and I think that's basically game<br>
>> > over.<br>
>> ><br>
>> >> On the other side, we have text files. My opinion is that we should have<br>
>> >> something that is easy to read and edit. What would a bundled text file look<br>
>> >> like in your opinion?<br>
>> ><br>
>> > Similarly, this will not interoperate with any existing tools, and I<br>
>> > think that's job zero.<br>
>> ><br>
>> > On Fri, Mar 4, 2016 at 11:06 AM, Samuel F Antao <<a href="mailto:sfantao@us.ibm.com" target="_blank">sfantao@us.ibm.com</a>><br>
>> > wrote:<br>
>> >> Hi Justin,<br>
>> >><br>
>> >> It's great to have your feedback!<br>
>> >><br>
>> >> 2016-03-03 17:09 GMT-05:00 Justin Lebar via cfe-dev<br>
>> >> <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>>:<br>
>> >>><br>
>> >>> Hi, I'm one of the people working on CUDA in clang.<br>
>> >>><br>
>> >>> In general I agree that the support for CUDA today is rather ad-hoc;<br>
>> >>> it<br>
>> >>> can<br>
>> >>> likely be improved. However, there are many points in this proposal<br>
>> >>> that<br>
>> >>> I do<br>
>> >>> not understand. Inasmuch as I think I understand it, I am concerned<br>
>> >>> that<br>
>> >>> it's<br>
>> >>> adding new abstractions instead of fixing the existing ones, and<br>
>> >>> that<br>
>> >>> this<br>
>> >>> will result in a lot of additional complexity.<br>
>> >>><br>
>> >>> > a) Create toolchains for host and offload devices before creating<br>
>> >>> > the<br>
>> >>> > actions.<br>
>> >>> ><br>
>> >>> > The driver has to detect the employed programming models through the<br>
>> >>> > provided<br>
>> >>> > options (e.g. -fcuda or -fopenmp) or file extensions. For each host<br>
>> >>> > and<br>
>> >>> > offloading device and programming model, it should create a<br>
>> >>> > toolchain.<br>
>> >>><br>
>> >>> Seems sane to me.<br>
>> >>><br>
>> >>> > b) Keep the generation of Actions independent of the program model.<br>
>> >>> ><br>
>> >>> > In my view, the Actions should only depend on the compile phases<br>
>> >>> > requested by<br>
>> >>> > the user and the file extensions of the input files. Only the way<br>
>> >>> > those<br>
>> >>> > actions are interpreted to create jobs should be dependent on the<br>
>> >>> > programming<br>
>> >>> > model. This would avoid complicating the actions creation with<br>
>> >>> > dependencies<br>
>> >>> > that only make sense to some programming models, which would make<br>
>> >>> > the<br>
>> >>> > implementation hard to scale when new programming models are to be<br>
>> >>> > adopted.<br>
>> >>><br>
>> >>> I don't quite understand what you're proposing here, or what you're<br>
>> >>> trying<br>
>> >>> to<br>
>> >>> accomplish with this change.<br>
>> >>><br>
>> >>> Perhaps it would help if you could give a concrete example of how this<br>
>> >>> would<br>
>> >>> change e.g. CUDA or Mac universal binary compilation?<br>
>> >>><br>
>> >>> For example, in CUDA compilation, we have an action which says<br>
>> >>> "compile<br>
>> >>> everything below here as cuda arch sm_35". sm_35 comes from a<br>
>> >>> command-line<br>
>> >>> flag, so as I understand your proposal, this could not be in the<br>
>> >>> action<br>
>> >>> graph,<br>
>> >>> because it doesn't come from the filename or the compile phases<br>
>> >>> requested<br>
>> >>> by<br>
>> >>> the user. So, how will we express this notion that some actions<br>
>> >>> should be<br>
>> >>> compiled for a particular arch?<br>
>> >><br>
>> >><br>
>> >> This has two objectives. One is to avoid the creation of actions that<br>
>> >> are<br>
>> >> programming model specific. The other is to remove complexity from the<br>
>> >> action creation that would have to mix phases and different programming<br>
>> >> models DAG requirements - currently CUDA only requires one single<br>
>> >> dependency<br>
>> >> but if you have more programming models with different requirements and<br>
>> >> add<br>
>> >> separate compilation on top of that, the action generation will become<br>
>> >> complex and hard to scale. Just to clarify, I am not saying that<br>
>> >> creating<br>
>> >> actions for each programming model won't work, I just think that doing<br>
>> >> this<br>
>> >> differently will ensure that adding new programming models will be less<br>
>> >> disruptive as the programming model specifics will be contained in a<br>
>> >> single<br>
>> >> place.<br>
>> >><br>
>> >> The way I see it is that an action just packs some information<br>
>> >> processed<br>
>> >> from a bunch of input info. However, creating an action specific for a<br>
>> >> programming model does not prevent you from having to have dedicated<br>
>> >> logic<br>
>> >> to deal with it when the jobs are created. So, given that the input<br>
>> >> info<br>
>> >> that results in an action is also available when the jobs are created,<br>
>> >> what<br>
>> >> I propose is to do all the programming model specifics in a single<br>
>> >> place. We<br>
>> >> already have a cache of results in the jobs builder that could help<br>
>> >> navigate<br>
>> >> the dependences and, even better, the queries this cache can provide<br>
>> >> can be<br>
>> >> completely agnostic of the programming model.<br>
>> >><br>
>> >> Let me try to give you an example on how this proposal would affect<br>
>> >> CUDA:<br>
>> >><br>
>> >> - Let's assume that the actions are generated the same way they are for<br>
>> >> the<br>
>> >> host. And that we already have in the driver the host toolchain and<br>
>> >> also the<br>
>> >> nvptx toolchain, each marked with a new toolchain kind "CUDA" (these<br>
>> >> toolchains were inferred from the options used to invoke the driver<br>
>> >> and/or<br>
>> >> file extensions).<br>
>> >><br>
>> >> - The jobs start to be created for the host as usual.<br>
>> >><br>
>> >> - Before any job is constructed there would be a post-processing of<br>
>> >> the<br>
>> >> results, so that extra results could be appended if required by the<br>
>> >> programming model.<br>
>> >><br>
>> >> - This is what would happen in the post-processing function:<br>
>> >> {<br>
>> >> if (!isThisCUDAHostToolChain)<br>
>> >> return;<br>
>> >><br>
>> >> if (!ActionIsCompile)<br>
>> >> return;<br>
>> >><br>
>> >> if (InputActionDependence.type != TY_CUDA)<br>
>> >> return;<br>
>> >><br>
>> >> //Make checks currently in buildCudaActions()<br>
>> >><br>
>> >> DevTC = getDeviceToolChainOfKind(CUDA);<br>
>> >> Action *Asm = CachedResults().giveMeDependentAsmAction();<br>
>> >><br>
>> >> for (c : CUDAComputeCapabilities ) {<br>
>> >> NewResult = BuildJobsForAction(DevTC, Asm)<br>
>> >> // Or maybe better<br>
>> >> NewResult = BuildJobsForAction(DevTC, LinkAction(Asm))<br>
>> >><br>
>> >> Results.push_back(NewResult);<br>
>> >> }<br>
>> >> }<br>
>> >><br>
>> >> CachedResults would offer some extra functionality that is not<br>
>> >> programming<br>
>> >> model specific, and this would provide the same functionality the CUDA<br>
>> >> action is providing. Adding a new programming model would only require<br>
>> >> adding an instance of this post-process ( apart from the creation of<br>
>> >> the<br>
>> >> toolchains that would occur before anything starts to be done).<br>
>> >><br>
>> >> I agree these things are complicated to fully understand/explain based<br>
>> >> on a summary in an email. I'll try to come up with a proposal-patch early<br>
>> >> next<br>
>> >> week so that we have something more concrete to discuss.<br>
>> >><br>
>> >>><br>
>> >>><br>
>> >>> > c) Use unbundling and bundling tools agnostic of the programming<br>
>> >>> > model.<br>
>> >>> ><br>
>> >>> > I propose a single change in the action creation and that is the<br>
>> >>> > creation of<br>
>> >>> > an "unbundling" and "bundling" action whose goal is to spare the user<br>
>> >>> > from having to deal with multiple files generated from multiple toolchains<br>
>> >>> > (host toolchain and offloading devices’ toolchains) if he uses separate<br>
>> >>> > compilation in his build system.<br>
>> >>><br>
>> >>> I'm not sure I understand what "separate compilation" is here. Do you<br>
>> >>> mean, a<br>
>> >>> compilation strategy which outputs logically separate machine code for<br>
>> >>> each<br>
>> >>> architecture, only to have this code combined at link time? (In<br>
>> >>> contrast<br>
>> >>> to<br>
>> >>> how we currently compile CUDA, where the device code for a file is<br>
>> >>> integrated<br>
>> >>> into the host code for that file at compile time?)<br>
>> >><br>
>> >><br>
>> >> That's correct. With separate compilation I also mean the ability to<br>
>> >> link<br>
>> >> device side code, using a device linker (nvlink for CUDA).<br>
>> >><br>
>> >>><br>
>> >>> If that's right, then what I understand you're proposing here is that,<br>
>> >>> instead<br>
>> >>> of outputting N different object files -- one for the host, and N-1<br>
>> >>> for<br>
>> >>> all our<br>
>> >>> device architectures -- we'd just output one blob which clang would<br>
>> >>> understand<br>
>> >>> how to handle.<br>
>> >><br>
>> >><br>
>> >> Correct.<br>
>> >><br>
>> >>><br>
>> >>><br>
>> >>> For my part, I am highly wary of introducing a new file format into<br>
>> >>> clang's<br>
>> >>> output. Historically, clang (along with other compilers) does not<br>
>> >>> output<br>
>> >>> proprietary blobs. Instead, we output object files in<br>
>> >>> well-understood,<br>
>> >>> interoperable formats, such as ELF. This is beneficial because there<br>
>> >>> are<br>
>> >>> lots<br>
>> >>> of existing tools which can handle these files. It also allows e.g.<br>
>> >>> code<br>
>> >>> compiled with clang to be linked with g++.<br>
>> >>><br>
>> >>> Build tools are universally awful, and I sympathize with the urge not<br>
>> >>> to<br>
>> >>> change<br>
>> >>> them. But I don't think this is a business we want the compiler to be<br>
>> >>> in.<br>
>> >>> Instead, if a user wants this kind of "fat object file", they could<br>
>> >>> obtain<br>
>> >>> one<br>
>> >>> by using a simple wrapper around clang. If this wrapper's output<br>
>> >>> format<br>
>> >>> became<br>
>> >>> widely-used, we could then consider supporting it directly within<br>
>> >>> clang,<br>
>> >>> but<br>
>> >>> that's a proposition for many years in the future.<br>
>> >><br>
>> >><br>
>> >> I see the driver already as a wrapper, so I don't think it is<br>
>> >> inappropriate to use it. However, I think the creation of the blob should<br>
>> >> be done by an external tool, say, as if it were a linker. I have an initial<br>
>> >> proposal in<br>
>> >> <a href="http://lists.llvm.org/pipermail/cfe-dev/2016-February/047548.html" rel="noreferrer" target="_blank">http://lists.llvm.org/pipermail/cfe-dev/2016-February/047548.html</a>, but<br>
>> >> based on your input and also Jonas's, I have to rethink a few things.<br>
>> >><br>
>> >> I agree when you say that you would like to have the blob working well<br>
>> >> with<br>
>> >> other tools. Jonas in some previous email also expressed this concern.<br>
>> >> I'd<br>
>> >> put it in this way: a bundled file should work as a normal host file,<br>
>> >> regardless of what device code it embeds.<br>
>> >><br>
>> >> For ELF files this works just fine:<br>
>> >><br>
>> >> clang a.c -c -o a.o<br>
>> >> echo "Some offloading bytes" >> a.o<br>
>> >> clang a.o -o a.out<br>
>> >> a.out<br>
>> >><br>
>> >> However, for other binary formats, we need to wrap it in a different way. Even<br>
>> >> for<br>
>> >> ELF, I agree putting the code in some section is more elegant. I'll<br>
>> >> investigate the possibilities to implement that.<br>
>> >><br>
>> >> On the other side, we have text files. My opinion is that we should have<br>
>> >> something that is easy to read and edit. What would a bundled text file<br>
>> >> look like in your opinion?<br>
>> >><br>
>> >> Do you think having all the device code guarded as a comment at the<br>
>> >> bottom is acceptable? That would work well as a host file.<br>
>> >><br>
>> >>><br>
>> >>><br>
>> >>> > d) Allow the target toolchain to request the host toolchain to be<br>
>> >>> > used<br>
>> >>> > for a given action.<br>
>> >>><br>
>> >>> Seems sane to me.<br>
>> >>><br>
>> >>> > e) Use a job results cache to enable sharing results between device<br>
>> >>> > and<br>
>> >>> > host toolchains.<br>
>> >>><br>
>> >>> I don't understand why we need a cache for job results. Why can we<br>
>> >>> not<br>
>> >>> set up<br>
>> >>> the Action graph such that each node has the correct inputs? (You've<br>
>> >>> actually<br>
>> >>> sketched exactly what I think the Action graph should look like, for<br>
>> >>> CUDA<br>
>> >>> and<br>
>> >>> OpenMP compilations.)<br>
>> >><br>
>> >><br>
>> >> I think what I explain above covers this one. If not, please let me<br>
>> >> know.<br>
>> >> Just to summarize, I'm not saying expressing things in Actions won't<br>
>> >> work, I<br>
>> >> just think that will be more complex if we have multiple programming<br>
>> >> models<br>
>> >> (all potentially used in the same compile) and separate compilation in<br>
>> >> place. We already have a cache in the jobs builder, I was just planning<br>
>> >> to<br>
>> >> leverage that.<br>
>> >><br>
>> >>><br>
>> >>><br>
>> >>> > f) Intercept the jobs creation before the emission of the command.<br>
>> >>> ><br>
>> >>> > In my view this is the only change required in the driver (apart<br>
>> >>> > from<br>
>> >>> > the<br>
>> >>> > obvious toolchain changes) that would be dependent on the<br>
>> >>> > programming<br>
>> >>> > model.<br>
>> >>> > A job result post-processing function could check that there are<br>
>> >>> > offloading<br>
>> >>> > toolchains to be used and spawn the jobs creation for those<br>
>> >>> > toolchains<br>
>> >>> > as<br>
>> >>> > well as append results from one toolchain to the results of some<br>
>> >>> > other<br>
>> >>> > according to the programming model implementation needs.<br>
>> >>><br>
>> >>> Again it's not clear to me why we cannot and should not represent this<br>
>> >>> in<br>
>> >>> the<br>
>> >>> Action graph. It's that graph that's supposed to tell us what we're<br>
>> >>> going<br>
>> >>> to<br>
>> >>> do.<br>
>> >><br>
>> >><br>
>> >> I guess I covered this above, if not let me know.<br>
>> >><br>
>> >>><br>
>> >>><br>
>> >>> > g) Reflect the offloading programming model in the naming of the<br>
>> >>> > save-temps files.<br>
>> >>><br>
>> >>> We already do this somewhat; e.g. for CUDA with save-temps, we'll<br>
>> >>> output<br>
>> >>> foo.s<br>
>> >>> and foo-sm_35.s. Extending this to be more robust (e.g. including the<br>
>> >>> triple)<br>
>> >>> seems fine.<br>
>> >><br>
>> >><br>
>> >> Yes, programming model, host/device (in OpenMP the same triple can be used<br>
>> >> for<br>
>> >> both host and device), and bound arch will make sure we get unique<br>
>> >> names.<br>
>> >><br>
>> >>><br>
>> >>><br>
>> >>> > h) Use special options -target-offload=<triple> to specify<br>
>> >>> > offloading<br>
>> >>> > targets and delimit options meant for a toolchain.<br>
>> >>><br>
>> >>> I think I agree that we should generalize the flags we're using.<br>
>> >>><br>
>> >>> I'm not sold on the name or structure (I'm not aware of any other<br>
>> >>> flags<br>
>> >>> that<br>
>> >>> affect *all* flags following them?), but we can bikeshed about that<br>
>> >>> separately.<br>
>> >><br>
>> >><br>
>> >> I guess we only have -Xblah and friends to change how the next option<br>
>> >> is<br>
>> >> used. I agree, this issue is in many ways orthogonal to everything<br>
>> >> else<br>
>> >> in this proposal, we can address it separately.<br>
>> >><br>
>> >>><br>
>> >>><br>
>> >>> > i) Use the offload kinds in the toolchain to drive the commands<br>
>> >>> > generation by Tools.<br>
>> >>><br>
>> >>> I'm not sure exactly what this means, but it doesn't sound<br>
>> >>> particularly contentious. :)<br>
>> >><br>
>> >><br>
>> >> Sorry about that... My explanations get convoluted sometimes...<br>
>> >><br>
>> >> What I mean is that, instead of relying on a file input, or attributes<br>
>> >> of an<br>
>> >> action, a command can be generated by looking at the offloading kind of<br>
>> >> the<br>
>> >> toolchain.<br>
>> >><br>
>> >> E.g.<br>
>> >><br>
>> >> isCuda = isToolChainKind(Toolchain::OFFLOAD_KINDS_CUDA);<br>
>> >><br>
>> >> or<br>
>> >><br>
>> >> if(isHostToolChainKind(Toolchain::OFFLOAD_KINDS_CUDA))<br>
>> >> AuxTriple = getDeviceToolChain(Toolchain::OFFLOAD_KINDS_CUDA);<br>
>> >><br>
>> >> This would allow a programming model to tune things here and there.<br>
>> >> Remember,<br>
>> >> that the same toolchain can, in general, be used by different<br>
>> >> programming<br>
>> >> models, and simultaneously by host and devices. So being able to do<br>
>> >> things<br>
>> >> based on a kind simplifies things a lot.<br>
>> >><br>
>> >>><br>
>> >>><br>
>> >>> > 3. We are willing to help with implementation of CUDA-specific parts<br>
>> >>> > when<br>
>> >>> > they overlap with the common infrastructure; though we expect that<br>
>> >>> > effort to<br>
>> >>> > be driven also by other contributors specifically interested in CUDA<br>
>> >>> > support<br>
>> >>> > that have the necessary know-how (both on CUDA itself and how it is<br>
>> >>> > supported<br>
>> >>> > in Clang / LLVM).<br>
>> >>><br>
>> >>> Given that this is work that doesn't really help CUDA (the driver<br>
>> >>> works<br>
>> >>> fine<br>
>> >>> for us as-is), I am not sure we'll be able to devote significant<br>
>> >>> resources<br>
>> >>> to<br>
>> >>> this project. Of course we'll be available to assist with relevant<br>
>> >>> code reviews and give advice.<br>
>> >>><br>
>> >>> I think like any other change to clang, the responsibility will rest<br>
>> >>> on<br>
>> >>> the<br>
>> >>> authors not to break existing functionality, at the very least<br>
>> >>> inasmuch as<br>
>> >>> is<br>
>> >>> checked by existing unit tests.<br>
>> >>><br>
>> >><br>
>> >> Sure, having your feedback/suggestions and help with code review is all<br>
>> >> we<br>
>> >> ask for! We will try not to break anything (and if for some reason we<br>
>> >> do, we will fix it right away). Also, if we find opportunities to improve the<br>
>> >> CUDA<br>
>> >> support we will be happy to contribute that as well.<br>
>> >><br>
>> >> I hope I addressed the concerns you expressed initially. Let me know<br>
>> >> any<br>
>> >> other thoughts you may have.<br>
>> >><br>
>> >> Thanks again!<br>
>> >> Samuel<br>
>> >><br>
>> >>><br>
>> >>> Regards,<br>
>> >>> -Justin<br>
>> >>><br>
>> >>> On Thu, Mar 3, 2016 at 12:03 PM, Samuel F Antao via cfe-dev<br>
>> >>> <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>> wrote:<br>
>> >>> > Hi Chris,<br>
>> >>> ><br>
>> >>> > I agree with Andrey when he says this should be a separate<br>
>> >>> > discussion.<br>
>> >>> ><br>
>> >>> > I think that aiming at having a library that would support any<br>
>> >>> > possible<br>
>> >>> > programming model would take a long time, as it requires a lot of<br>
>> >>> > consensus<br>
>> >>> > namely from whoever is maintaining programming models already in clang<br>
>> >>> > (e.g.<br>
>> >>> > CUDA). We should try to have something incremental.<br>
>> >>> ><br>
>> >>> > I'm happy to discuss and know more about the design and code you<br>
>> >>> > would<br>
>> >>> > like<br>
>> >>> > to contribute to this, but I think you should post it in a different<br>
>> >>> > thread.<br>
>> >>> ><br>
>> >>> > Thanks,<br>
>> >>> > Samuel<br>
>> >>> ><br>
>> >>> > 2016-03-03 11:20 GMT-05:00 C Bergström <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>>:<br>
>> >>> >><br>
>> >>> >> On Thu, Mar 3, 2016 at 10:19 PM, Ronan Keryell <<a href="mailto:ronan@keryell.fr" target="_blank">ronan@keryell.fr</a>><br>
>> >>> >> wrote:<br>
>> >>> >> >>>>>> On Thu, 3 Mar 2016 18:19:43 +0700, C Bergström via cfe-dev<br>
>> >>> >> >>>>>> <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>> said:<br>
>> >>> >> ><br>
>> >>> >> > C> On Thu, Mar 3, 2016 at 5:50 PM, Ronan KERYELL via cfe-dev<br>
>> >>> >> > C> <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>> wrote:<br>
>> >>> >> ><br>
>> >>> >> > >> Just to be sure to understand: you are thinking about<br>
>> >>> >> > being<br>
>> >>> >> > able<br>
>> >>> >> > >> to outline several "languages" at once, such as CUDA *and*<br>
>> >>> >> > >> OpenMP, right ?<br>
>> >>> >> > >><br>
>> >>> >> > >> I think it is required for serious applications. For<br>
>> >>> >> > example,<br>
>> >>> >> > in<br>
>> >>> >> > >> the HPC world, it is common to have hybrid multi-node<br>
>> >>> >> > >> heterogeneous applications that use MPI+OpenMP+OpenCL for<br>
>> >>> >> > >> example. Since MPI and OpenCL are just libraries, there is<br>
>> >>> >> > only<br>
>> >>> >> > >> OpenMP to off-load here. But if we move to OpenCL SYCL<br>
>> >>> >> > instead<br>
>> >>> >> > >> with MPI+OpenMP+SYCL then both OpenMP and SYCL have to be<br>
>> >>> >> > managed<br>
>> >>> >> > >> by the Clang off-loading infrastructure at the same time<br>
>> >>> >> > and<br>
>> >>> >> > be<br>
>> >>> >> > >> sure they combine gracefully...<br>
>> >>> >> > >><br>
>> >>> >> > >> I think your second proposal about (un)bundling can<br>
>> >>> >> > already<br>
>> >>> >> > >> manage this.<br>
>> >>> >> > >><br>
>> >>> >> > >> Otherwise, what about the code outlining itself used in<br>
>> >>> >> > the<br>
>> >>> >> > >> off-loading process? The code generation itself requires<br>
>> >>> >> > to<br>
>> >>> >> > >> outline the kernel code to some external functions to be<br>
>> >>> >> > compiled<br>
>> >>> >> > >> by the kernel compiler. Do you think it is up to the<br>
>> >>> >> > programmer<br>
>> >>> >> > >> to re-use the recipes used by OpenMP and CUDA for example<br>
>> >>> >> > or<br>
>> >>> >> > it<br>
>> >>> >> > >> would be interesting to have a third proposal to abstract<br>
>> >>> >> > more<br>
>> >>> >> > >> the outliner to be configurable to handle globally OpenMP,<br>
>> >>> >> > CUDA,<br>
>> >>> >> > >> SYCL...?<br>
>> >>> >> ><br>
>> >>> >> > C> Some very good points above and back to my broken record..<br>
>> >>> >> ><br>
>> >>> >> > C> If all offloading is done in a single unified library -<br>
>> >>> >> > C> a. Lowering in LLVM is greatly simplified since there's<br>
>> >>> >> > ***1***<br>
>> >>> >> >     C> offload API to be supported. A region that's outlined for<br>
>> >>> >> > SYCL,<br>
>> >>> >> > C> CUDA or something else is essentially the same thing. (I<br>
>> >>> >> > do<br>
>> >>> >> > C> realize that some transformation may be highly target<br>
>> >>> >> > specific,<br>
>> >>> >> > C> but to me that's more target hw driven than programming<br>
>> >>> >> > model<br>
>> >>> >> > C> driven)<br>
>> >>> >> ><br>
>> >>> >> > C> b. Mixing CUDA/OMP/ACC/Foo in theory may "just work" since<br>
>> >>> >> > the<br>
>> >>> >> > C> same runtime will handle them all. (With the limitation<br>
>> >>> >> > that<br>
>> >>> >> > if<br>
>> >>> >> > C> you want CUDA to *talk to* OMP or something else there<br>
>> >>> >> > needs<br>
>> >>> >> > to<br>
>> >>> >> > C> be some glue. I'm merely saying that 1 application with<br>
>> >>> >> > multiple<br>
>> >>> >> > C> models in a way that won't conflict)<br>
>> >>> >> ><br>
>> >>> >> > C> c. The driver doesn't need to figure out do I link against<br>
>> >>> >> > some<br>
>> >>> >> > C> or a multitude of combining/conflicting libcuda, libomp,<br>
>> >>> >> > C> libsomething - it's liboffload - done<br>
>> >>> >> ><br>
>> >>> >> > Yes, a unified target library would help.<br>
>> >>> >> ><br>
>> >>> >> > C> The driver proposal and the liboffload proposal should imnsho be<br>
>> >>> >> > C> tightly coupled and work together as *1*. The goals are<br>
>> >>> >> > C> significantly overlapping and relevant. If you get the liboffload<br>
>> >>> >> > C> OMP people to make that more agnostic - I think it simplifies the<br>
>> >>> >> > C> driver work.<br>
>> >>> >> ><br>
>> >>> >> > So basically it is about introducing a fourth unification: liboffload.<br>
>> >>> >> ><br>
>> >>> >> > A great unification sounds great.<br>
>> >>> >> > My only concern is that if we tie everything together, it would<br>
>> >>> >> > increase the entry cost: all the different components would have<br>
>> >>> >> > to be ready in lock-step.<br>
>> >>> >> > If there is already a runtime available, it would be easier to start<br>
>> >>> >> > with it and develop the other parts in the meantime.<br>
>> >>> >> > So from a pragmatic agile point-of-view, I would prefer not to impose<br>
>> >>> >> > a strong unification.<br>
>> >>> >><br>
>> >>> >> I think I may not be explaining clearly - let me elaborate by example<br>
>> >>> >> a bit below<br>
>> >>> >><br>
>> >>> >> > In the proposal of Samuel, all the parts seem independent.<br>
>> >>> >> ><br>
>> >>> >> > C> ------ More specific to this proposal - device<br>
>> >>> >> > C> linker vs host linker. What do you do for IPA/LTO or whole<br>
>> >>> >> > C> program optimizations? (Outside the scope of this<br>
>> >>> >> > project.. ?)<br>
>> >>> >> ><br>
>> >>> >> > Ouch. I did not think about it. It sounds like science-fiction for<br>
>> >>> >> > now. :-) Probably outside the scope of this project..<br>
>> >>> >><br>
>> >>> >> It should certainly not be science fiction or an after-thought. I<br>
>> >>> >> won't go into shameless self promotion, but there are certainly useful<br>
>> >>> >> things you can do when you have a "whole device kernel" perspective.<br>
>> >>> >><br>
>> >>> >> To digress into the liboffload component of this (sorry)<br>
>> >>> >> what we have today is basically liboffload/src/all source files mucked together<br>
>> >>> >><br>
>> >>> >> What I'm proposing would look more like this<br>
>> >>> >><br>
>> >>> >> liboffload/src/common_middle_layer_glue # to start this may be "best effort"<br>
>> >>> >> liboffload/src/omp # This code should exist today, but ideally should build on top of the middle layer<br>
>> >>> >> liboffload/src/ptx # this may exist today - not sure<br>
>> >>> >> liboffload/src/amd_gpu # probably doesn't exist, but wouldn't/shouldn't block anything<br>
>> >>> >> liboffload/src/phi # may exist in some form<br>
>> >>> >> liboffload/src/cuda # may exist in some form outside of the OMP work<br>
>> >>> >><br>
>> >>> >> The end result would be liboffload.<br>
>> >>> >><br>
>> >>> >> The layers above and below the common middle layer API are<br>
>> >>> >> programming model or hardware specific, respectively. To add a new<br>
>> >>> >> hw backend you just implement the things the middle layer needs. To<br>
>> >>> >> add a new programming model you build on top of the common layer.<br>
>> >>> >> I'm not trying to force anyone/everyone to switch to this now - I'm<br>
>> >>> >> hoping that by being a squeaky wheel this isolation of design and<br>
>> >>> >> layers is there from the start - even if not perfect. I think it's<br>
>> >>> >> sloppy to not consider this actually. LLVM's code generation is clean<br>
>> >>> >> and has a nice separation per target (for the most part) - why should<br>
>> >>> >> the offload library have a bad design which just needs to be<br>
>> >>> >> refactored later? I've seen others in the community beat up Intel to<br>
>> >>> >> force them to have higher quality code before inclusion... some of<br>
>> >>> >> this may actually be just minor refactoring to come close to the<br>
>> >>> >> target. (No pun intended)<br>
>> >>> >> -------------<br>
>> >>> >> If others become open to this design - I'm happy to contribute more<br>
>> >>> >> tangible details on the actual middle API.<br>
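<br>
As a rough illustration of the layering described above, here is a minimal Python sketch. It assumes a single middle layer that programming models call into and that hardware backends plug into. Every name in it (DeviceBackend, MiddleLayer, OmpFrontend, CudaFrontend, FakeBackend) is hypothetical and does not come from liboffload or from any proposal in this thread; the fake backend only records calls and never touches a device.<br>

```python
from abc import ABC, abstractmethod


class DeviceBackend(ABC):
    """Hypothetical hardware side of the middle layer (ptx, amd_gpu, phi...)."""

    @abstractmethod
    def load_image(self, image):
        """Load a device image and return an opaque handle."""

    @abstractmethod
    def launch(self, handle, kernel, args):
        """Run a named kernel from a previously loaded image."""


class MiddleLayer:
    """Hypothetical common glue: the only API programming models talk to."""

    def __init__(self):
        self._backends = {}

    def register_backend(self, name, backend):
        # Adding a new hw backend means implementing DeviceBackend
        # and registering it here; nothing above this layer changes.
        self._backends[name] = backend

    def offload(self, target, image, kernel, args):
        backend = self._backends[target]
        handle = backend.load_image(image)
        return backend.launch(handle, kernel, args)


class FakeBackend(DeviceBackend):
    """Stand-in backend that records calls instead of driving hardware."""

    def __init__(self):
        self.images = []
        self.launched = []

    def load_image(self, image):
        self.images.append(image)
        return len(self.images) - 1

    def launch(self, handle, kernel, args):
        self.launched.append((handle, kernel, args))
        return sum(args)  # pretend every kernel is a sum reduction


class OmpFrontend:
    """Hypothetical OpenMP-style model built on top of the middle layer."""

    def __init__(self, middle):
        self._middle = middle

    def target_region(self, device, image, outlined_fn, *args):
        return self._middle.offload(device, image, outlined_fn, args)


class CudaFrontend:
    """Hypothetical CUDA-style model sharing the same middle layer."""

    def __init__(self, middle):
        self._middle = middle

    def launch_kernel(self, device, image, kernel, *args):
        return self._middle.offload(device, image, kernel, args)


if __name__ == "__main__":
    middle = MiddleLayer()
    middle.register_backend("fake", FakeBackend())
    print(OmpFrontend(middle).target_region("fake", b"img", "region0", 1, 2, 3))
```

In this sketch both frontends reach the device only through MiddleLayer.offload, which is the property point (b) above relies on: two models in one application share one runtime rather than two competing ones. A real middle API would also have to cover data mapping, synchronization and error reporting, none of which is modeled here.<br>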
>> >>> >><br>
>> >>> >> The objects which the driver has to deal with may, and probably do,<br>
>> >>> >> overlap to some extent with the objects that liboffload has to load<br>
>> >>> >> or deal with. Is there an API the driver can hook into to magically<br>
>> >>> >> handle that, or is it all per-device and one-off?<br>
>> >>> >> _______________________________________________<br>
>> >>> >> cfe-dev mailing list<br>
>> >>> >> <a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a><br>
>> >>> >> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev</a><br>
</blockquote></div></div></div></div>
<br></blockquote></div><br></div>