<div dir="ltr">Justin,<div><br></div><div>Thank you for your comments. Please take a look at the patch that addresses them:</div><div><br></div><div><a href="https://urldefense.proofpoint.com/v2/url?u=http-3A__reviews.llvm.org_D11273&d=AwMFaQ&c=8hUWFZcy2Z-Za5rBPlktOQ&r=BSqEv9KvKMW_Ob8SyngJ70KdZISM_ASROnREeq0cCxk&m=BPDbwrOJQa88vx-xog69xE2KPTxie1k24KLH8QxgaIU&s=g215Fok8T-djRr2vXDc_8-dyW6-9UdBJy5esKvswqKI&e=">http://reviews.llvm.org/D11273</a><br><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Jul 13, 2015 at 11:39 PM, Justin Bogner <span dir="ltr"><<a href="mailto:mail@justinbogner.com" target="_blank">mail@justinbogner.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div><div>Artem Belevich <<a href="mailto:tra@google.com" target="_blank">tra@google.com</a>> writes:<br>
> Author: tra<br>
> Date: Mon Jul 13 18:27:56 2015<br>
> New Revision: 242085<br>
><br>
> URL: <a href="https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_viewvc_llvm-2Dproject-3Frev-3D242085-26view-3Drev&d=AwMFaQ&c=8hUWFZcy2Z-Za5rBPlktOQ&r=BSqEv9KvKMW_Ob8SyngJ70KdZISM_ASROnREeq0cCxk&m=BPDbwrOJQa88vx-xog69xE2KPTxie1k24KLH8QxgaIU&s=FdcEdCsNU6SJsUB6yiecYfmjJOmsUEeGZof0ckaHKRE&e=" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-project?rev=242085&view=rev</a><br>
> Log:<br>
> [cuda] Driver changes to compile and stitch together host and device-side CUDA code.<br>
><br>
>   NOTE: reverts r242077 to reinstate r242058, r242065, 242067<br>
>         and includes fix for OS X test failures.<br>
><br>
>   - Changed driver pipeline to compile host and device side of CUDA<br>
>     files and incorporate results of device-side compilation into host<br>
>     object file.<br>
><br>
>   - Added a test for cuda pipeline creation in clang driver.<br>
><br>
>   New clang options:<br>
>   --cuda-host-only   - Do host-side compilation only.<br>
>   --cuda-device-only - Do device-side compilation only.<br>
><br>
>   --cuda-gpu-arch=<ARCH> - specify GPU architecture for device-side<br>
>     compilation. E.g. sm_35, sm_30. Default is sm_20. May be used more<br>
>     than once in which case one device-compilation will be done per<br>
>     unique specified GPU architecture.<br>
><br>
>   Differential Revision: <a href="https://urldefense.proofpoint.com/v2/url?u=http-3A__reviews.llvm.org_D9509&d=AwMFaQ&c=8hUWFZcy2Z-Za5rBPlktOQ&r=BSqEv9KvKMW_Ob8SyngJ70KdZISM_ASROnREeq0cCxk&m=BPDbwrOJQa88vx-xog69xE2KPTxie1k24KLH8QxgaIU&s=NhC9dfG36dHnNivHNsDPlEEoTOae1w8yhJfJ8plsiv8&e=" rel="noreferrer" target="_blank">http://reviews.llvm.org/D9509</a><br>> +<br>
> +CudaHostAction::CudaHostAction(std::unique_ptr<Action> Input,<br>
> +                               const ActionList &_DeviceActions)<br>
<br>
</div></div>Identifiers starting with an underscore followed by a capital are reserved.<br>
Please just call this DeviceActions.<br>
<span><br></span></blockquote><div><br></div><div>Done.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span>
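For anyone following along, the rule in question is the C++ reserved-name rule: identifiers beginning with an underscore followed by an uppercase letter belong to the implementation. A minimal stand-alone sketch of the rename — Action, ActionList, and CudaHostAction here are simplified stand-ins, not the actual clang classes:

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Simplified stand-ins for the clang driver classes quoted above.
struct Action {
  virtual ~Action() = default;
};
using ActionList = std::vector<Action *>;

// `_DeviceActions` begins with an underscore followed by an uppercase
// letter, which C++ reserves for the implementation, so the parameter
// is plainly named `DeviceActions` instead.
struct CudaHostAction : Action {
  CudaHostAction(std::unique_ptr<Action> Input, const ActionList &DeviceActions)
      : Input(std::move(Input)), DeviceActions(DeviceActions) {}

  std::unique_ptr<Action> Input;
  ActionList DeviceActions;
};
```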
> +    : Action(CudaHostClass, std::move(Input)), DeviceActions(_DeviceActions) {}<br>
> +<br>
> +CudaHostAction::~CudaHostAction() {<br>
> +  for (iterator it = DeviceActions.begin(), ie = DeviceActions.end(); it != ie;<br>
<br>
</span>We usually name variables starting with a capital letter, so these<br>
should probably be `I` and `E` rather than `it` and `ie`. Might as well<br>
use auto here as well, since they're just iterators. <br></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
This applies in several places later in this patch too. Please clean<br>
them up.<br></blockquote><div><br></div><div><br class="">Fixed here and in other places.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
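The cleaned-up destructor loop looks roughly like this — a sketch with a simplified Action type, not the actual clang code:

```cpp
#include <cassert>
#include <vector>

// Simplified stand-in for the clang Action class.
struct Action {
  int Id;
};
using ActionList = std::vector<Action *>;

// Capitalized iterator names per LLVM style, and `auto` instead of
// spelling out the iterator type.
int deleteAll(ActionList &DeviceActions) {
  int NumDeleted = 0;
  for (auto I = DeviceActions.begin(), E = DeviceActions.end(); I != E; ++I) {
    delete *I;
    ++NumDeleted;
  }
  DeviceActions.clear();
  return NumDeleted;
}
```

A range-based `for (Action *A : DeviceActions) delete A;` would be shorter still wherever the iterator itself is not needed.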
<div><div>> +  // Check whether any of device actions stopped before they could generate PTX.<br>
> +  bool PartialCompilation = false;<br>
> +  bool DeviceOnlyCompilation = Args.hasArg(options::OPT_cuda_device_only);<br>
<br>
</div></div>The ordering here is confusing - it looks like DeviceOnlyCompilation is<br>
related to the loop that sets PartialCompilation.<br>
<div><div><br></div></div></blockquote><div><br></div><div>Moved DeviceOnlyCompilation down to where it's used.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div><div>> +    for (unsigned i = 0, e = GpuArchList.size(); i != e; ++i)<br>
> +      Actions.push_back(<br>
> +          new CudaDeviceAction(std::unique_ptr<Action>(CudaDeviceActions[i]),<br>
> +                               GpuArchList[i], /* AtTopLevel */ true));<br>
> +    // Kill host action in case of device-only compilation.<br>
> +    if (DeviceOnlyCompilation)<br>
> +      Current.reset(nullptr);<br>
> +    return Current;<br>
> +  } else {<br>
<br>
</div></div>Since the `if` returns unconditionally, an early exit is better than an<br>
else.<br>
<div><div><br></div></div></blockquote><div><br></div><div>Done.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div><div>
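The early-return shape, reduced to a toy example (names are illustrative, not the actual driver code):

```cpp
#include <cassert>
#include <memory>
#include <string>

// Once the `if` branch returns unconditionally, the `else` nesting can
// be dropped and the rest of the function un-indented.
std::unique_ptr<std::string> buildHostAction(bool DeviceOnlyCompilation) {
  if (DeviceOnlyCompilation)
    return nullptr; // early exit: no host action in device-only mode

  // The former `else` body now sits at the top level of the function.
  return std::make_unique<std::string>("host action");
}
```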
> +    // Outputs of device actions during complete CUDA compilation get created<br>
> +    // with AtTopLevel=false and become inputs for the host action.<br>
> +    ActionList DeviceActions;<br>
> +    for (unsigned i = 0, e = GpuArchList.size(); i != e; ++i)<br>
> +      DeviceActions.push_back(<br>
> +          new CudaDeviceAction(std::unique_ptr<Action>(CudaDeviceActions[i]),<br>
> +                               GpuArchList[i], /* AtTopLevel */ false));<br>
> +    // Return a new host action that incorporates original host action and all<br>
> +    // device actions.<br>
> +    return std::unique_ptr<Action>(<br>
> +        new CudaHostAction(std::move(Current), DeviceActions));<br>
> +  }<br>
> +}<br>
> +<br>
> void Driver::BuildActions(const ToolChain &TC, DerivedArgList &Args,<br>
>                           const InputList &Inputs, ActionList &Actions) const {<br>
>   llvm::PrettyStackTraceString CrashInfo("Building compilation actions");<br>
> @@ -1312,6 +1412,25 @@ void Driver::BuildActions(const ToolChai<br>
>       continue;<br>
>     }<br>
><br>
> +    phases::ID CudaInjectionPhase;<br>
> +    if (isSaveTempsEnabled()) {<br>
> +      // All phases are done independently, inject GPU blobs during compilation<br>
> +      // phase as that's where we generate glue code to init them.<br>
> +      CudaInjectionPhase = phases::Compile;<br>
> +    } else {<br>
> +      // Assumes that clang does everything up until linking phase, so we inject<br>
> +      // cuda device actions at the last step before linking. Otherwise CUDA<br>
> +      // host action forces preprocessor into a separate invocation.<br>
> +      if (FinalPhase == phases::Link) {<br>
<br>
</div></div>This is `else if`. Nesting the if inside the else is confusing.<br>
<span><br>
> +        for (auto i = PL.begin(), e = PL.end(); i != e; ++i) {<br>
> +          auto next = i + 1;<br>
> +          if (next != e && *next == phases::Link)<br>
<br>
</span>What if i == e?<br>
<span><br></span></blockquote><div> </div><div>Then CudaInjectionPhase would be left uninitialized, which is a bug. Fixed, which also moots the 'else if' point you made above.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span>
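The fix, sketched as a small free function with a hypothetical phases enum (not the real clang one): initializing CudaInjectionPhase to FinalPhase up front means it can never be read uninitialized, and the `i == e` case simply falls back to the default.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for clang's phases::ID.
enum ID { Preprocess, Compile, Backend, Assemble, Link };

// Initializing CudaInjectionPhase to FinalPhase up front guarantees it
// is never read uninitialized: if the loop finds no phase immediately
// before Link (including an empty phase list), the default stands.
ID findCudaInjectionPhase(const std::vector<ID> &PL, ID FinalPhase) {
  ID CudaInjectionPhase = FinalPhase;
  if (FinalPhase == Link)
    for (auto I = PL.begin(), E = PL.end(); I != E; ++I) {
      auto Next = I + 1;
      if (Next != E && *Next == Link)
        CudaInjectionPhase = *I;
    }
  return CudaInjectionPhase;
}
```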
> +            CudaInjectionPhase = *i;<br>
> +        }<br>
> +      } else<br>
> +        CudaInjectionPhase = FinalPhase;<br>
> +    }<br>
> +<br>
>     // Build the pipeline for this file.<br>
>     std::unique_ptr<Action> Current(new InputAction(*InputArg, InputType));<br>
>     for (SmallVectorImpl<phases::ID>::iterator i = PL.begin(), e = PL.end();<br>
> @@ -1337,6 +1456,15 @@ void Driver::BuildActions(const ToolChai<br>
><br>
>       // Otherwise construct the appropriate action.<br>
>       Current = ConstructPhaseAction(TC, Args, Phase, std::move(Current));<br>
> +<br>
> +      if (InputType == types::TY_CUDA && Phase == CudaInjectionPhase &&<br>
> +          !Args.hasArg(options::OPT_cuda_host_only)) {<br>
> +        Current = buildCudaActions(*this, TC, Args, InputArg, InputType,<br>
> +                                   std::move(Current), Actions);<br>
> +        if (!Current)<br>
> +          break;<br>
> +      }<br>
<br>
</span>This whole block of code seems out of place. In fact, why doesn't CUDA<br>
just have its own phases::ID? It seems like that would simplify all of<br>
this.<br>
<span><br></span></blockquote><div><br></div><div>It does not quite fit the 'phase' concept. Phases implicitly assume that things are done linearly. CUDA, on the other hand, effectively builds a tree and the weird code above is a somewhat broken way to glue results of device-side builds into just the right action of the host build pipeline, so that results can be used to modify compiler arguments.</div><div><br></div><div>I've got a better way of dealing with that that I'll send as a separate patch for review soon.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span>
> +<br>
>       if (Current->getType() == types::TY_Nothing)<br>
>         break;<br>
>     }<br>
> @@ -1576,7 +1704,13 @@ static const Tool *SelectToolForJob(Comp<br>
>   if (isa<BackendJobAction>(JA)) {<br>
>     // Check if the compiler supports emitting LLVM IR.<br>
>     assert(Inputs->size() == 1);<br>
> -    JobAction *CompileJA = cast<CompileJobAction>(*Inputs->begin());<br>
> +    JobAction *CompileJA;<br>
> +    // Extract real host action, if it's a CudaHostAction.<br>
> +    if (CudaHostAction *CudaHA = dyn_cast<CudaHostAction>(*Inputs->begin()))<br>
> +      CompileJA = cast<CompileJobAction>(*CudaHA->begin());<br>
> +    else<br>
> +      CompileJA = cast<CompileJobAction>(*Inputs->begin());<br>
<br>
</span>This seems kind of awkward. Not sure I have a better suggestion though.<br>
<div><div><br></div></div></blockquote><div><br></div><div>This will also be changed in the upcoming patch.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div><div>>   if (const InputAction *IA = dyn_cast<InputAction>(A)) {<br>
>     // FIXME: It would be nice to not claim this here; maybe the old scheme of<br>
>     // just using Args was better?<br>
> @@ -1635,11 +1783,24 @@ void Driver::BuildJobsForAction(Compilat<br>
>     else<br>
>       TC = &C.getDefaultToolChain();<br>
><br>
> -    BuildJobsForAction(C, *BAA->begin(), TC, BAA->getArchName(), AtTopLevel,<br>
> +    BuildJobsForAction(C, *BAA->begin(), TC, ArchName, AtTopLevel,<br>
<br>
</div></div>I guess this is just an unrelated cleanup? Better to commit those<br>
separately in the future.<br>
<span><br></span></blockquote><div><br></div><div>Noted.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span>
>                        MultipleArchs, LinkingOutput, Result);<br>
>     return;<br>
>   }<br>
><br>
> +  if (const CudaDeviceAction *CDA = dyn_cast<CudaDeviceAction>(A)) {<br>
> +    // Figure out which NVPTX triple to use for device-side compilation based on<br>
> +    // whether host is 64-bit.<br>
> +    llvm::Triple DeviceTriple(C.getDefaultToolChain().getTriple().isArch64Bit()<br>
> +                                  ? "nvptx64-nvidia-cuda"<br>
> +                                  : "nvptx-nvidia-cuda");<br>
<br>
</span>It seems like a pretty bad layering violation to need to know these<br>
particular triples right here...<br>
<br></blockquote><div><br></div><div>Well, it's up to the driver to figure out the device-side triple. While I agree that it's way too much target knowledge for the driver, I'm short on practical ideas. Do you have any suggestions?</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Is getDefaultToolChain() even the right thing to do here? Isn't `TC`<br>
more appropriate?</blockquote><div><br></div><div>Good point. TC should indeed do.</div><div> </div></div><div><br></div>-- <br><div><div dir="ltr">--Artem Belevich</div></div>
</div></div></div>