<div dir="ltr">Justin,<div><br></div><div>Thank you for your comments. Please take a look at the patch that addresses them:</div><div><br></div><div><a href="https://urldefense.proofpoint.com/v2/url?u=http-3A__reviews.llvm.org_D11273&d=AwMFaQ&c=8hUWFZcy2Z-Za5rBPlktOQ&r=BSqEv9KvKMW_Ob8SyngJ70KdZISM_ASROnREeq0cCxk&m=BPDbwrOJQa88vx-xog69xE2KPTxie1k24KLH8QxgaIU&s=g215Fok8T-djRr2vXDc_8-dyW6-9UdBJy5esKvswqKI&e=">http://reviews.llvm.org/D11273</a><br><div class="gmail_extra"><br><div class="gmail_quote">On Mon, Jul 13, 2015 at 11:39 PM, Justin Bogner <span dir="ltr"><<a href="mailto:mail@justinbogner.com" target="_blank">mail@justinbogner.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div><div>Artem Belevich <<a href="mailto:tra@google.com" target="_blank">tra@google.com</a>> writes:<br>
> Author: tra<br>
> Date: Mon Jul 13 18:27:56 2015<br>
> New Revision: 242085<br>
><br>
> URL: <a href="https://urldefense.proofpoint.com/v2/url?u=http-3A__llvm.org_viewvc_llvm-2Dproject-3Frev-3D242085-26view-3Drev&d=AwMFaQ&c=8hUWFZcy2Z-Za5rBPlktOQ&r=BSqEv9KvKMW_Ob8SyngJ70KdZISM_ASROnREeq0cCxk&m=BPDbwrOJQa88vx-xog69xE2KPTxie1k24KLH8QxgaIU&s=FdcEdCsNU6SJsUB6yiecYfmjJOmsUEeGZof0ckaHKRE&e=" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-project?rev=242085&view=rev</a><br>
> Log:<br>
> [cuda] Driver changes to compile and stitch together host and device-side CUDA code.<br>
><br>
>   NOTE: reverts r242077 to reinstate r242058, r242065, 242067<br>
>         and includes fix for OS X test failures.<br>
><br>
>   - Changed driver pipeline to compile host and device side of CUDA<br>
>     files and incorporate results of device-side compilation into host<br>
>     object file.<br>
><br>
>   - Added a test for cuda pipeline creation in clang driver.<br>
><br>
>   New clang options:<br>
>   --cuda-host-only   - Do host-side compilation only.<br>
>   --cuda-device-only - Do device-side compilation only.<br>
><br>
>   --cuda-gpu-arch=<ARCH> - specify GPU architecture for device-side<br>
>     compilation. E.g. sm_35, sm_30. Default is sm_20. May be used more<br>
>     than once in which case one device-compilation will be done per<br>
>     unique specified GPU architecture.<br>
><br>
>   Differential Revision: <a href="https://urldefense.proofpoint.com/v2/url?u=http-3A__reviews.llvm.org_D9509&d=AwMFaQ&c=8hUWFZcy2Z-Za5rBPlktOQ&r=BSqEv9KvKMW_Ob8SyngJ70KdZISM_ASROnREeq0cCxk&m=BPDbwrOJQa88vx-xog69xE2KPTxie1k24KLH8QxgaIU&s=NhC9dfG36dHnNivHNsDPlEEoTOae1w8yhJfJ8plsiv8&e=" rel="noreferrer" target="_blank">http://reviews.llvm.org/D9509</a><br>> +<br>
> +CudaHostAction::CudaHostAction(std::unique_ptr<Action> Input,<br>
> +                               const ActionList &_DeviceActions)<br>
<br>
</div></div>Identifiers starting with an underscore followed by a capital are reserved.<br>
Please just call this DeviceActions.<br>
<span><br></span></blockquote><div><br></div><div>Done.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span>
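For anyone following along, the rule in question is the C++ reserved-name rule: identifiers beginning with an underscore followed by an uppercase letter belong to the implementation. A minimal stand-alone sketch of the rename — Action, ActionList, and CudaHostAction here are simplified stand-ins, not the actual clang classes:

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Simplified stand-ins for the clang driver classes quoted above.
struct Action {
  virtual ~Action() = default;
};
using ActionList = std::vector<Action *>;

// `_DeviceActions` begins with an underscore followed by an uppercase
// letter, which C++ reserves for the implementation, so the parameter
// is plainly named `DeviceActions` instead.
struct CudaHostAction : Action {
  CudaHostAction(std::unique_ptr<Action> Input, const ActionList &DeviceActions)
      : Input(std::move(Input)), DeviceActions(DeviceActions) {}

  std::unique_ptr<Action> Input;
  ActionList DeviceActions;
};
```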
> +    : Action(CudaHostClass, std::move(Input)), DeviceActions(_DeviceActions) {}<br>
> +<br>
> +CudaHostAction::~CudaHostAction() {<br>
> +  for (iterator it = DeviceActions.begin(), ie = DeviceActions.end(); it != ie;<br>
<br>
</span>We usually name variables starting with a capital letter, so these<br>
should probably be `I` and `E` rather than `it` and `ie`. Might as well<br>
use auto here as well, since they're just iterators. <br></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
This applies in several places later in this patch too. Please clean<br>
them up.<br></blockquote><div><br></div><div><br class="">Fixed here and in other places.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
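The cleaned-up destructor loop looks roughly like this — a sketch with a simplified Action type, not the actual clang code:

```cpp
#include <cassert>
#include <vector>

// Simplified stand-in for the clang Action class.
struct Action {
  int Id;
};
using ActionList = std::vector<Action *>;

// Capitalized iterator names per LLVM style, and `auto` instead of
// spelling out the iterator type.
int deleteAll(ActionList &DeviceActions) {
  int NumDeleted = 0;
  for (auto I = DeviceActions.begin(), E = DeviceActions.end(); I != E; ++I) {
    delete *I;
    ++NumDeleted;
  }
  DeviceActions.clear();
  return NumDeleted;
}
```

A range-based `for (Action *A : DeviceActions) delete A;` would be shorter still wherever the iterator itself is not needed.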
<div><div>> +  // Check whether any of device actions stopped before they could generate PTX.<br>
> +  bool PartialCompilation = false;<br>
> +  bool DeviceOnlyCompilation = Args.hasArg(options::OPT_cuda_device_only);<br>
<br>
</div></div>The ordering here is confusing - it looks like DeviceOnlyCompilation is<br>
related to the loop that sets PartialCompilation.<br>
<div><div><br></div></div></blockquote><div><br></div><div>Moved DeviceOnlyCompilation down to where it's used.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div><div>> +    for (unsigned i = 0, e = GpuArchList.size(); i != e; ++i)<br>
> +      Actions.push_back(<br>
> +          new CudaDeviceAction(std::unique_ptr<Action>(CudaDeviceActions[i]),<br>
> +                               GpuArchList[i], /* AtTopLevel */ true));<br>
> +    // Kill host action in case of device-only compilation.<br>
> +    if (DeviceOnlyCompilation)<br>
> +      Current.reset(nullptr);<br>
> +    return Current;<br>
> +  } else {<br>
<br>
</div></div>Since the `if` returns unconditionally, an early exit is better than an<br>
else.<br>
<div><div><br></div></div></blockquote><div><br></div><div>Done.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div><div>
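The early-return shape, reduced to a toy example (names are illustrative, not the actual driver code):

```cpp
#include <cassert>
#include <memory>
#include <string>

// Once the `if` branch returns unconditionally, the `else` nesting can
// be dropped and the rest of the function un-indented.
std::unique_ptr<std::string> buildHostAction(bool DeviceOnlyCompilation) {
  if (DeviceOnlyCompilation)
    return nullptr; // early exit: no host action in device-only mode

  // The former `else` body now sits at the top level of the function.
  return std::make_unique<std::string>("host action");
}
```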
> +    // Outputs of device actions during complete CUDA compilation get created<br>
> +    // with AtTopLevel=false and become inputs for the host action.<br>
> +    ActionList DeviceActions;<br>
> +    for (unsigned i = 0, e = GpuArchList.size(); i != e; ++i)<br>
> +      DeviceActions.push_back(<br>
> +          new CudaDeviceAction(std::unique_ptr<Action>(CudaDeviceActions[i]),<br>
> +                               GpuArchList[i], /* AtTopLevel */ false));<br>
> +    // Return a new host action that incorporates original host action and all<br>
> +    // device actions.<br>
> +    return std::unique_ptr<Action>(<br>
> +        new CudaHostAction(std::move(Current), DeviceActions));<br>
> +  }<br>
> +}<br>
> +<br>
> void Driver::BuildActions(const ToolChain &TC, DerivedArgList &Args,<br>
>                           const InputList &Inputs, ActionList &Actions) const {<br>
>   llvm::PrettyStackTraceString CrashInfo("Building compilation actions");<br>
> @@ -1312,6 +1412,25 @@ void Driver::BuildActions(const ToolChai<br>
>       continue;<br>
>     }<br>
><br>
> +    phases::ID CudaInjectionPhase;<br>
> +    if (isSaveTempsEnabled()) {<br>
> +      // All phases are done independently, inject GPU blobs during compilation<br>
> +      // phase as that's where we generate glue code to init them.<br>
> +      CudaInjectionPhase = phases::Compile;<br>
> +    } else {<br>
> +      // Assumes that clang does everything up until linking phase, so we inject<br>
> +      // cuda device actions at the last step before linking. Otherwise CUDA<br>
> +      // host action forces preprocessor into a separate invocation.<br>
> +      if (FinalPhase == phases::Link) {<br>
<br>
</div></div>This is `else if`. Nesting the if inside the else is confusing.<br>
<span><br>
> +        for (auto i = PL.begin(), e = PL.end(); i != e; ++i) {<br>
> +          auto next = i + 1;<br>
> +          if (next != e && *next == phases::Link)<br>
<br>
</span>What if i == e?<br>
<span><br></span></blockquote><div> </div><div>Then CudaInjectionPhase would be left uninitialized, which is a bug. Fixed, which also moots the 'else if' point you made above.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span>
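The fix, sketched as a small free function with a hypothetical phases enum (not the real clang one): initializing CudaInjectionPhase to FinalPhase up front means it can never be read uninitialized, and the `i == e` case simply falls back to the default.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for clang's phases::ID.
enum ID { Preprocess, Compile, Backend, Assemble, Link };

// Initializing CudaInjectionPhase to FinalPhase up front guarantees it
// is never read uninitialized: if the loop finds no phase immediately
// before Link (including an empty phase list), the default stands.
ID findCudaInjectionPhase(const std::vector<ID> &PL, ID FinalPhase) {
  ID CudaInjectionPhase = FinalPhase;
  if (FinalPhase == Link)
    for (auto I = PL.begin(), E = PL.end(); I != E; ++I) {
      auto Next = I + 1;
      if (Next != E && *Next == Link)
        CudaInjectionPhase = *I;
    }
  return CudaInjectionPhase;
}
```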
> +            CudaInjectionPhase = *i;<br>
> +        }<br>
> +      } else<br>
> +        CudaInjectionPhase = FinalPhase;<br>
> +    }<br>
> +<br>
>     // Build the pipeline for this file.<br>
>     std::unique_ptr<Action> Current(new InputAction(*InputArg, InputType));<br>
>     for (SmallVectorImpl<phases::ID>::iterator i = PL.begin(), e = PL.end();<br>
> @@ -1337,6 +1456,15 @@ void Driver::BuildActions(const ToolChai<br>
><br>
>       // Otherwise construct the appropriate action.<br>
>       Current = ConstructPhaseAction(TC, Args, Phase, std::move(Current));<br>
> +<br>
> +      if (InputType == types::TY_CUDA && Phase == CudaInjectionPhase &&<br>
> +          !Args.hasArg(options::OPT_cuda_host_only)) {<br>
> +        Current = buildCudaActions(*this, TC, Args, InputArg, InputType,<br>
> +                                   std::move(Current), Actions);<br>
> +        if (!Current)<br>
> +          break;<br>
> +      }<br>
<br>
</span>This whole block of code seems out of place. In fact, why doesn't CUDA<br>
just have its own phases::ID? It seems like that would simplify all of<br>
this.<br>
<span><br></span></blockquote><div><br></div><div>It does not quite fit the 'phase' concept. Phases implicitly assume that things are done linearly. CUDA, on the other hand, effectively builds a tree and the weird code above is a somewhat broken way to glue results of device-side builds into just the right action of the host build pipeline, so that results can be used to modify compiler arguments.</div><div><br></div><div>I've got a better way of dealing with that that I'll send as a separate patch for review soon.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span>
> +<br>
>       if (Current->getType() == types::TY_Nothing)<br>
>         break;<br>
>     }<br>
> @@ -1576,7 +1704,13 @@ static const Tool *SelectToolForJob(Comp<br>
>   if (isa<BackendJobAction>(JA)) {<br>
>     // Check if the compiler supports emitting LLVM IR.<br>
>     assert(Inputs->size() == 1);<br>
> -    JobAction *CompileJA = cast<CompileJobAction>(*Inputs->begin());<br>
> +    JobAction *CompileJA;<br>
> +    // Extract real host action, if it's a CudaHostAction.<br>
> +    if (CudaHostAction *CudaHA = dyn_cast<CudaHostAction>(*Inputs->begin()))<br>
> +      CompileJA = cast<CompileJobAction>(*CudaHA->begin());<br>
> +    else<br>
> +      CompileJA = cast<CompileJobAction>(*Inputs->begin());<br>
<br>
</span>This seems kind of awkward. Not sure I have a better suggestion though.<br>
<div><div><br></div></div></blockquote><div><br></div><div>This will also be changed in the upcoming patch.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div><div>>   if (const InputAction *IA = dyn_cast<InputAction>(A)) {<br>
>     // FIXME: It would be nice to not claim this here; maybe the old scheme of<br>
>     // just using Args was better?<br>
> @@ -1635,11 +1783,24 @@ void Driver::BuildJobsForAction(Compilat<br>
>     else<br>
>       TC = &C.getDefaultToolChain();<br>
><br>
> -    BuildJobsForAction(C, *BAA->begin(), TC, BAA->getArchName(), AtTopLevel,<br>
> +    BuildJobsForAction(C, *BAA->begin(), TC, ArchName, AtTopLevel,<br>
<br>
</div></div>I guess this is just an unrelated cleanup? Better to commit those<br>
separately in the future.<br>
<span><br></span></blockquote><div><br></div><div>Noted.</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span>
>                        MultipleArchs, LinkingOutput, Result);<br>
>     return;<br>
>   }<br>
><br>
> +  if (const CudaDeviceAction *CDA = dyn_cast<CudaDeviceAction>(A)) {<br>
> +    // Figure out which NVPTX triple to use for device-side compilation based on<br>
> +    // whether host is 64-bit.<br>
> +    llvm::Triple DeviceTriple(C.getDefaultToolChain().getTriple().isArch64Bit()<br>
> +                                  ? "nvptx64-nvidia-cuda"<br>
> +                                  : "nvptx-nvidia-cuda");<br>
<br>
</span>It seems like a pretty bad layering violation to need to know these<br>
particular triples right here...<br>
<br></blockquote><div><br></div><div>Well, it's up to the driver to figure out the device-side triple. While I agree that it's way too much target knowledge for the driver, I'm short on practical ideas. Do you have any suggestions?</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">Is getDefaultToolChain() even the right thing to do here? Isn't `TC`<br>
more appropriate?</blockquote><div><br></div><div>Good point. TC should indeed do.</div><div> </div></div><div><br></div>-- <br><div><div dir="ltr">--Artem Belevich</div></div>
</div></div></div>