r242085 - [cuda] Driver changes to compile and stitch together host and device-side CUDA code.

Justin Bogner <mail at justinbogner.com>
Thu Jul 16 15:28:24 PDT 2015


Artem Belevich <tra at google.com> writes:
> Justin,
>
> Thank you for your comments. Please take a look at the patch that addresses
> them:
>
> http://reviews.llvm.org/D11273

Thanks.

> On Mon, Jul 13, 2015 at 11:39 PM, Justin Bogner <mail at justinbogner.com> wrote:
>
>     Artem Belevich <tra at google.com> writes:
>     > Author: tra
>     > Date: Mon Jul 13 18:27:56 2015
>     > New Revision: 242085
>     >
>     > URL: http://llvm.org/viewvc/llvm-project?rev=242085&view=rev
>     > Log:
>     > [cuda] Driver changes to compile and stitch together host and device-side CUDA code.
>     >
>     >   NOTE: reverts r242077 to reinstate r242058, r242065, 242067
>     >         and includes fix for OS X test failures.
>     >
>     >   - Changed driver pipeline to compile host and device side of CUDA
>     >     files and incorporate results of device-side compilation into host
>     >     object file.
>     >
>     >   - Added a test for cuda pipeline creation in clang driver.
>     >
>     >   New clang options:
>     >   --cuda-host-only   - Do host-side compilation only.
>     >   --cuda-device-only - Do device-side compilation only.
>     >
>     >   --cuda-gpu-arch=<ARCH> - specify GPU architecture for device-side
>     >     compilation. E.g. sm_35, sm_30. Default is sm_20. May be used more
>     >     than once in which case one device-compilation will be done per
>     >     unique specified GPU architecture.
>     >
>     >   Differential Revision: http://reviews.llvm.org/D9509
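
For what it's worth, my reading of the new options is that an invocation
along the lines of `clang -c --cuda-gpu-arch=sm_30 --cuda-gpu-arch=sm_35 foo.cu`
(the architectures and file name are just made-up examples) should run one
device-side compilation per listed architecture and fold all of the results
into the single host object.
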
>     > +
>     > +CudaHostAction::CudaHostAction(std::unique_ptr<Action> Input,
>     > +                               const ActionList &_DeviceActions)
>    
>     Identifiers starting with an underscore followed by a capital are reserved.
>     Please just call this DeviceActions.
>
> Done.
>
>  
>
>     > +    : Action(CudaHostClass, std::move(Input)), DeviceActions(_DeviceActions) {}
>     > +
>     > +CudaHostAction::~CudaHostAction() {
>     > +  for (iterator it = DeviceActions.begin(), ie = DeviceActions.end(); it != ie;
>    
>     We usually name variables starting with a capital letter, so these
>     should probably be `I` and `E` rather than `it` and `ie`. Might as well
>     use auto here as well, since they're just iterators. 
>
>     This applies in several places later in this patch too. Please clean
>     them up.
>
> Fixed here and in other places.
>   
>
>     > +  // Check whether any of device actions stopped before they could generate PTX.
>     > +  bool PartialCompilation = false;
>     > +  bool DeviceOnlyCompilation = Args.hasArg(options::OPT_cuda_device_only);
>    
>     The ordering here is confusing - it looks like DeviceOnlyCompilation is
>     related to the loop that sets PartialCompilation.
>
> Moved DeviceOnlyCompilation down to where it's used.
>  
>
>     > +    for (unsigned i = 0, e = GpuArchList.size(); i != e; ++i)
>     > +      Actions.push_back(
>     > +          new CudaDeviceAction(std::unique_ptr<Action>(CudaDeviceActions[i]),
>     > +                               GpuArchList[i], /* AtTopLevel */ true));
>     > +    // Kill host action in case of device-only compilation.
>     > +    if (DeviceOnlyCompilation)
>     > +      Current.reset(nullptr);
>     > +    return Current;
>     > +  } else {
>    
>     Since the `if` returns unconditionally, an early exit is better than an
>     else.
>
> Done.
>
>  
>
>     > +    // Outputs of device actions during complete CUDA compilation get created
>     > +    // with AtTopLevel=false and become inputs for the host action.
>     > +    ActionList DeviceActions;
>     > +    for (unsigned i = 0, e = GpuArchList.size(); i != e; ++i)
>     > +      DeviceActions.push_back(
>     > +          new CudaDeviceAction(std::unique_ptr<Action>(CudaDeviceActions[i]),
>     > +                               GpuArchList[i], /* AtTopLevel */ false));
>     > +    // Return a new host action that incorporates original host action and all
>     > +    // device actions.
>     > +    return std::unique_ptr<Action>(
>     > +        new CudaHostAction(std::move(Current), DeviceActions));
>     > +  }
>     > +}
>     > +
>     >  void Driver::BuildActions(const ToolChain &TC, DerivedArgList &Args,
>     >                            const InputList &Inputs, ActionList &Actions) const {
>     >    llvm::PrettyStackTraceString CrashInfo("Building compilation actions");
>     > @@ -1312,6 +1412,25 @@ void Driver::BuildActions(const ToolChai
>     >        continue;
>     >      }
>     >
>     > +    phases::ID CudaInjectionPhase;
>     > +    if (isSaveTempsEnabled()) {
>     > +      // All phases are done independently, inject GPU blobs during compilation
>     > +      // phase as that's where we generate glue code to init them.
>     > +      CudaInjectionPhase = phases::Compile;
>     > +    } else {
>     > +      // Assumes that clang does everything up until linking phase, so we inject
>     > +      // cuda device actions at the last step before linking. Otherwise CUDA
>     > +      // host action forces preprocessor into a separate invocation.
>     > +      if (FinalPhase == phases::Link) {
>    
>     This is `else if`. Nesting the if inside the else is confusing.
>    
>     > +        for (auto i = PL.begin(), e = PL.end(); i != e; ++i) {
>     > +          auto next = i + 1;
>     > +          if (next != e && *next == phases::Link)
>    
>     What if i == e?
>
>  
> Then CudaInjectionPhase would be left uninitialized, which is a bug. Fixed,
> which also makes the 'else if' point you made above moot.
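
Sounds good. Just so we're picturing the same thing: I'd expect the fix to
amount to giving CudaInjectionPhase a sensible default and only overriding it
in the special cases, roughly like the sketch below (this is only what I have
in mind, not necessarily what your updated patch does):

      phases::ID CudaInjectionPhase = FinalPhase;
      if (isSaveTempsEnabled()) {
        // With -save-temps every phase is a separate invocation, so inject
        // during the compile phase where the glue code is generated.
        CudaInjectionPhase = phases::Compile;
      } else if (FinalPhase == phases::Link) {
        // Inject at the last phase before linking, if there is one.
        for (auto PID : PL)
          if (PID != phases::Link)
            CudaInjectionPhase = PID;
      }
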
>
>  
>
>     > +            CudaInjectionPhase = *i;
>     > +        }
>     > +      } else
>     > +        CudaInjectionPhase = FinalPhase;
>     > +    }
>     > +
>     >      // Build the pipeline for this file.
>     >      std::unique_ptr<Action> Current(new InputAction(*InputArg, InputType));
>     >      for (SmallVectorImpl<phases::ID>::iterator i = PL.begin(), e = PL.end();
>     > @@ -1337,6 +1456,15 @@ void Driver::BuildActions(const ToolChai
>     >
>     >        // Otherwise construct the appropriate action.
>     >        Current = ConstructPhaseAction(TC, Args, Phase, std::move(Current));
>     > +
>     > +      if (InputType == types::TY_CUDA && Phase == CudaInjectionPhase &&
>     > +          !Args.hasArg(options::OPT_cuda_host_only)) {
>     > +        Current = buildCudaActions(*this, TC, Args, InputArg, InputType,
>     > +                                   std::move(Current), Actions);
>     > +        if (!Current)
>     > +          break;
>     > +      }
>    
>     This whole block of code seems out of place. In fact, why doesn't CUDA
>     just have its own phases::ID? It seems like that would simplify all of
>     this.
>
> It does not quite fit the 'phase' concept. Phases implicitly assume that
> things are done linearly. CUDA, on the other hand, effectively builds a tree,
> and the weird code above is a somewhat broken way to glue the results of
> device-side builds into just the right action of the host build pipeline, so
> that those results can be used to modify compiler arguments.

Well, I wouldn't say they assume things are done linearly - the phases
are already used to build trees today. The link phase, for example, will
depend on multiple different compile branches.

The difference here is just that we want multiple branches for the same
input file (which means we end up with a DAG instead of a tree, I
suppose). I guess we don't have much infrastructure to support this more
cleanly right now - I see that BindArchActions are already handled in a
similarly awkward way.
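
Just to make the shape concrete, for an input compiled with two
--cuda-gpu-arch values my mental picture of the graph the patch builds
(file name and archs made up, preprocessing steps omitted) is roughly:

  InputAction (foo.cu)
    +--> host compile ------------+
    +--> CudaDeviceAction (sm_20) +--> CudaHostAction --> ...
    +--> CudaDeviceAction (sm_35) +

i.e. several branches from the same input that rejoin at the CudaHostAction.
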

> I've got a better way of dealing with that that I'll send as a separate patch
> for review soon.

Looking forward to it.

>     > +
>     >        if (Current->getType() == types::TY_Nothing)
>     >          break;
>     >      }
>     > @@ -1576,7 +1704,13 @@ static const Tool *SelectToolForJob(Comp
>     >    if (isa<BackendJobAction>(JA)) {
>     >      // Check if the compiler supports emitting LLVM IR.
>     >      assert(Inputs->size() == 1);
>     > -    JobAction *CompileJA = cast<CompileJobAction>(*Inputs->begin());
>     > +    JobAction *CompileJA;
>     > +    // Extract real host action, if it's a CudaHostAction.
>     > +    if (CudaHostAction *CudaHA = dyn_cast<CudaHostAction>(*Inputs->begin()))
>     > +      CompileJA = cast<CompileJobAction>(*CudaHA->begin());
>     > +    else
>     > +      CompileJA = cast<CompileJobAction>(*Inputs->begin());
>    
>     This seems kind of awkward. Not sure I have a better suggestion though.
>
> This will also be changed in the upcoming patch.
>
>  
>
>     >    if (const InputAction *IA = dyn_cast<InputAction>(A)) {
>     >      // FIXME: It would be nice to not claim this here; maybe the old scheme of
>     >      // just using Args was better?
>     > @@ -1635,11 +1783,24 @@ void Driver::BuildJobsForAction(Compilat
>     >      else
>     >        TC = &C.getDefaultToolChain();
>     >
>     > -    BuildJobsForAction(C, *BAA->begin(), TC, BAA->getArchName(), AtTopLevel,
>     > +    BuildJobsForAction(C, *BAA->begin(), TC, ArchName, AtTopLevel,
>    
>     I guess this is just an unrelated cleanup? Better to commit those
>     separately in the future.
>
> Noted.
>
>  
>
>     >                         MultipleArchs, LinkingOutput, Result);
>     >      return;
>     >    }
>     >
>     > +  if (const CudaDeviceAction *CDA = dyn_cast<CudaDeviceAction>(A)) {
>     > +    // Figure out which NVPTX triple to use for device-side compilation based on
>     > +    // whether host is 64-bit.
>     > +    llvm::Triple DeviceTriple(C.getDefaultToolChain().getTriple().isArch64Bit()
>     > +                                  ? "nvptx64-nvidia-cuda"
>     > +                                  : "nvptx-nvidia-cuda");
>    
>     It seems like a pretty bad layering violation to need to know these
>     particular triples right here...
>
> Well, it's up to the driver to figure out the device-side triple. While I
> agree that it's way too much target knowledge for the driver, I'm short on
> practical ideas. Do you have any suggestions?

I guess I might expect this to be figured out in buildCudaActions, near
where we figure out the device architectures. I suppose this triple is
based more on the host-side target, though.

This is already a little bit broken, but ideally BuildJobsForAction
would just do the tree traversal, with the details of how to do that
abstracted elsewhere. I guess we just don't have a very good story
about how to deal with multiple toolchains in the same build pipeline
today.
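
Concretely, what I was picturing is a tiny helper next to buildCudaActions,
something like the sketch below (DeviceTripleFor is a made-up name; this is
just to show where the NVPTX knowledge could live, not the exact code I'd
want):

  // Sketch: derive the device-side triple from the host toolchain's triple,
  // so BuildJobsForAction doesn't need to know about NVPTX at all.
  static llvm::Triple DeviceTripleFor(const llvm::Triple &HostTriple) {
    return llvm::Triple(HostTriple.isArch64Bit() ? "nvptx64-nvidia-cuda"
                                                 : "nvptx-nvidia-cuda");
  }
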

>
>     Is getDefaultToolChain() even the right thing to do here? Isn't `TC`
>     more appropriate?
>
> Good point. TC indeed should do.
>  
>
> --
> --Artem Belevich



