r242085 - [cuda] Driver changes to compile and stitch together host and device-side CUDA code.

Artem Belevich tra at google.com
Thu Jul 16 11:31:25 PDT 2015


Justin,

Thank you for your comments. Please take a look at the patch that addresses
them:

http://reviews.llvm.org/D11273

On Mon, Jul 13, 2015 at 11:39 PM, Justin Bogner <mail at justinbogner.com>
wrote:

> Artem Belevich <tra at google.com> writes:
> > Author: tra
> > Date: Mon Jul 13 18:27:56 2015
> > New Revision: 242085
> >
> > URL: http://llvm.org/viewvc/llvm-project?rev=242085&view=rev
> > Log:
> > [cuda] Driver changes to compile and stitch together host and device-side CUDA code.
> >
> >   NOTE: reverts r242077 to reinstate r242058, r242065, 242067
> >         and includes fix for OS X test failures.
> >
> >   - Changed driver pipeline to compile host and device side of CUDA
> >     files and incorporate results of device-side compilation into host
> >     object file.
> >
> >   - Added a test for cuda pipeline creation in clang driver.
> >
> >   New clang options:
> >   --cuda-host-only   - Do host-side compilation only.
> >   --cuda-device-only - Do device-side compilation only.
> >
> >   --cuda-gpu-arch=<ARCH> - specify GPU architecture for device-side
> >     compilation. E.g. sm_35, sm_30. Default is sm_20. May be used more
> >     than once in which case one device-compilation will be done per
> >     unique specified GPU architecture.
> >
> >   Differential Revision: http://reviews.llvm.org/D9509
> > +
> > +CudaHostAction::CudaHostAction(std::unique_ptr<Action> Input,
> > +                               const ActionList &_DeviceActions)
>
> Identifiers starting with an underscore followed by a capital are reserved.
> Please just call this DeviceActions.
>
>
Done.
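
For the record, the constructor now reads roughly like this (a sketch; the
exact code is in D11273):

  // Renamed: in C++, identifiers starting with an underscore followed by an
  // uppercase letter are reserved for the implementation.
  CudaHostAction::CudaHostAction(std::unique_ptr<Action> Input,
                                 const ActionList &DeviceActions)
      : Action(CudaHostClass, std::move(Input)),
        DeviceActions(DeviceActions) {}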



> > +    : Action(CudaHostClass, std::move(Input)), DeviceActions(_DeviceActions) {}
> > +
> > +CudaHostAction::~CudaHostAction() {
> > +  for (iterator it = DeviceActions.begin(), ie = DeviceActions.end(); it != ie;
>
> We usually name variables starting with a capital letter, so these
> should probably be `I` and `E` rather than `it` and `ie`. Might as well
> use auto here as well, since they're just iterators.
>
> This applies in several places later in this patch too. Please clean
> them up.
>


Fixed here and in other places.
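
The destructor now looks something like this (a sketch, assuming the loop
body just deletes each owned device action):

  CudaHostAction::~CudaHostAction() {
    for (auto I = DeviceActions.begin(), E = DeviceActions.end(); I != E; ++I)
      delete *I;
  }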


> > +  // Check whether any of device actions stopped before they could generate PTX.
> > +  bool PartialCompilation = false;
> > +  bool DeviceOnlyCompilation = Args.hasArg(options::OPT_cuda_device_only);
>
> The ordering here is confusing - it looks like DeviceOnlyCompilation is
> related to the loop that sets PartialCompilation.
>
>
Moved DeviceOnlyCompilation down to where it's used.


> > +    for (unsigned i = 0, e = GpuArchList.size(); i != e; ++i)
> > +      Actions.push_back(
> > +          new CudaDeviceAction(std::unique_ptr<Action>(CudaDeviceActions[i]),
> > +                               GpuArchList[i], /* AtTopLevel */ true));
> > +    // Kill host action in case of device-only compilation.
> > +    if (DeviceOnlyCompilation)
> > +      Current.reset(nullptr);
> > +    return Current;
> > +  } else {
>
> Since the `if` returns unconditionally, an early exit is better than an
> else.
>
>
Done.
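
I.e. the device-only/partial case now returns early and the rest of the
function is un-nested, roughly like this (a sketch; the guard condition is my
paraphrase, not the exact committed code):

  if (PartialCompilation || DeviceOnlyCompilation) {
    for (unsigned i = 0, e = GpuArchList.size(); i != e; ++i)
      Actions.push_back(
          new CudaDeviceAction(std::unique_ptr<Action>(CudaDeviceActions[i]),
                               GpuArchList[i], /* AtTopLevel */ true));
    // Kill host action in case of device-only compilation.
    if (DeviceOnlyCompilation)
      Current.reset(nullptr);
    return Current;
  }
  // Complete compilation continues here without an 'else'.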



> > +    // Outputs of device actions during complete CUDA compilation get created
> > +    // with AtTopLevel=false and become inputs for the host action.
> > +    ActionList DeviceActions;
> > +    for (unsigned i = 0, e = GpuArchList.size(); i != e; ++i)
> > +      DeviceActions.push_back(
> > +          new CudaDeviceAction(std::unique_ptr<Action>(CudaDeviceActions[i]),
> > +                               GpuArchList[i], /* AtTopLevel */ false));
> > +    // Return a new host action that incorporates original host action and all
> > +    // device actions.
> > +    return std::unique_ptr<Action>(
> > +        new CudaHostAction(std::move(Current), DeviceActions));
> > +  }
> > +}
> > +
> >  void Driver::BuildActions(const ToolChain &TC, DerivedArgList &Args,
> >                            const InputList &Inputs, ActionList &Actions) const {
> >    llvm::PrettyStackTraceString CrashInfo("Building compilation actions");
> > @@ -1312,6 +1412,25 @@ void Driver::BuildActions(const ToolChai
> >        continue;
> >      }
> >
> > +    phases::ID CudaInjectionPhase;
> > +    if (isSaveTempsEnabled()) {
> > +      // All phases are done independently, inject GPU blobs during compilation
> > +      // phase as that's where we generate glue code to init them.
> > +      CudaInjectionPhase = phases::Compile;
> > +    } else {
> > +      // Assumes that clang does everything up until linking phase, so we inject
> > +      // cuda device actions at the last step before linking. Otherwise CUDA
> > +      // host action forces preprocessor into a separate invocation.
> > +      if (FinalPhase == phases::Link) {
>
> This is `else if`. Nesting the if inside the else is confusing.
>
> > +        for (auto i = PL.begin(), e = PL.end(); i != e; ++i) {
> > +          auto next = i + 1;
> > +          if (next != e && *next == phases::Link)
>
> What if i == e?
>
>
Then CudaInjectionPhase would be left uninitialized, which is a bug. Fixed;
the fix also makes the 'else if' point you made above moot.
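
The fix initializes the variable up front, which also collapses the nesting
into a flat if/else-if chain. Roughly (a sketch; the actual change is in
D11273):

  // Default to the final phase so the variable is never read uninitialized.
  phases::ID CudaInjectionPhase = FinalPhase;
  if (isSaveTempsEnabled()) {
    // All phases are done independently; inject during compilation.
    CudaInjectionPhase = phases::Compile;
  } else if (FinalPhase == phases::Link) {
    // Inject at the last phase before linking.
    for (auto I = PL.begin(), E = PL.end(); I != E; ++I)
      if (I + 1 != E && *(I + 1) == phases::Link)
        CudaInjectionPhase = *I;
  }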



> > +            CudaInjectionPhase = *i;
> > +        }
> > +      } else
> > +        CudaInjectionPhase = FinalPhase;
> > +    }
> > +
> >      // Build the pipeline for this file.
> >      std::unique_ptr<Action> Current(new InputAction(*InputArg, InputType));
> >      for (SmallVectorImpl<phases::ID>::iterator i = PL.begin(), e = PL.end();
> > @@ -1337,6 +1456,15 @@ void Driver::BuildActions(const ToolChai
> >
> >        // Otherwise construct the appropriate action.
> >        Current = ConstructPhaseAction(TC, Args, Phase, std::move(Current));
> > +
> > +      if (InputType == types::TY_CUDA && Phase == CudaInjectionPhase &&
> > +          !Args.hasArg(options::OPT_cuda_host_only)) {
> > +        Current = buildCudaActions(*this, TC, Args, InputArg, InputType,
> > +                                   std::move(Current), Actions);
> > +        if (!Current)
> > +          break;
> > +      }
>
> This whole block of code seems out of place. In fact, why doesn't CUDA
> just have its own phases::ID? It seems like that would simplify all of
> this.
>
>
It does not quite fit the 'phase' concept. Phases implicitly assume that
things are done linearly. CUDA, on the other hand, effectively builds a
tree, and the weird code above is a somewhat broken way to glue the results
of device-side builds into just the right action of the host build pipeline
so that those results can be used to modify compiler arguments.

I've got a better way of dealing with this, which I'll send as a separate
patch for review soon.

> +
> >        if (Current->getType() == types::TY_Nothing)
> >          break;
> >      }
> > @@ -1576,7 +1704,13 @@ static const Tool *SelectToolForJob(Comp
> >    if (isa<BackendJobAction>(JA)) {
> >      // Check if the compiler supports emitting LLVM IR.
> >      assert(Inputs->size() == 1);
> > -    JobAction *CompileJA = cast<CompileJobAction>(*Inputs->begin());
> > +    JobAction *CompileJA;
> > +    // Extract real host action, if it's a CudaHostAction.
> > +    if (CudaHostAction *CudaHA = dyn_cast<CudaHostAction>(*Inputs->begin()))
> > +      CompileJA = cast<CompileJobAction>(*CudaHA->begin());
> > +    else
> > +      CompileJA = cast<CompileJobAction>(*Inputs->begin());
>
> This seems kind of awkward. Not sure I have a better suggestion though.
>
>
This will also be changed in the upcoming patch.



> >    if (const InputAction *IA = dyn_cast<InputAction>(A)) {
> >      // FIXME: It would be nice to not claim this here; maybe the old
> scheme of
> >      // just using Args was better?
> > @@ -1635,11 +1783,24 @@ void Driver::BuildJobsForAction(Compilat
> >      else
> >        TC = &C.getDefaultToolChain();
> >
> > -    BuildJobsForAction(C, *BAA->begin(), TC, BAA->getArchName(), AtTopLevel,
> > +    BuildJobsForAction(C, *BAA->begin(), TC, ArchName, AtTopLevel,
>
> I guess this is just an unrelated cleanup? Better to commit those
> separately in the future.
>
>
Noted.



> >                         MultipleArchs, LinkingOutput, Result);
> >      return;
> >    }
> >
> > +  if (const CudaDeviceAction *CDA = dyn_cast<CudaDeviceAction>(A)) {
> > +    // Figure out which NVPTX triple to use for device-side compilation based on
> > +    // whether host is 64-bit.
> > +    llvm::Triple DeviceTriple(C.getDefaultToolChain().getTriple().isArch64Bit()
> > +                                  ? "nvptx64-nvidia-cuda"
> > +                                  : "nvptx-nvidia-cuda");
>
> It seems like a pretty bad layering violation to need to know these
> particular triples right here...
>
>
Well, it's up to the driver to figure out the device-side triple. While I
agree that it's way too much target knowledge for the driver, I'm short on
practical ideas. Do you have any suggestions?


> Is getDefaultToolChain() even the right thing to do here? Isn't `TC`
> more appropriate?
>

Good point. TC indeed should do.
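
I.e. something like this (a one-line sketch, assuming TC is the per-job
toolchain pointer at this point):

  llvm::Triple DeviceTriple(TC->getTriple().isArch64Bit()
                                ? "nvptx64-nvidia-cuda"
                                : "nvptx-nvidia-cuda");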


-- 
--Artem Belevich