r242085 - [cuda] Driver changes to compile and stitch together host and device-side CUDA code.

Artem Belevich tra at google.com
Fri Jul 17 13:01:18 PDT 2015


On Thu, Jul 16, 2015 at 3:28 PM, Justin Bogner <mail at justinbogner.com>
wrote:

> >     > +            CudaInjectionPhase = *i;
> >     > +        }
> >     > +      } else
> >     > +        CudaInjectionPhase = FinalPhase;
> >     > +    }
> >     > +
> >     >      // Build the pipeline for this file.
> >     >      std::unique_ptr<Action> Current(new InputAction(*InputArg, InputType));
> >     >      for (SmallVectorImpl<phases::ID>::iterator i = PL.begin(), e = PL.end();
> >     > @@ -1337,6 +1456,15 @@ void Driver::BuildActions(const ToolChai
> >     >
> >     >        // Otherwise construct the appropriate action.
> >     >        Current = ConstructPhaseAction(TC, Args, Phase, std::move(Current));
> >     > +
> >     > +      if (InputType == types::TY_CUDA && Phase == CudaInjectionPhase &&
> >     > +          !Args.hasArg(options::OPT_cuda_host_only)) {
> >     > +        Current = buildCudaActions(*this, TC, Args, InputArg, InputType,
> >     > +                                   std::move(Current), Actions);
> >     > +        if (!Current)
> >     > +          break;
> >     > +      }
> >
> >     This whole block of code seems out of place. In fact, why doesn't
> >     CUDA just have its own phases::ID? It seems like that would simplify
> >     all of this.
> >
> > It does not quite fit the 'phase' concept. Phases implicitly assume that
> > things are done linearly. CUDA, on the other hand, effectively builds a
> > tree, and the weird code above is a somewhat broken way to glue the
> > results of device-side builds into just the right action of the host
> > build pipeline, so that the results can be used to modify compiler
> > arguments.
>
> Well, I wouldn't say they assume things are done linearly - the phases
> are already used to build trees today. The link phase, for example, will
> depend on multiple different compile branches.
>
>
Let me rephrase -- SelectToolForJob() assumes that actions are arranged in a
linear list and that it can short-circuit the traversal based on that fact.
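To make the point concrete, here is a stripped-down, hypothetical model of the issue (this `Node` type and `chainLength` function are illustrative stand-ins, not clang::driver::Action or any code from the patch): a traversal that only ever follows the first input is exact for a linear pipeline, but never sees the extra device-side branches of a tree.

```cpp
#include <string>
#include <vector>

// Hypothetical, minimal stand-in for the driver's action graph (not
// clang::driver::Action).  Each node may have more than one input.
struct Node {
  std::string Name;
  std::vector<const Node *> Inputs;
};

// A walk in the spirit of the short-circuiting described above: it
// follows only the first input of every node.  That is exact for a
// linear list, but any additional (e.g. device-side) branches are
// simply never visited.
int chainLength(const Node *N) {
  int Len = 1;
  while (!N->Inputs.empty()) {
    N = N->Inputs.front();  // short-circuit: ignore Inputs[1..]
    ++Len;
  }
  return Len;
}
```

With a host-compile node that has both a host input and a device-side branch, the walk reports the same length as a purely linear pipeline would; the device branch is invisible to it.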


> The difference here is just that we want multiple branches for the same
> input file (which means we end up with a DAG instead of a tree I
> suppose).


The current implementation supplies the input source file to each device
action independently, so the Action graph is still a tree, though it could
be made a DAG. That does not really matter in this case, though. A lot of
code in lib/Driver assumes (sometimes explicitly with an assert, but often
implicitly, by ignoring all inputs other than the very first one) that we
only ever deal with one input. There are a few exceptions, but by and large
they are just that -- exceptions. The CUDA case is further complicated by
the fact that the device-side output requires special handling.
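The tree-vs-DAG distinction above can be sketched with a tiny hypothetical graph type (again, not the driver's real classes): if each branch gets its own copy of the input node, every node is reachable exactly once and the graph is a tree; if the branches share one input node, some node is visited more than once, i.e. the graph is a DAG.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical graph node (not clang::driver::Action).
struct N {
  std::string Name;
  std::vector<const N *> Inputs;
};

// Count total node visits vs. distinct nodes reached.  In a tree the
// two numbers match; in a DAG a shared node is visited more than once.
static void walk(const N *Node, int &Visits, std::set<const N *> &Seen) {
  ++Visits;
  Seen.insert(Node);
  for (const N *I : Node->Inputs)
    walk(I, Visits, Seen);
}

bool isTreeShaped(const N *Root) {
  int Visits = 0;
  std::set<const N *> Seen;
  walk(Root, Visits, Seen);
  return Visits == static_cast<int>(Seen.size());
}
```

Duplicating the input per branch (what the reply says the current code does) keeps `isTreeShaped` true; sharing a single input node across host and device branches makes it false.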


> I guess we don't have much infrastructure to support this more
> cleanly right now - I see that BindArchActions are already handled in a
> similarly awkward way.
>
> > I've got a better way of dealing with that, which I'll send as a separate
> > patch for review soon.
>
> Looking forward to it.
>

See http://reviews.llvm.org/D11280
I've added you to the reviewer list.


>
> >     > +
> >     >        if (Current->getType() == types::TY_Nothing)
> >     >          break;
> >     >      }
> >     > @@ -1576,7 +1704,13 @@ static const Tool *SelectToolForJob(Comp
> >     >    if (isa<BackendJobAction>(JA)) {
> >     >      // Check if the compiler supports emitting LLVM IR.
> >     >      assert(Inputs->size() == 1);
> >     > -    JobAction *CompileJA = cast<CompileJobAction>(*Inputs->begin());
> >     > +    JobAction *CompileJA;
> >     > +    // Extract real host action, if it's a CudaHostAction.
> >     > +    if (CudaHostAction *CudaHA = dyn_cast<CudaHostAction>(*Inputs->begin()))
> >     > +      CompileJA = cast<CompileJobAction>(*CudaHA->begin());
> >     > +    else
> >     > +      CompileJA = cast<CompileJobAction>(*Inputs->begin());
> >
> >     This seems kind of awkward. Not sure I have a better suggestion
> >     though.
> >
> > This will also be changed in the upcoming patch.
> >
> >
> >
> >     >    if (const InputAction *IA = dyn_cast<InputAction>(A)) {
> >     >      // FIXME: It would be nice to not claim this here; maybe the
> >     >      // old scheme of just using Args was better?
> >     > @@ -1635,11 +1783,24 @@ void Driver::BuildJobsForAction(Compilat
> >     >      else
> >     >        TC = &C.getDefaultToolChain();
> >     >
> >     > -    BuildJobsForAction(C, *BAA->begin(), TC, BAA->getArchName(), AtTopLevel,
> >     > +    BuildJobsForAction(C, *BAA->begin(), TC, ArchName, AtTopLevel,
> >
> >     I guess this is just an unrelated cleanup? Better to commit those
> >     separately in the future.
> >
> > Noted.
> >
> >
> >
> >     >                         MultipleArchs, LinkingOutput, Result);
> >     >      return;
> >     >    }
> >     >
> >     > +  if (const CudaDeviceAction *CDA = dyn_cast<CudaDeviceAction>(A)) {
> >     > +    // Figure out which NVPTX triple to use for device-side compilation
> >     > +    // based on whether host is 64-bit.
> >     > +    llvm::Triple DeviceTriple(C.getDefaultToolChain().getTriple().isArch64Bit()
> >     > +                                  ? "nvptx64-nvidia-cuda"
> >     > +                                  : "nvptx-nvidia-cuda");
> >
> >     It seems like a pretty bad layering violation to need to know these
> >     particular triples right here...
> >
> > Well, it's up to the driver to figure out the device-side triple. While
> > I agree that it's way too much target knowledge for the driver, I'm
> > short on practical ideas. Do you have any suggestions?
>
> I guess I might expect this to be figured out in buildCudaActions, near
> where we figure out the device architectures. I guess this triple is
> more based on the host-side target though?
>

The device triple is assumed to be a variant of NVPTX matching the
32/64-bitness of the host. I guess I can compute the triple and stash it in
CudaDeviceAction along with the GPU arch.
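The rule stated above is simple enough to sketch in isolation. The helper below is hypothetical (the patch computes the triple inline from `C.getDefaultToolChain().getTriple().isArch64Bit()`); it just isolates the host-bitness-to-device-triple mapping being discussed.

```cpp
#include <string>

// Hypothetical helper illustrating the mapping discussed above: the
// device-side triple is an NVPTX variant chosen by the host's pointer
// width.  Not part of the actual patch, which computes this inline.
std::string getDeviceTripleString(bool HostIs64Bit) {
  return HostIs64Bit ? "nvptx64-nvidia-cuda" : "nvptx-nvidia-cuda";
}
```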


>
> This is already a little bit broken, but ideally BuildJobsForAction
> would just do the tree traversal, and the details about how to do that
> are abstracted elsewhere.


Makes sense. I'll move triple calculation to buildCudaActions.


> I guess we just don't have a very good story
> about how to deal with multiple toolchains in the same build pipeline
> today.
>
>

-- 
--Artem Belevich