[PATCH] D35703: [GPGPU] Add support for NVIDIA libdevice

Tobias Grosser via llvm-commits llvm-commits at lists.llvm.org
Mon Jul 24 22:40:47 PDT 2017


[It seems phabricator is down, so I reply by email]

Just to set the context: this patch is for Polly (polly.llvm.org), in
case that was not clear.

> ================
> Comment at: lib/CodeGen/PPCGCodeGeneration.cpp:113
>  static cl::opt<std::string>
>      CudaVersion("polly-acc-cuda-version",
>                  cl::desc("The CUDA version to compile for"), cl::Hidden,
> ----------------
> The variable appears to be unused.
> 
> On a side note, please consider that there's already a way to specify GPU
> variant for NVPTX back-end. This option is either going to be redundant
> or you'll need a good explanation for what's supposed to happen when its
> value conflicts with whatever GPU variant NVPTX back-end thinks it's
> supposed to generate the code for.

This option has been in use for about a year to select the NVPTX backend
variant Polly is supposed to target. I am not sure if (or how) we can
access the settings of the current NVPTX backend, as the Polly-ACC CUDA
compilation is embedded into the Polly-ACC compilation path.

> ================
> Comment at: lib/CodeGen/PPCGCodeGeneration.cpp:118
>  static cl::opt<int>
>      MinCompute("polly-acc-mincompute",
>                 cl::desc("Minimal number of compute statements to run on
>                 GPU."),
> ----------------
> This also appears to be unused. 
> Please remove them from this patch and re-introduce them in subsequent
> patches that really need them.
 
This option has been in use for about a year by the following piece of
code (which is not part of this patch):

```
__isl_give isl_ast_expr *
createSufficientComputeCheck(Scop &S, __isl_keep isl_ast_build *Build) {
  auto Iterations = getNumberOfIterations(S, Build);
  auto *MinComputeVal = isl_val_int_from_si(S.getIslCtx(), MinCompute);
  auto *MinComputeExpr = isl_ast_expr_from_val(MinComputeVal);
  return isl_ast_expr_ge(Iterations, MinComputeExpr);
}
```

> ================
> Comment at: lib/CodeGen/PPCGCodeGeneration.cpp:2277
> +    std::string CUDALibDeviceFunc = getCUDALibDeviceFuntion(&F);
> +    if (CUDALibDeviceFunc.length() != 0) {
> +      F.setName(CUDALibDeviceFunc);
> ----------------
> !empty()

Very good idea!

> ================
> Comment at: lib/CodeGen/PPCGCodeGeneration.cpp:2288
> +  if (Arch != GPUArch::NVPTX64)
> +    return;
> +
> ----------------
> What's supposed to happen in this case? Consider adding a diagnostic
> message.

I turned this into an assert.

> ================
> Comment at: lib/CodeGen/PPCGCodeGeneration.cpp:2293
> +
> +    errs() << CUDALibDevice << "\n";
> +    auto LibDeviceModule =
> ----------------
> But why are you still printing the libdevice name on stderr?

This was a leftover from debugging. I dropped it.
 
> ================
> Comment at: lib/CodeGen/PPCGCodeGeneration.cpp:2323
>      if (FailOnVerifyModuleFailure)
>        llvm_unreachable("VerifyModule failed.");
>  
> ----------------
> I'm curious -- when would it be OK to proceed if verifier has failed? 
> 
> Should  this option be other way around -- and make LLVM fail by default
> on verifier failure? That would be my expectation of normal behavior. One
> could conceivably use this option for debugging purposes to force LLVM to
> proceed even when verifier has failed, but in general I believe that by
> default the errors should be reported as early as possible.

This option is unrelated to this specific patch. Polly-ACC is a tool
that tries to auto-parallelize CPU code, so in case of failures we can
always safely fall back to executing the original CPU code. We are
working towards flipping the default of this option, but we still need
to do more testing to make sure the assertion does not trigger too
often. We introduced it in the first place because we hit a couple of
NVPTX backend bugs that prevented us from compiling larger software (as
we always triggered some of them). With this option, we can keep running
tests and report the bugs we encounter one by one.

> ================
> Comment at: lib/CodeGen/PPCGCodeGeneration.cpp:107-111
> +static cl::opt<std::string> LibDevice(
> +    "polly-acc-libdevice", cl::desc("Path to CUDA libdevice"),
> cl::Hidden,
> +   
> cl::init("/usr/local/cuda-7.5/nvvm/libdevice/libdevice.compute_20.10.bc"),
> +    cl::ZeroOrMore, cl::cat(PollyCategory));
> +
> ----------------
> grosser wrote:
> > grosser wrote:
> > > singam-sanjay wrote:
> > > > tra wrote:
> > > > > This is something that is useful for all NVPTX users and should probably live there and it should not have any hardcoded path -- it's too easy to end up silently picking wrong library otherwise.
> > > > > 
> > > > > Hardcoded compute_20 is also problematic because it should depend on particular GPU arch we're compiling for.
> > > > > 
> > > > > Considering that LLVM has no idea about CUDA SDL location, this is sommething that should always be explicitly specified. Either base path + libdevice name derived from GPU arch, or complete path to specific libdevice variant (i.e. it's completely up to the user to provide correct libdevice).
> > > > Would it be better to call this CUDALibDevice or CULibDevice  instead ? since this applies only to NVPTX
> > > I use CUDALibDevice
> > @tra: Thank you for your comment. This is the very first commit to introduce this feature. We currently are in early beta tests. The library location is supposed to be provided by the user by setting the path with polly-acc-libdevice. I set the option to a very basic default. For now I expect the user to adjust this default. In the future we can add some generic infrastructure to LLVM to derive this path automatically. If a specific fixed path is too confusing I can also use an empty default and always prompt for a path.
> I believe no default would be a better option in this case as it
> minimizes possibility for things to go wrong silently.

OK, I now use an empty default.

> ================
> Comment at: test/GPGPU/libdevice-functions-copied-into-kernel.ll:38-69
> +define void @f(float* %A, float* %B, i32 %N) {
> +entry:
> +  br label %entry.split
> +
> +entry.split:                                      ; preds = %entry
> +  %cmp1 = icmp sgt i32 %N, 0
> +  br i1 %cmp1, label %for.body.lr.ph, label %for.end
> ----------------
> grosser wrote:
> > tra wrote:
> > > Can the test be reduced to just expf() call?
> > Unlikely. This is a test for Polly-ACC, where we auto-offload to CUDA. For this we need at least some parallelism, which means some loop.
> It may be worth reconsidering your approach and making this functionality
> generic to NVPTX so it can benefit all users of NVPTX back-end. The
> functionality is generic enough to benefit CUDA (and possibly OpenCL)
> which currently fail miserably if any of standard library functions sneak
> into IR. 

This is sufficient for Polly-ACC, so I would like to commit it to gain
some more experience with the library.

I am happy to start a discussion about adding this functionality to the
NVPTX backend. My main concern is that these libraries won't always be
available on the system the code is compiled on, which we would need to
find a way to handle. Should I start a thread on llvm-dev about this
topic?

Best,
Tobias

