[Openmp-dev] OpenMP GPU Target Offload in Clang

Tue Aug 21 09:37:02 PDT 2018

Hi Qiongsi,

On 08/21/2018 05:28 PM, Qiongsi Wu via Openmp-dev wrote:
> Hi, OpenMP dev community!
> 
> 
> Recently I tried setting up the OpenMP benchmarks for SPEC ACCEL and 
> test it with clang, but I ran into several difficulties.
> 
> 
> The core of the issue is that I was not able to get the workload onto 
> the GPUs.  I wrote the following small test
> 
> 
> //////////////////////////////////////////////////////////////////////////////////////////////////////////////
> 
> #define DATATYPE unsigned long long
> 
> /*gpu offload openmp*/
> DATATYPE reduce_gpu_omp(DATATYPE *arr, size_t size) {
>      DATATYPE result = IDENTITY;
> #pragma omp target data map(tofrom:arr[:size]) map(tofrom:result)
>      {
> #pragma omp target teams distribute parallel for reduction(+:result) 
> schedule(static, 1)
>          for (size_t i = 0; i < size; i++) {
>              result += arr[i];
>          }
>      }
>      return result;
> }
> 
> //////////////////////////////////////////////////////////////////////////////////////////////////////////////
> 

When compiling your code, but leaving out the reduction, I can execute 
this on a GPU. With the reduction the code seems to hang for me. 
(Posting a full compile-able example next time would be preferred!)

This is how I compiled:

clang -O3 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -o reduce reduce.c

To see whether the code is actually executed on the device, you can add 
this to the loop for debugging:

if (i==0) printf("omp_is_initial_device=%i\n", omp_is_initial_device());

> And compiled that with clang trunk with the following commands:
> 
> clang -O3 -fopenmp -omptargets=nvptx64sm_35-nvidia-linux -Wall -o reduce 
> reduce.c
> clang -O3 -fopenmp -omptargets=nvptx64sm_35-nvidia-linux-cuda -Wall -o 
> reduce reduce.c
> clang -O3 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Wall -o reduce 
> reduce.c
> 
> The offloading to GPU was unsuccessful for all these commands. That 
> said, the CPU load did go up when the kernel above was run,  so the 
> offloading did happen, but the computation was offloaded to the CPU, not 
> the GPU.
> 
> My speculation is that I missed some steps setting up the 
> compiler/libraries and the offloading did not happen correctly. Or it 
> could be the fact that reductions were not supported across teams (as 
> stated here https://clang.llvm.org/docs/OpenMPSupport.html).
> 
> In the end, I would like to ask two questions:
> 
>  1. What is a good candidate of llvm based compiler to test OpenMP GPU
>     offloading?  Should clang-ykt be used instead of clang trunk?

I used clang trunk, compiled like:

cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$INSTALL  \
       -DCLANG_OPENMP_NVPTX_DEFAULT_ARCH=sm_60 \
       -DLIBOMPTARGET_NVPTX_COMPUTE_CAPABILITIES=35,60 \
       $SRC

>  2. What is the recommended procedure for compiler and linker flags to
>     build programs with GPU offloading? Maybe I am not searching
>     correctly, but I was not able to find a documentation on how that is
>     supposed to be done. Additionally, will the compiler show some
>     warning if offloading to GPU is unsuccessful?

Successful / unsuccessful is a runtime decision. You will get an error 
and execution aborts, once this patch is submitted:

https://reviews.llvm.org/D50522

Best
Joachim

> 
> 
> Thanks for your help!
> 
> Sincerely,
> Qiongsi
> 
> 
> 
> _______________________________________________
> Openmp-dev mailing list
> Openmp-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/openmp-dev
>