[cfe-dev] compile a simple CUDA program using clang-3.4

Tue Aug 12 10:18:51 PDT 2014

On Tue, Aug 12, 2014 at 10:07 AM, Xing Su <suxing1989 at gmail.com> wrote:

> hi everyone!
>
> I tried to compile a CUDA program using clang-3.4. This program is taken
> from NVIDIA_CUDA-5.0_samples collection and it's a very simple program
> adding
> two vectors.
>
> A few modifications to the original code were done, including
> 1. I substituted the __global__ CUDA C keyword with __attribute__((global))
>     in order to use clang as the compiler.
> 2. <stdlib.h> <math.h> were added.
> 3. declarations of blockDim, blockIdx, threadIdx were added.
>
> // ==================== code begin ========================
> /**
>
>  * Vector addition: C = A + B.
>  *
>  * This sample is a very basic sample that implements element by element
>  * vector addition. It is the same as the sample illustrating Chapter 2
>  * of the programming guide with some additions like error checking.
>  */
>
> #include <stdio.h>
>
> // For the CUDA runtime routines (prefixed with "cuda_")
> #include <cuda_runtime.h>
>
> #include <malloc.h>
> #include <stdlib.h>
> #include <math.h>
> extern dim3 blockDim, blockIdx, threadIdx;
>
> /**
>  * CUDA Kernel Device code
>  *
>  * Computes the vector addition of A and B into C. The 3 vectors have the
> same
>  * number of elements numElements.
>  */
> __attribute__((global)) void
> vectorAdd(const float *A, const float *B, float *C, int numElements)
> {
>     int i = blockDim.x * blockIdx.x + threadIdx.x;
>
>     if (i < numElements)
>     {
>         C[i] = A[i] + B[i];
>     }
> }
>
> int main(void)
> {
>     ... ...
>     return 0;
> }
>
> //==================== code end ========================
>
> $ clang -std=cuda -I/usr/local/cuda-5.0/include -o vectorAdd1
> vectorAdd1.cu -L/usr/local/cuda-5.0/lib64 -lcudart
>
> The compiling was successful, but running the program gives:
>
>     [Vector addition of 50000 elements]
>     Copy input data from the host memory to the CUDA device
>     CUDA kernel launch with 196 blocks of 256 threads
>     Failed to launch vectorAdd kernel (error code invalid device function
> )!
>
> while the expected output is
>
>     [Vector addition of 50000 elements]
> Copy input data from the host memory to the CUDA device
> CUDA kernel launch with 196 blocks of 256 threads
> Copy output data from the CUDA device to the host memory
>     Done
>
> The result said that the vectorAdd function was not a valid kernel. To see
> what happend,
> I compiled the program to LLVM IR. Only the IR of function vectorAdd is
> shown below.
>
> $ clang -S -emit-llvm -std=cuda  -I/usr/local/cuda-5.0/include -o
> vectorAdd1.ll vectorAdd1.cu
>
> //==================== code begin ==========================
> define void @_Z9vectorAddPKfS0_Pfi(float* %A, float* %B, float* %C, i32
> %numElements) #0 {
>   %1 = alloca float*, align 8
>   %2 = alloca float*, align 8
>   %3 = alloca float*, align 8
>   %4 = alloca i32, align 4
>   store float* %A, float** %1, align 8
>   store float* %B, float** %2, align 8
>   store float* %C, float** %3, align 8
>   store i32 %numElements, i32* %4, align 4
>   %5 = bitcast float** %1 to i8*
>   %6 = call i32 @cudaSetupArgument(i8* %5, i64 ptrtoint (i1**
> getelementptr (i1** null, i32 1) to i64), i64 0)
>   %7 = icmp eq i32 %6, 0
>   br i1 %7, label %8, label %22
>
> ; <label>:8                                       ; preds = %0
>   %9 = bitcast float** %2 to i8*
>   %10 = call i32 @cudaSetupArgument(i8* %9, i64 ptrtoint (i1**
> getelementptr (i1** null, i32 1) to i64), i64 ptrtoint (float**
> getelementptr ({ float*, float*, float*, i32 }* null, i64 0, i32 1) to i64))
>   %11 = icmp eq i32 %10, 0
>   br i1 %11, label %12, label %22
>
> ; <label>:12                                      ; preds = %8
>   %13 = bitcast float** %3 to i8*
>   %14 = call i32 @cudaSetupArgument(i8* %13, i64 ptrtoint (i1**
> getelementptr (i1** null, i32 1) to i64), i64 ptrtoint (float**
> getelementptr ({ float*, float*, float*, i32 }* null, i64 0, i32 2) to i64))
>   %15 = icmp eq i32 %14, 0
>   br i1 %15, label %16, label %22
>
> ; <label>:16                                      ; preds = %12
>   %17 = bitcast i32* %4 to i8*
>   %18 = call i32 @cudaSetupArgument(i8* %17, i64 ptrtoint (i32*
> getelementptr (i32* null, i32 1) to i64), i64 ptrtoint (i32* getelementptr
> ({ float*, float*, float*, i32 }* null, i64 0, i32 3) to i64))
>   %19 = icmp eq i32 %18, 0
>   br i1 %19, label %20, label %22
>
> ; <label>:20                                      ; preds = %16
>   %21 = call i32 @cudaLaunch(i8* bitcast (void (float*, float*, float*,
> i32)* @_Z9vectorAddPKfS0_Pfi to i8*))
>   br label %22
>
> ; <label>:22                                      ; preds = %20, %16, %12,
> %8, %0
>   ret void
> }
>
> //======================= code end =====================
>
> From the emitted IR code, I can see NO addrspace declarations, which
> should be specific to
> NVVM IR. So this may be the problem.
>
> Could anyone tell me how to fix this?
>
> Any help is appreciated!
>
>                                    suxing
>
Hi Suxing,

Clang cannot yet compile CUDA out-of-the-box like that. Definitely not
mixed CUDA code (where host and device code are in the same file). Clang
can be made, with some effort, to compile stand-alone device code, but some
critical steps are missing. For example, you have to map threadIdx and
other similar special globals to appropriate intrinsic calls
(@llvm.nvvm.read.ptx.sreg.tid.*), and not just declare them.

Eli
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20140812/8e7af69a/attachment.html>