[cfe-dev] compile a simple CUDA program using clang-3.4

Tue Aug 12 10:07:04 PDT 2014

Hi everyone!

I tried to compile a CUDA program using clang-3.4. The program is taken
from the NVIDIA_CUDA-5.0_samples collection; it is a very simple program
that adds two vectors.

I made a few modifications to the original code:
1. I replaced the __global__ CUDA C keyword with __attribute__((global))
    so that clang can be used as the compiler.
2. I added includes of <stdlib.h> and <math.h>.
3. I added declarations of blockDim, blockIdx, and threadIdx.

// ==================== code begin ========================
/**
 * Vector addition: C = A + B.
 *
 * This sample is a very basic sample that implements element by element
 * vector addition. It is the same as the sample illustrating Chapter 2
 * of the programming guide with some additions like error checking.
 */

#include <stdio.h>

// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda_runtime.h>

#include <malloc.h>
#include <stdlib.h>
#include <math.h>
extern dim3 blockDim, blockIdx, threadIdx;

/**
 * CUDA Kernel Device code
 *
 * Computes the vector addition of A and B into C. The 3 vectors have the
 * same number of elements numElements.
 */
__attribute__((global)) void
vectorAdd(const float *A, const float *B, float *C, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        C[i] = A[i] + B[i];
    }
}

int main(void)
{
    ... ...
    return 0;
}

//==================== code end ========================

$ clang -std=cuda -I/usr/local/cuda-5.0/include -o vectorAdd1 vectorAdd1.cu \
    -L/usr/local/cuda-5.0/lib64 -lcudart

Compilation succeeded, but running the program gives:

    [Vector addition of 50000 elements]
    Copy input data from the host memory to the CUDA device
    CUDA kernel launch with 196 blocks of 256 threads
    Failed to launch vectorAdd kernel (error code invalid device function )!

while the expected output is:

    [Vector addition of 50000 elements]
    Copy input data from the host memory to the CUDA device
    CUDA kernel launch with 196 blocks of 256 threads
    Copy output data from the CUDA device to the host memory
    Done
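
As a side note on the numbers in the output above: 196 blocks of 256
threads for 50000 elements is what a ceiling division would give. The sketch
below assumes the elided main() computes the grid size that way. The helper
names blocksPerGrid and totalThreads are hypothetical and do not appear in
the sample. It also shows why the kernel needs the "i < numElements" guard:
more threads are launched than there are elements.

```cpp
#include <cassert>

// Hypothetical helper (not from the sample): smallest number of blocks such
// that blocks * threadsPerBlock >= numElements, i.e. a ceiling division.
constexpr int blocksPerGrid(int numElements, int threadsPerBlock) {
    return (numElements + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: total number of threads the launch would create.
constexpr int totalThreads(int numElements, int threadsPerBlock) {
    return blocksPerGrid(numElements, threadsPerBlock) * threadsPerBlock;
}

// 50000 elements with 256 threads per block need 196 blocks.
static_assert(blocksPerGrid(50000, 256) == 196, "expected 196 blocks");

// 196 * 256 = 50176 threads are launched, so 176 threads have an index
// i >= numElements and must be masked off by the bounds check in the kernel.
static_assert(totalThreads(50000, 256) - 50000 == 176, "expected 176 idle threads");
```

Both checks are evaluated at compile time, so the snippet only needs a
C++11 host compiler and no CUDA toolkit.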

The error suggests that the vectorAdd function was not recognized as a valid
kernel. To see what happened, I compiled the program to LLVM IR. Only the IR
of the function vectorAdd is shown below.

$ clang -S -emit-llvm -std=cuda -I/usr/local/cuda-5.0/include \
    -o vectorAdd1.ll vectorAdd1.cu

//==================== code begin ==========================
define void @_Z9vectorAddPKfS0_Pfi(float* %A, float* %B, float* %C, i32 %numElements) #0 {
  %1 = alloca float*, align 8
  %2 = alloca float*, align 8
  %3 = alloca float*, align 8
  %4 = alloca i32, align 4
  store float* %A, float** %1, align 8
  store float* %B, float** %2, align 8
  store float* %C, float** %3, align 8
  store i32 %numElements, i32* %4, align 4
  %5 = bitcast float** %1 to i8*
  %6 = call i32 @cudaSetupArgument(i8* %5, i64 ptrtoint (i1** getelementptr (i1** null, i32 1) to i64), i64 0)
  %7 = icmp eq i32 %6, 0
  br i1 %7, label %8, label %22

; <label>:8                                       ; preds = %0
  %9 = bitcast float** %2 to i8*
  %10 = call i32 @cudaSetupArgument(i8* %9, i64 ptrtoint (i1** getelementptr (i1** null, i32 1) to i64), i64 ptrtoint (float** getelementptr ({ float*, float*, float*, i32 }* null, i64 0, i32 1) to i64))
  %11 = icmp eq i32 %10, 0
  br i1 %11, label %12, label %22

; <label>:12                                      ; preds = %8
  %13 = bitcast float** %3 to i8*
  %14 = call i32 @cudaSetupArgument(i8* %13, i64 ptrtoint (i1** getelementptr (i1** null, i32 1) to i64), i64 ptrtoint (float** getelementptr ({ float*, float*, float*, i32 }* null, i64 0, i32 2) to i64))
  %15 = icmp eq i32 %14, 0
  br i1 %15, label %16, label %22

; <label>:16                                      ; preds = %12
  %17 = bitcast i32* %4 to i8*
  %18 = call i32 @cudaSetupArgument(i8* %17, i64 ptrtoint (i32* getelementptr (i32* null, i32 1) to i64), i64 ptrtoint (i32* getelementptr ({ float*, float*, float*, i32 }* null, i64 0, i32 3) to i64))
  %19 = icmp eq i32 %18, 0
  br i1 %19, label %20, label %22

; <label>:20                                      ; preds = %16
  %21 = call i32 @cudaLaunch(i8* bitcast (void (float*, float*, float*, i32)* @_Z9vectorAddPKfS0_Pfi to i8*))
  br label %22

; <label>:22                                      ; preds = %20, %16, %12, %8, %0
  ret void
}

//======================= code end =====================
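
For readers less used to LLVM constant expressions: the body above contains
only calls to cudaSetupArgument and cudaLaunch and no floating-point add, so
it looks like a host-side launch stub and not the kernel body itself. In
each call, the "getelementptr (T* null, i32 1)" expression is the usual
idiom for sizeof(T), and the getelementptr into
{ float*, float*, float*, i32 } is the byte offset of one field. The sketch
below reproduces those values in plain C++. VectorAddArgs, ArgSlot and
argSlot are hypothetical names used only for illustration, and the concrete
numbers in the comments assume a typical 64-bit host with 8-byte pointers
and 4-byte int.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical struct mirroring the IR type { float*, float*, float*, i32 }
// that the stub uses to compute the argument offsets.
struct VectorAddArgs {
    const float *A;
    const float *B;
    float *C;
    int numElements;
};

// Hypothetical pair matching the (size, offset) values that the stub passes
// to cudaSetupArgument after the argument pointer.
struct ArgSlot {
    std::size_t size;
    std::size_t offset;
};

// Returns the size and offset for kernel argument `index` (0..3).
// With 8-byte pointers this gives (8, 0), (8, 8), (8, 16), (4, 24).
inline ArgSlot argSlot(int index) {
    switch (index) {
    case 0:  return { sizeof(const float *), offsetof(VectorAddArgs, A) };
    case 1:  return { sizeof(const float *), offsetof(VectorAddArgs, B) };
    case 2:  return { sizeof(float *), offsetof(VectorAddArgs, C) };
    default: return { sizeof(int), offsetof(VectorAddArgs, numElements) };
    }
}
```

Calling argSlot(1), for example, yields the size and offset that correspond
to the second cudaSetupArgument call in the IR.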

From the emitted IR, I can see no addrspace declarations, which I would
expect in NVVM IR. So this may be the problem.

Could anyone tell me how to fix this?

Any help is appreciated!

                                   suxing