[cfe-dev] compile a simple CUDA program using clang-3.4

Xing Su suxing1989 at gmail.com
Wed Aug 13 17:39:13 PDT 2014


Ok, I'll check the test suite then. Thanks for your help Eli :-)

I'm trying to manually fix the generated IR file and use libnvvm to 
compile it to device code.

On 08/13/2014 10:55 PM, Eli Bendersky wrote:
>
>
>
> On Tue, Aug 12, 2014 at 6:38 PM, Xing Su <suxing1989 at gmail.com 
> <mailto:suxing1989 at gmail.com>> wrote:
>
>     Thanks Eli,
>
>     It sounds that at the moment support to CUDA in Clang is far from
>     production use ...
>
>     I'd like to know what the status of CUDA support is in clang,
>     but I am not able to find anything reporting this.
>     Are you a developer of this part, or could you give me some
>     guidance?
>
>
> There's no documentation of these parts of Clang, as far as I know, 
> besides the source code. To get a feel for what's supported take a 
> look at the existing tests (specifically test/SemaCUDA and 
> test/CodeGenCUDA dirs).
>
> Eli
>
>
>
>                                     suxing
>
>
>     On 2014/8/13 1:18, Eli Bendersky wrote:
>>
>>
>>
>>     On Tue, Aug 12, 2014 at 10:07 AM, Xing Su <suxing1989 at gmail.com
>>     <mailto:suxing1989 at gmail.com>> wrote:
>>
>>         hi everyone!
>>
>>         I tried to compile a CUDA program using clang-3.4. This
>>         program is taken
>>         from NVIDIA_CUDA-5.0_samples collection and it's a very
>>         simple program adding
>>         two vectors.
>>
>>         A few modifications to the original code were done, including
>>         1. I substituted the __global__ CUDA C keyword with
>>         __attribute__((global))
>>             in order to use clang as the compiler.
>>         2. <stdlib.h> <math.h> were added.
>>         3. declarations of blockDim, blockIdx, threadIdx were added.
>>
>>         // ==================== code begin ========================
>>         /**
>>
>>          * Vector addition: C = A + B.
>>          *
>>          * This sample is a very basic sample that implements element
>>         by element
>>          * vector addition. It is the same as the sample illustrating
>>         Chapter 2
>>          * of the programming guide with some additions like error
>>         checking.
>>          */
>>
>>         #include <stdio.h>
>>
>>         // For the CUDA runtime routines (prefixed with "cuda_")
>>         #include <cuda_runtime.h>
>>
>>         #include <malloc.h>
>>         #include <stdlib.h>
>>         #include <math.h>
>>         extern dim3 blockDim, blockIdx, threadIdx;
>>
>>         /**
>>          * CUDA Kernel Device code
>>          *
>>          * Computes the vector addition of A and B into C. The 3
>>         vectors have the same
>>          * number of elements numElements.
>>          */
>>         __attribute__((global)) void
>>         vectorAdd(const float *A, const float *B, float *C, int
>>         numElements)
>>         {
>>             int i = blockDim.x * blockIdx.x + threadIdx.x;
>>
>>             if (i < numElements)
>>             {
>>                 C[i] = A[i] + B[i];
>>             }
>>         }
>>
>>         int main(void)
>>         {
>>             ... ...
>>             return 0;
>>         }
>>
>>         //==================== code end ========================
>>
>>         $ clang -std=cuda -I/usr/local/cuda-5.0/include -o vectorAdd1
>>         vectorAdd1.cu -L/usr/local/cuda-5.0/lib64 -lcudart
>>
>>         The compiling was successful, but running the program gives:
>>
>>             [Vector addition of 50000 elements]
>>             Copy input data from the host memory to the CUDA device
>>             CUDA kernel launch with 196 blocks of 256 threads
>>             Failed to launch vectorAdd kernel (error code invalid
>>         device function )!
>>
>>         while the expected output is
>>
>>             [Vector addition of 50000 elements]
>>         Copy input data from the host memory to the CUDA device
>>         CUDA kernel launch with 196 blocks of 256 threads
>>         Copy output data from the CUDA device to the host memory
>>             Done
>>
>>         The result said that the vectorAdd function was not a valid
>>         kernel. To see what happend,
>>         I compiled the program to LLVM IR. Only the IR of function
>>         vectorAdd is shown below.
>>
>>         $ clang -S -emit-llvm -std=cuda
>>          -I/usr/local/cuda-5.0/include -o vectorAdd1.ll vectorAdd1.cu
>>
>>         //==================== code begin ==========================
>>         define void @_Z9vectorAddPKfS0_Pfi(float* %A, float* %B,
>>         float* %C, i32 %numElements) #0 {
>>           %1 = alloca float*, align 8
>>           %2 = alloca float*, align 8
>>           %3 = alloca float*, align 8
>>           %4 = alloca i32, align 4
>>           store float* %A, float** %1, align 8
>>           store float* %B, float** %2, align 8
>>           store float* %C, float** %3, align 8
>>           store i32 %numElements, i32* %4, align 4
>>           %5 = bitcast float** %1 to i8*
>>           %6 = call i32 @cudaSetupArgument(i8* %5, i64 ptrtoint (i1**
>>         getelementptr (i1** null, i32 1) to i64), i64 0)
>>           %7 = icmp eq i32 %6, 0
>>           br i1 %7, label %8, label %22
>>
>>         ; <label>:8                           ; preds = %0
>>           %9 = bitcast float** %2 to i8*
>>           %10 = call i32 @cudaSetupArgument(i8* %9, i64 ptrtoint
>>         (i1** getelementptr (i1** null, i32 1) to i64), i64 ptrtoint
>>         (float** getelementptr ({ float*, float*, float*, i32 }*
>>         null, i64 0, i32 1) to i64))
>>           %11 = icmp eq i32 %10, 0
>>           br i1 %11, label %12, label %22
>>
>>         ; <label>:12                            ; preds = %8
>>           %13 = bitcast float** %3 to i8*
>>           %14 = call i32 @cudaSetupArgument(i8* %13, i64 ptrtoint
>>         (i1** getelementptr (i1** null, i32 1) to i64), i64 ptrtoint
>>         (float** getelementptr ({ float*, float*, float*, i32 }*
>>         null, i64 0, i32 2) to i64))
>>           %15 = icmp eq i32 %14, 0
>>           br i1 %15, label %16, label %22
>>
>>         ; <label>:16                            ; preds = %12
>>           %17 = bitcast i32* %4 to i8*
>>           %18 = call i32 @cudaSetupArgument(i8* %17, i64 ptrtoint
>>         (i32* getelementptr (i32* null, i32 1) to i64), i64 ptrtoint
>>         (i32* getelementptr ({ float*, float*, float*, i32 }* null,
>>         i64 0, i32 3) to i64))
>>           %19 = icmp eq i32 %18, 0
>>           br i1 %19, label %20, label %22
>>
>>         ; <label>:20                            ; preds = %16
>>           %21 = call i32 @cudaLaunch(i8* bitcast (void (float*,
>>         float*, float*, i32)* @_Z9vectorAddPKfS0_Pfi to i8*))
>>           br label %22
>>
>>         ; <label>:22                            ; preds = %20, %16,
>>         %12, %8, %0
>>           ret void
>>         }
>>
>>         //======================= code end =====================
>>
>>         From the emitted IR code, I can see NO addrspace
>>         declarations, which should be specific to
>>         NVVM IR. So this may be the problem.
>>
>>         Could anyone tell me how to fix this?
>>
>>         Any help is appreciated!
>>
>>              suxing
>>
>>     Hi Suxing,
>>
>>     Clang cannot yet compile CUDA out-of-the-box like that.
>>     Definitely not mixed CUDA code (where host and device code are in
>>     the same file). Clang can be made, with some effort, to compile
>>     stand-alone device code, but some critical steps are missing. For
>>     example, you have to map threadIdx and other similar special
>>     globals to appropriate intrinsic calls
>>     (@llvm.nvvm.read.ptx.sreg.tid.*), and not just declare them.
>>
>>     Eli
>>
>>
>>
>
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20140814/33287a6f/attachment.html>


More information about the cfe-dev mailing list