[cfe-dev] compile a simple CUDA program using clang-3.4

Wed Aug 13 17:43:41 PDT 2014

Hi Madhur, I want to do some research on heterogeneous programming model.
What about you? Perhaps we can find something in common :-)

Xing

On 08/14/2014 01:00 AM, Madhur Amilkanthwar wrote:
> I too suffered a lot for compiling CUDA C with Clang 3.2.
> By the way, Xing why do you want to compile CUDA C with Clang?
>
>
> On Wed, Aug 13, 2014 at 8:25 PM, Eli Bendersky <eliben at google.com 
> <mailto:eliben at google.com>> wrote:
>
>
>
>
>     On Tue, Aug 12, 2014 at 6:38 PM, Xing Su <suxing1989 at gmail.com
>     <mailto:suxing1989 at gmail.com>> wrote:
>
>         Thanks Eli,
>
>         It sounds that at the moment support to CUDA in Clang is far from
>         production use ...
>
>         I'd like to know what the status of CUDA support is in clang,
>         but I am not able to find anything reporting this.
>         Are you a developer of this part, or could you give me some
>         guidance?
>
>
>     There's no documentation of these parts of Clang, as far as I
>     know, besides the source code. To get a feel for what's supported
>     take a look at the existing tests (specifically test/SemaCUDA and
>     test/CodeGenCUDA dirs).
>
>     Eli
>
>
>
>                                         suxing
>
>
>         On 2014/8/13 1:18, Eli Bendersky wrote:
>>
>>
>>
>>         On Tue, Aug 12, 2014 at 10:07 AM, Xing Su
>>         <suxing1989 at gmail.com <mailto:suxing1989 at gmail.com>> wrote:
>>
>>             hi everyone!
>>
>>             I tried to compile a CUDA program using clang-3.4. This
>>             program is taken
>>             from NVIDIA_CUDA-5.0_samples collection and it's a very
>>             simple program adding
>>             two vectors.
>>
>>             A few modifications to the original code were done, including
>>             1. I substituted the __global__ CUDA C keyword with
>>             __attribute__((global))
>>                 in order to use clang as the compiler.
>>             2. <stdlib.h> <math.h> were added.
>>             3. declarations of blockDim, blockIdx, threadIdx were added.
>>
>>             // ==================== code begin ========================
>>             /**
>>
>>              * Vector addition: C = A + B.
>>              *
>>              * This sample is a very basic sample that implements
>>             element by element
>>              * vector addition. It is the same as the sample
>>             illustrating Chapter 2
>>              * of the programming guide with some additions like
>>             error checking.
>>              */
>>
>>             #include <stdio.h>
>>
>>             // For the CUDA runtime routines (prefixed with "cuda_")
>>             #include <cuda_runtime.h>
>>
>>             #include <malloc.h>
>>             #include <stdlib.h>
>>             #include <math.h>
>>             extern dim3 blockDim, blockIdx, threadIdx;
>>
>>             /**
>>              * CUDA Kernel Device code
>>              *
>>              * Computes the vector addition of A and B into C. The 3
>>             vectors have the same
>>              * number of elements numElements.
>>              */
>>             __attribute__((global)) void
>>             vectorAdd(const float *A, const float *B, float *C, int
>>             numElements)
>>             {
>>                 int i = blockDim.x * blockIdx.x + threadIdx.x;
>>
>>                 if (i < numElements)
>>                 {
>>                     C[i] = A[i] + B[i];
>>                 }
>>             }
>>
>>             int main(void)
>>             {
>>                 ... ...
>>                 return 0;
>>             }
>>
>>             //==================== code end ========================
>>
>>             $ clang -std=cuda -I/usr/local/cuda-5.0/include -o
>>             vectorAdd1 vectorAdd1.cu -L/usr/local/cuda-5.0/lib64 -lcudart
>>
>>             The compiling was successful, but running the program gives:
>>
>>                 [Vector addition of 50000 elements]
>>                 Copy input data from the host memory to the CUDA device
>>                 CUDA kernel launch with 196 blocks of 256 threads
>>                 Failed to launch vectorAdd kernel (error code invalid
>>             device function )!
>>
>>             while the expected output is
>>
>>                 [Vector addition of 50000 elements]
>>             Copy input data from the host memory to the CUDA device
>>             CUDA kernel launch with 196 blocks of 256 threads
>>             Copy output data from the CUDA device to the host memory
>>                 Done
>>
>>             The result said that the vectorAdd function was not a
>>             valid kernel. To see what happend,
>>             I compiled the program to LLVM IR. Only the IR of
>>             function vectorAdd is shown below.
>>
>>             $ clang -S -emit-llvm -std=cuda
>>              -I/usr/local/cuda-5.0/include -o vectorAdd1.ll vectorAdd1.cu
>>
>>             //==================== code begin ==========================
>>             define void @_Z9vectorAddPKfS0_Pfi(float* %A, float* %B,
>>             float* %C, i32 %numElements) #0 {
>>               %1 = alloca float*, align 8
>>               %2 = alloca float*, align 8
>>               %3 = alloca float*, align 8
>>               %4 = alloca i32, align 4
>>               store float* %A, float** %1, align 8
>>               store float* %B, float** %2, align 8
>>               store float* %C, float** %3, align 8
>>               store i32 %numElements, i32* %4, align 4
>>               %5 = bitcast float** %1 to i8*
>>               %6 = call i32 @cudaSetupArgument(i8* %5, i64 ptrtoint
>>             (i1** getelementptr (i1** null, i32 1) to i64), i64 0)
>>               %7 = icmp eq i32 %6, 0
>>               br i1 %7, label %8, label %22
>>
>>             ; <label>:8       ; preds = %0
>>               %9 = bitcast float** %2 to i8*
>>               %10 = call i32 @cudaSetupArgument(i8* %9, i64 ptrtoint
>>             (i1** getelementptr (i1** null, i32 1) to i64), i64
>>             ptrtoint (float** getelementptr ({ float*, float*,
>>             float*, i32 }* null, i64 0, i32 1) to i64))
>>               %11 = icmp eq i32 %10, 0
>>               br i1 %11, label %12, label %22
>>
>>             ; <label>:12        ; preds = %8
>>               %13 = bitcast float** %3 to i8*
>>               %14 = call i32 @cudaSetupArgument(i8* %13, i64 ptrtoint
>>             (i1** getelementptr (i1** null, i32 1) to i64), i64
>>             ptrtoint (float** getelementptr ({ float*, float*,
>>             float*, i32 }* null, i64 0, i32 2) to i64))
>>               %15 = icmp eq i32 %14, 0
>>               br i1 %15, label %16, label %22
>>
>>             ; <label>:16        ; preds = %12
>>               %17 = bitcast i32* %4 to i8*
>>               %18 = call i32 @cudaSetupArgument(i8* %17, i64 ptrtoint
>>             (i32* getelementptr (i32* null, i32 1) to i64), i64
>>             ptrtoint (i32* getelementptr ({ float*, float*, float*,
>>             i32 }* null, i64 0, i32 3) to i64))
>>               %19 = icmp eq i32 %18, 0
>>               br i1 %19, label %20, label %22
>>
>>             ; <label>:20        ; preds = %16
>>               %21 = call i32 @cudaLaunch(i8* bitcast (void (float*,
>>             float*, float*, i32)* @_Z9vectorAddPKfS0_Pfi to i8*))
>>               br label %22
>>
>>             ; <label>:22        ; preds = %20, %16, %12, %8, %0
>>               ret void
>>             }
>>
>>             //======================= code end =====================
>>
>>             From the emitted IR code, I can see NO addrspace
>>             declarations, which should be specific to
>>             NVVM IR. So this may be the problem.
>>
>>             Could anyone tell me how to fix this?
>>
>>             Any help is appreciated!
>>
>>                            suxing
>>
>>         Hi Suxing,
>>
>>         Clang cannot yet compile CUDA out-of-the-box like that.
>>         Definitely not mixed CUDA code (where host and device code
>>         are in the same file). Clang can be made, with some effort,
>>         to compile stand-alone device code, but some critical steps
>>         are missing. For example, you have to map threadIdx and other
>>         similar special globals to appropriate intrinsic calls
>>         (@llvm.nvvm.read.ptx.sreg.tid.*), and not just declare them.
>>
>>         Eli
>>
>>
>>
>
>
>
>     _______________________________________________
>     cfe-dev mailing list
>     cfe-dev at cs.uiuc.edu <mailto:cfe-dev at cs.uiuc.edu>
>     http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev
>
>
>
>
> -- 
> Thank You.
> Madhur D. Amilkanthwar
> RISE lab,
> IIT Madras.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20140814/787537a7/attachment.html>