[cfe-dev] compile a simple CUDA program using clang-3.4

Madhur Amilkanthwar madhur13490 at gmail.com
Wed Aug 13 10:00:21 PDT 2014


I too struggled quite a bit with compiling CUDA C, using Clang 3.2.
By the way, Xing, why do you want to compile CUDA C with Clang?


On Wed, Aug 13, 2014 at 8:25 PM, Eli Bendersky <eliben at google.com> wrote:

>
> On Tue, Aug 12, 2014 at 6:38 PM, Xing Su <suxing1989 at gmail.com> wrote:
>
>>  Thanks Eli,
>>
>> It sounds like CUDA support in Clang is, at the moment, far from ready
>> for production use ...
>>
>> I'd like to know the status of CUDA support in Clang, but I could not
>> find anything that documents it. Are you one of the developers of this
>> part, or could you give me some guidance?
>>
>>
> There's no documentation of these parts of Clang, as far as I know,
> besides the source code itself. To get a feel for what's supported, take
> a look at the existing tests (specifically the test/SemaCUDA and
> test/CodeGenCUDA directories).
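>
> Assuming a configured build tree, those suites can be run directly with
> llvm-lit; the exact paths below are illustrative and may differ on your
> setup:
>
>   $ bin/llvm-lit -v tools/clang/test/SemaCUDA tools/clang/test/CodeGenCUDA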
>
> Eli
>
>>
>>                                 suxing
>>
>>
>> On 2014/8/13 1:18, Eli Bendersky wrote:
>>
>> On Tue, Aug 12, 2014 at 10:07 AM, Xing Su <suxing1989 at gmail.com> wrote:
>>
>>> hi everyone!
>>>
>>> I tried to compile a CUDA program using clang-3.4. The program is taken
>>> from the NVIDIA_CUDA-5.0_samples collection; it is a very simple program
>>> that adds two vectors.
>>>
>>> I made a few modifications to the original code:
>>> 1. I substituted the __global__ CUDA C keyword with
>>>    __attribute__((global)) so that clang would accept the code.
>>> 2. I added <stdlib.h> and <math.h>.
>>> 3. I added declarations of blockDim, blockIdx, and threadIdx.
>>>
>>> // ==================== code begin ========================
>>> /**
>>>  * Vector addition: C = A + B.
>>>  *
>>>  * This sample is a very basic sample that implements element-by-element
>>>  * vector addition. It is the same as the sample illustrating Chapter 2
>>>  * of the programming guide, with some additions like error checking.
>>>  */
>>>
>>> #include <stdio.h>
>>>
>>> // For the CUDA runtime routines (prefixed with "cuda_")
>>> #include <cuda_runtime.h>
>>>
>>> #include <malloc.h>
>>> #include <stdlib.h>
>>> #include <math.h>
>>> extern dim3 blockDim, blockIdx, threadIdx;
>>>
>>> /**
>>>  * CUDA Kernel Device code
>>>  *
>>>  * Computes the vector addition of A and B into C. The 3 vectors have
>>> the same
>>>  * number of elements numElements.
>>>  */
>>> __attribute__((global)) void
>>> vectorAdd(const float *A, const float *B, float *C, int numElements)
>>> {
>>>     int i = blockDim.x * blockIdx.x + threadIdx.x;
>>>
>>>     if (i < numElements)
>>>     {
>>>         C[i] = A[i] + B[i];
>>>     }
>>> }
>>>
>>> int main(void)
>>> {
>>>     ... ...
>>>     return 0;
>>> }
>>>
>>> //==================== code end ========================
>>>
>>> $ clang -std=cuda -I/usr/local/cuda-5.0/include -o vectorAdd1
>>> vectorAdd1.cu -L/usr/local/cuda-5.0/lib64 -lcudart
>>>
>>> Compilation succeeded, but running the program gives:
>>>
>>>     [Vector addition of 50000 elements]
>>>     Copy input data from the host memory to the CUDA device
>>>     CUDA kernel launch with 196 blocks of 256 threads
>>>     Failed to launch vectorAdd kernel (error code invalid device
>>> function)!
>>>
>>> while the expected output is
>>>
>>>     [Vector addition of 50000 elements]
>>>     Copy input data from the host memory to the CUDA device
>>>     CUDA kernel launch with 196 blocks of 256 threads
>>>     Copy output data from the CUDA device to the host memory
>>>     Done
>>>
>>> The error says that the vectorAdd function is not a valid kernel. To
>>> see what happened, I compiled the program to LLVM IR. Only the IR for
>>> the vectorAdd function is shown below.
>>>
>>> $ clang -S -emit-llvm -std=cuda  -I/usr/local/cuda-5.0/include -o
>>> vectorAdd1.ll vectorAdd1.cu
>>>
>>> //==================== code begin ==========================
>>> define void @_Z9vectorAddPKfS0_Pfi(float* %A, float* %B, float* %C, i32
>>> %numElements) #0 {
>>>   %1 = alloca float*, align 8
>>>   %2 = alloca float*, align 8
>>>   %3 = alloca float*, align 8
>>>   %4 = alloca i32, align 4
>>>   store float* %A, float** %1, align 8
>>>   store float* %B, float** %2, align 8
>>>   store float* %C, float** %3, align 8
>>>   store i32 %numElements, i32* %4, align 4
>>>   %5 = bitcast float** %1 to i8*
>>>   %6 = call i32 @cudaSetupArgument(i8* %5, i64 ptrtoint (i1**
>>> getelementptr (i1** null, i32 1) to i64), i64 0)
>>>   %7 = icmp eq i32 %6, 0
>>>   br i1 %7, label %8, label %22
>>>
>>> ; <label>:8                                       ; preds = %0
>>>   %9 = bitcast float** %2 to i8*
>>>   %10 = call i32 @cudaSetupArgument(i8* %9, i64 ptrtoint (i1**
>>> getelementptr (i1** null, i32 1) to i64), i64 ptrtoint (float**
>>> getelementptr ({ float*, float*, float*, i32 }* null, i64 0, i32 1) to i64))
>>>   %11 = icmp eq i32 %10, 0
>>>   br i1 %11, label %12, label %22
>>>
>>> ; <label>:12                                      ; preds = %8
>>>   %13 = bitcast float** %3 to i8*
>>>   %14 = call i32 @cudaSetupArgument(i8* %13, i64 ptrtoint (i1**
>>> getelementptr (i1** null, i32 1) to i64), i64 ptrtoint (float**
>>> getelementptr ({ float*, float*, float*, i32 }* null, i64 0, i32 2) to i64))
>>>   %15 = icmp eq i32 %14, 0
>>>   br i1 %15, label %16, label %22
>>>
>>> ; <label>:16                                      ; preds = %12
>>>   %17 = bitcast i32* %4 to i8*
>>>   %18 = call i32 @cudaSetupArgument(i8* %17, i64 ptrtoint (i32*
>>> getelementptr (i32* null, i32 1) to i64), i64 ptrtoint (i32* getelementptr
>>> ({ float*, float*, float*, i32 }* null, i64 0, i32 3) to i64))
>>>   %19 = icmp eq i32 %18, 0
>>>   br i1 %19, label %20, label %22
>>>
>>> ; <label>:20                                      ; preds = %16
>>>   %21 = call i32 @cudaLaunch(i8* bitcast (void (float*, float*, float*,
>>> i32)* @_Z9vectorAddPKfS0_Pfi to i8*))
>>>   br label %22
>>>
>>> ; <label>:22                                      ; preds = %20, %16,
>>> %12, %8, %0
>>>   ret void
>>> }
>>>
>>> //======================= code end =====================
>>>
>>> The emitted IR contains NO addrspace annotations, which are specific to
>>> NVVM IR. So this may be the problem.
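>>>
>>> For comparison, device-memory objects in NVVM IR carry explicit address
>>> spaces. A small sketch (based on the NVVM IR spec; the names here are
>>> made up):
>>>
>>> //==================== sketch begin ==========================
>>> ; NVPTX numbering: addrspace(1) = global, addrspace(3) = shared,
>>> ; addrspace(4) = constant
>>> @dev_scale = addrspace(1) global float 1.000000e+00
>>> ; shared memory is left uninitialized
>>> @smem_buf = internal addrspace(3) global [256 x float] undef
>>> //======================= sketch end =====================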
>>>
>>> Could anyone tell me how to fix this?
>>>
>>> Any help is appreciated!
>>>
>>>                                    suxing
>>>
>>  Hi Suxing,
>>
>>  Clang cannot yet compile CUDA out of the box like that, and definitely
>> not mixed CUDA code (where host and device code live in the same file).
>> Clang can be made, with some effort, to compile stand-alone device code,
>> but some critical steps are missing. For example, you have to map
>> threadIdx and other similar special globals to the appropriate intrinsic
>> calls (@llvm.nvvm.read.ptx.sreg.tid.*), not just declare them.
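>>
>>  As an illustration only, hand-written NVVM IR for a kernel like yours
>> would look roughly like this (an untested sketch; the intrinsic names and
>> the nvvm.annotations "kernel" metadata follow the NVVM/NVPTX conventions):
>>
>> //==================== sketch begin ==========================
>> target triple = "nvptx64-nvidia-cuda"
>>
>> define void @vectorAdd(float* %A, float* %B, float* %C, i32 %n) {
>>   ; i = blockDim.x * blockIdx.x + threadIdx.x, via the sreg intrinsics
>>   %ntid = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
>>   %ctaid = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
>>   %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
>>   %base = mul i32 %ntid, %ctaid
>>   %i = add i32 %base, %tid
>>   %in = icmp slt i32 %i, %n
>>   br i1 %in, label %body, label %done
>>
>> body:
>>   %pa = getelementptr float* %A, i32 %i
>>   %pb = getelementptr float* %B, i32 %i
>>   %pc = getelementptr float* %C, i32 %i
>>   %a = load float* %pa
>>   %b = load float* %pb
>>   %s = fadd float %a, %b
>>   store float %s, float* %pc
>>   br label %done
>>
>> done:
>>   ret void
>> }
>>
>> declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
>> declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
>> declare i32 @llvm.nvvm.read.ptx.sreg.tid.x()
>>
>> ; marks vectorAdd as a kernel entry point for the NVPTX backend
>> !nvvm.annotations = !{!0}
>> !0 = metadata !{void (float*, float*, float*, i32)* @vectorAdd, metadata !"kernel", i32 1}
>> //======================= sketch end =====================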
>>
>>  Eli
>>
>>
>


-- 
Thank You.
Madhur D. Amilkanthwar
RISE lab,
IIT Madras.