<html>

  <head>

    <meta content="text/html; charset=UTF-8" http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    Ok, I'll check the test suite then. Thanks for your help Eli :-)<br>

    <br>

    I'm trying to manually fix the generated IR file and use libnvvm to

    compile it to device code.<br>

    <br>

    <div class="moz-cite-prefix">On 08/13/2014 10:55 PM, Eli Bendersky

      wrote:<br>

    </div>

    <blockquote

cite="mid:CACLQwhFLVN3pXW=sWPw1-y-rjdabP2Ea_XWdjogEw9xqh2kiBQ@mail.gmail.com"

      type="cite">

      <div dir="ltr"><br>

        <div class="gmail_extra"><br>

          <br>

          <div class="gmail_quote">On Tue, Aug 12, 2014 at 6:38 PM, Xing

            Su <span dir="ltr"><<a moz-do-not-send="true"

                href="mailto:suxing1989@gmail.com" target="_blank">suxing1989@gmail.com</a>></span>

            wrote:<br>

            <blockquote class="gmail_quote" style="margin:0 0 0

              .8ex;border-left:1px #ccc solid;padding-left:1ex">

              <div bgcolor="#FFFFFF" text="#000000"> Thanks Eli,<br>

                <br>

                It sounds that at the moment support to CUDA in Clang is

                far from<br>

                production use ...<br>

                <br>

                I'd like to know what the status of CUDA support is in

                clang,<br>

                but I am not able to find anything reporting this. <br>

                Are you a developer of this part, or could you give me

                some<br>

                guidance?<br>

                <br>

              </div>

            </blockquote>

            <div><br>

            </div>

            <div>There's no documentation of these parts of Clang, as

              far as I know, besides the source code. To get a feel for

              what's supported take a look at the existing tests

              (specifically test/SemaCUDA and test/CodeGenCUDA dirs).</div>

            <div><br>

            </div>

            <div>Eli</div>

            <div><br>

            </div>

            <div><br>

            </div>

            <div> </div>

            <blockquote class="gmail_quote" style="margin:0 0 0

              .8ex;border-left:1px #ccc solid;padding-left:1ex">

              <div bgcolor="#FFFFFF" text="#000000"> <br>

                                                suxing

                <div>

                  <div class="h5"><br>

                    <br>

                    <div>On 2014/8/13 1:18, Eli Bendersky wrote:<br>

                    </div>

                    <blockquote type="cite">

                      <div dir="ltr"><br>

                        <div class="gmail_extra"><br>

                          <br>

                          <div class="gmail_quote">On Tue, Aug 12, 2014

                            at 10:07 AM, Xing Su <span dir="ltr"><<a

                                moz-do-not-send="true"

                                href="mailto:suxing1989@gmail.com"

                                target="_blank">suxing1989@gmail.com</a>></span>

                            wrote:<br>

                            <blockquote class="gmail_quote"

                              style="margin:0px 0px 0px

0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

                              <p dir="ltr">hi everyone!</p>

                              <p dir="ltr">I tried to compile a CUDA

                                program using clang-3.4. This program is

                                taken<br>

                                from NVIDIA_CUDA-5.0_samples collection

                                and it's a very simple program adding<br>

                                two vectors.</p>

                              <p dir="ltr">A few modifications to the

                                original code were done, including<br>

                                1. I substituted the __global__ CUDA C

                                keyword with __attribute__((global))<br>

                                    in order to use clang as the

                                compiler.<br>

                                2. <stdlib.h> <math.h> were

                                added.<br>

                                3. declarations of blockDim, blockIdx,

                                threadIdx were added.</p>

                              <p dir="ltr">// ==================== code

                                begin ========================<br>

                                /**</p>

                              <p dir="ltr"> * Vector addition: C = A +

                                B.<br>

                                 *<br>

                                 * This sample is a very basic sample

                                that implements element by element<br>

                                 * vector addition. It is the same as

                                the sample illustrating Chapter 2<br>

                                 * of the programming guide with some

                                additions like error checking.<br>

                                 */</p>

                              <p dir="ltr">#include <stdio.h></p>

                              <p dir="ltr">// For the CUDA runtime

                                routines (prefixed with "cuda_")<br>

                                #include <cuda_runtime.h></p>

                              <p dir="ltr">#include <malloc.h><br>

                                #include <stdlib.h><br>

                                #include <math.h><br>

                                extern dim3 blockDim, blockIdx,

                                threadIdx;</p>

                              <p dir="ltr">/**<br>

                                 * CUDA Kernel Device code<br>

                                 *<br>

                                 * Computes the vector addition of A and

                                B into C. The 3 vectors have the same<br>

                                 * number of elements numElements.<br>

                                 */<br>

                                __attribute__((global)) void<br>

                                vectorAdd(const float *A, const float

                                *B, float *C, int numElements)<br>

                                {<br>

                                    int i = blockDim.x * blockIdx.x +

                                threadIdx.x;</p>

                              <p dir="ltr">    if (i < numElements)<br>

                                    {<br>

                                        C[i] = A[i] + B[i];<br>

                                    }<br>

                                }</p>

                              <p dir="ltr">int main(void)<br>

                                {<br>

                                    ... ...<br>

                                    return 0;<br>

                                }</p>

                              <p dir="ltr">//==================== code

                                end ========================</p>

                              <p dir="ltr">$ clang -std=cuda

                                -I/usr/local/cuda-5.0/include -o

                                vectorAdd1 vectorAdd1.cu

                                -L/usr/local/cuda-5.0/lib64 -lcudart</p>

                              <p dir="ltr">The compiling was successful,

                                but running the program gives:</p>

                              <p dir="ltr">    [Vector addition of 50000

                                elements]<br>

                                    Copy input data from the host memory

                                to the CUDA device<br>

                                    CUDA kernel launch with 196 blocks

                                of 256 threads<br>

                                    Failed to launch vectorAdd kernel

                                (error code invalid device function )!</p>

                              <p dir="ltr">while the expected output is</p>

                              <p dir="ltr">    [Vector addition of 50000

                                elements]<br>

                                Copy input data from the host memory to

                                the CUDA device<br>

                                CUDA kernel launch with 196 blocks of

                                256 threads<br>

                                Copy output data from the CUDA device to

                                the host memory<br>

                                    Done</p>

                              <p dir="ltr">The result said that the

                                vectorAdd function was not a valid

                                kernel. To see what happend,<br>

                                I compiled the program to LLVM IR. Only

                                the IR of function vectorAdd is shown

                                below.</p>

                              <p dir="ltr">$ clang -S -emit-llvm

                                -std=cuda  -I/usr/local/cuda-5.0/include

                                -o vectorAdd1.ll vectorAdd1.cu</p>

                              <p dir="ltr">//==================== code

                                begin ==========================<br>

                                define void

                                @_Z9vectorAddPKfS0_Pfi(float* %A, float*

                                %B, float* %C, i32 %numElements) #0 {<br>

                                  %1 = alloca float*, align 8<br>

                                  %2 = alloca float*, align 8<br>

                                  %3 = alloca float*, align 8<br>

                                  %4 = alloca i32, align 4<br>

                                  store float* %A, float** %1, align 8<br>

                                  store float* %B, float** %2, align 8<br>

                                  store float* %C, float** %3, align 8<br>

                                  store i32 %numElements, i32* %4, align

                                4<br>

                                  %5 = bitcast float** %1 to i8*<br>

                                  %6 = call i32 @cudaSetupArgument(i8*

                                %5, i64 ptrtoint (i1** getelementptr

                                (i1** null, i32 1) to i64), i64 0)<br>

                                  %7 = icmp eq i32 %6, 0<br>

                                  br i1 %7, label %8, label %22</p>

                              <p dir="ltr">; <label>:8            

                                                          ; preds = %0<br>

                                  %9 = bitcast float** %2 to i8*<br>

                                  %10 = call i32 @cudaSetupArgument(i8*

                                %9, i64 ptrtoint (i1** getelementptr

                                (i1** null, i32 1) to i64), i64 ptrtoint

                                (float** getelementptr ({ float*,

                                float*, float*, i32 }* null, i64 0, i32

                                1) to i64))<br>

                                  %11 = icmp eq i32 %10, 0<br>

                                  br i1 %11, label %12, label %22</p>

                              <p dir="ltr">; <label>:12          

                                                           ; preds = %8<br>

                                  %13 = bitcast float** %3 to i8*<br>

                                  %14 = call i32 @cudaSetupArgument(i8*

                                %13, i64 ptrtoint (i1** getelementptr

                                (i1** null, i32 1) to i64), i64 ptrtoint

                                (float** getelementptr ({ float*,

                                float*, float*, i32 }* null, i64 0, i32

                                2) to i64))<br>

                                  %15 = icmp eq i32 %14, 0<br>

                                  br i1 %15, label %16, label %22</p>

                              <p dir="ltr">; <label>:16          

                                                           ; preds = %12<br>

                                  %17 = bitcast i32* %4 to i8*<br>

                                  %18 = call i32 @cudaSetupArgument(i8*

                                %17, i64 ptrtoint (i32* getelementptr

                                (i32* null, i32 1) to i64), i64 ptrtoint

                                (i32* getelementptr ({ float*, float*,

                                float*, i32 }* null, i64 0, i32 3) to

                                i64))<br>

                                  %19 = icmp eq i32 %18, 0<br>

                                  br i1 %19, label %20, label %22</p>

                              <p dir="ltr">; <label>:20          

                                                           ; preds = %16<br>

                                  %21 = call i32 @cudaLaunch(i8* bitcast

                                (void (float*, float*, float*, i32)*

                                @_Z9vectorAddPKfS0_Pfi to i8*))<br>

                                  br label %22</p>

                              <p dir="ltr">; <label>:22          

                                                           ; preds =

                                %20, %16, %12, %8, %0<br>

                                  ret void<br>

                                }</p>

                              <p dir="ltr">//=======================

                                code end =====================<br>

                              </p>

                              <p dir="ltr">From the emitted IR code, I

                                can see NO addrspace declarations, which

                                should be specific to<br>

                                NVVM IR. So this may be the problem.</p>

                              <p dir="ltr">Could anyone tell me how to

                                fix this?</p>

                              <p dir="ltr">Any help is appreciated!</p>

                              <p dir="ltr">                             

                                     suxing<br>

                              </p>

                            </blockquote>

                          </div>

                          Hi Suxing,</div>

                        <div class="gmail_extra"><br>

                        </div>

                        <div class="gmail_extra">Clang cannot yet

                          compile CUDA out-of-the-box like that.

                          Definitely not mixed CUDA code (where host and

                          device code are in the same file). Clang can

                          be made, with some effort, to compile

                          stand-alone device code, but some critical

                          steps are missing. For example, you have to

                          map threadIdx and other similar special

                          globals to appropriate intrinsic calls

                          (@llvm.nvvm.read.ptx.sreg.tid.*), and not just

                          declare them. </div>

                        <div class="gmail_extra"><br>

                        </div>

                        <div class="gmail_extra">Eli</div>

                        <div class="gmail_extra"><br>

                        </div>

                        <div class="gmail_extra"><br>

                        </div>

                        <div class="gmail_extra"><br>

                        </div>

                      </div>

                    </blockquote>

                    <br>

                  </div>

                </div>

              </div>

            </blockquote>

          </div>

          <br>

        </div>

      </div>

    </blockquote>

    <br>

  </body>

</html>