<html>
  <head>
    <meta content="text/html; charset=UTF-8" http-equiv="Content-Type">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    Hi Madhur, I want to do some research on heterogeneous programming
    model.<br>
    What about you? Perhaps we can find something in common :-)<br>
    <br>
    <br>
    Xing<br>
    <br>
    <br>
    <div class="moz-cite-prefix">On 08/14/2014 01:00 AM, Madhur
      Amilkanthwar wrote:<br>
    </div>
    <blockquote
cite="mid:CAMTh1gUJETT++Z2eKi5gtN1EMDTjbW9FB90b8d=OaPUwDHUMNg@mail.gmail.com"
      type="cite">
      <div dir="ltr">
        <div>I too suffered a lot for compiling CUDA C with Clang 3.2. <br>
        </div>
        By the way, Xing why do you want to compile CUDA C with Clang?<br>
      </div>
      <div class="gmail_extra"><br>
        <br>
        <div class="gmail_quote">On Wed, Aug 13, 2014 at 8:25 PM, Eli
          Bendersky <span dir="ltr"><<a moz-do-not-send="true"
              href="mailto:eliben@google.com" target="_blank">eliben@google.com</a>></span>
          wrote:<br>
          <blockquote class="gmail_quote" style="margin:0 0 0
            .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div dir="ltr"><br>
              <div class="gmail_extra"><br>
                <br>
                <div class="gmail_quote">
                  <div class="">On Tue, Aug 12, 2014 at 6:38 PM, Xing Su
                    <span dir="ltr"><<a moz-do-not-send="true"
                        href="mailto:suxing1989@gmail.com"
                        target="_blank">suxing1989@gmail.com</a>></span>
                    wrote:<br>
                    <blockquote class="gmail_quote" style="margin:0 0 0
                      .8ex;border-left:1px #ccc solid;padding-left:1ex">
                      <div bgcolor="#FFFFFF" text="#000000"> Thanks Eli,<br>
                        <br>
                        It sounds that at the moment support to CUDA in
                        Clang is far from<br>
                        production use ...<br>
                        <br>
                        I'd like to know what the status of CUDA support
                        is in clang,<br>
                        but I am not able to find anything reporting
                        this. <br>
                        Are you a developer of this part, or could you
                        give me some<br>
                        guidance?<br>
                        <br>
                      </div>
                    </blockquote>
                    <div><br>
                    </div>
                  </div>
                  <div>There's no documentation of these parts of Clang,
                    as far as I know, besides the source code. To get a
                    feel for what's supported take a look at the
                    existing tests (specifically test/SemaCUDA and
                    test/CodeGenCUDA dirs).</div>
                  <span class="HOEnZb"><font color="#888888">
                      <div><br>
                      </div>
                      <div>Eli</div>
                    </font></span>
                  <div>
                    <div class="h5">
                      <div><br>
                      </div>
                      <div><br>
                      </div>
                      <div> </div>
                      <blockquote class="gmail_quote" style="margin:0 0
                        0 .8ex;border-left:1px #ccc
                        solid;padding-left:1ex">
                        <div bgcolor="#FFFFFF" text="#000000"> <br>
                                                          suxing
                          <div>
                            <div><br>
                              <br>
                              <div>On 2014/8/13 1:18, Eli Bendersky
                                wrote:<br>
                              </div>
                              <blockquote type="cite">
                                <div dir="ltr"><br>
                                  <div class="gmail_extra"><br>
                                    <br>
                                    <div class="gmail_quote">On Tue, Aug
                                      12, 2014 at 10:07 AM, Xing Su <span
                                        dir="ltr"><<a
                                          moz-do-not-send="true"
                                          href="mailto:suxing1989@gmail.com"
                                          target="_blank">suxing1989@gmail.com</a>></span>
                                      wrote:<br>
                                      <blockquote class="gmail_quote"
                                        style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
                                        <p dir="ltr">hi everyone!</p>
                                        <p dir="ltr">I tried to compile
                                          a CUDA program using
                                          clang-3.4. This program is
                                          taken<br>
                                          from NVIDIA_CUDA-5.0_samples
                                          collection and it's a very
                                          simple program adding<br>
                                          two vectors.</p>
                                        <p dir="ltr">A few modifications
                                          to the original code were
                                          done, including<br>
                                          1. I substituted the
                                          __global__ CUDA C keyword with
                                          __attribute__((global))<br>
                                              in order to use clang as
                                          the compiler.<br>
                                          2. <stdlib.h>
                                          <math.h> were added.<br>
                                          3. declarations of blockDim,
                                          blockIdx, threadIdx were
                                          added.</p>
                                        <p dir="ltr">//
                                          ==================== code
                                          begin ========================<br>
                                          /**</p>
                                        <p dir="ltr"> * Vector addition:
                                          C = A + B.<br>
                                           *<br>
                                           * This sample is a very basic
                                          sample that implements element
                                          by element<br>
                                           * vector addition. It is the
                                          same as the sample
                                          illustrating Chapter 2<br>
                                           * of the programming guide
                                          with some additions like error
                                          checking.<br>
                                           */</p>
                                        <p dir="ltr">#include
                                          <stdio.h></p>
                                        <p dir="ltr">// For the CUDA
                                          runtime routines (prefixed
                                          with "cuda_")<br>
                                          #include
                                          <cuda_runtime.h></p>
                                        <p dir="ltr">#include
                                          <malloc.h><br>
                                          #include <stdlib.h><br>
                                          #include <math.h><br>
                                          extern dim3 blockDim,
                                          blockIdx, threadIdx;</p>
                                        <p dir="ltr">/**<br>
                                           * CUDA Kernel Device code<br>
                                           *<br>
                                           * Computes the vector
                                          addition of A and B into C.
                                          The 3 vectors have the same<br>
                                           * number of elements
                                          numElements.<br>
                                           */<br>
                                          __attribute__((global)) void<br>
                                          vectorAdd(const float *A,
                                          const float *B, float *C, int
                                          numElements)<br>
                                          {<br>
                                              int i = blockDim.x *
                                          blockIdx.x + threadIdx.x;</p>
                                        <p dir="ltr">    if (i <
                                          numElements)<br>
                                              {<br>
                                                  C[i] = A[i] + B[i];<br>
                                              }<br>
                                          }</p>
                                        <p dir="ltr">int main(void)<br>
                                          {<br>
                                              ... ...<br>
                                              return 0;<br>
                                          }</p>
                                        <p dir="ltr">//====================
                                          code end
                                          ========================</p>
                                        <p dir="ltr">$ clang -std=cuda
                                          -I/usr/local/cuda-5.0/include
                                          -o vectorAdd1 vectorAdd1.cu
                                          -L/usr/local/cuda-5.0/lib64
                                          -lcudart</p>
                                        <p dir="ltr">The compiling was
                                          successful, but running the
                                          program gives:</p>
                                        <p dir="ltr">    [Vector
                                          addition of 50000 elements]<br>
                                              Copy input data from the
                                          host memory to the CUDA device<br>
                                              CUDA kernel launch with
                                          196 blocks of 256 threads<br>
                                              Failed to launch vectorAdd
                                          kernel (error code invalid
                                          device function )!</p>
                                        <p dir="ltr">while the expected
                                          output is</p>
                                        <p dir="ltr">    [Vector
                                          addition of 50000 elements]<br>
                                          Copy input data from the host
                                          memory to the CUDA device<br>
                                          CUDA kernel launch with 196
                                          blocks of 256 threads<br>
                                          Copy output data from the CUDA
                                          device to the host memory<br>
                                              Done</p>
                                        <p dir="ltr">The result said
                                          that the vectorAdd function
                                          was not a valid kernel. To see
                                          what happend,<br>
                                          I compiled the program to LLVM
                                          IR. Only the IR of function
                                          vectorAdd is shown below.</p>
                                        <p dir="ltr">$ clang -S
                                          -emit-llvm -std=cuda
                                           -I/usr/local/cuda-5.0/include
                                          -o vectorAdd1.ll vectorAdd1.cu</p>
                                        <p dir="ltr">//====================
                                          code begin
                                          ==========================<br>
                                          define void
                                          @_Z9vectorAddPKfS0_Pfi(float*
                                          %A, float* %B, float* %C, i32
                                          %numElements) #0 {<br>
                                            %1 = alloca float*, align 8<br>
                                            %2 = alloca float*, align 8<br>
                                            %3 = alloca float*, align 8<br>
                                            %4 = alloca i32, align 4<br>
                                            store float* %A, float** %1,
                                          align 8<br>
                                            store float* %B, float** %2,
                                          align 8<br>
                                            store float* %C, float** %3,
                                          align 8<br>
                                            store i32 %numElements, i32*
                                          %4, align 4<br>
                                            %5 = bitcast float** %1 to
                                          i8*<br>
                                            %6 = call i32
                                          @cudaSetupArgument(i8* %5, i64
                                          ptrtoint (i1** getelementptr
                                          (i1** null, i32 1) to i64),
                                          i64 0)<br>
                                            %7 = icmp eq i32 %6, 0<br>
                                            br i1 %7, label %8, label
                                          %22</p>
                                        <p dir="ltr">; <label>:8  
                                                                       
                                                ; preds = %0<br>
                                            %9 = bitcast float** %2 to
                                          i8*<br>
                                            %10 = call i32
                                          @cudaSetupArgument(i8* %9, i64
                                          ptrtoint (i1** getelementptr
                                          (i1** null, i32 1) to i64),
                                          i64 ptrtoint (float**
                                          getelementptr ({ float*,
                                          float*, float*, i32 }* null,
                                          i64 0, i32 1) to i64))<br>
                                            %11 = icmp eq i32 %10, 0<br>
                                            br i1 %11, label %12, label
                                          %22</p>
                                        <p dir="ltr">; <label>:12
                                                                       
                                                 ; preds = %8<br>
                                            %13 = bitcast float** %3 to
                                          i8*<br>
                                            %14 = call i32
                                          @cudaSetupArgument(i8* %13,
                                          i64 ptrtoint (i1**
                                          getelementptr (i1** null, i32
                                          1) to i64), i64 ptrtoint
                                          (float** getelementptr ({
                                          float*, float*, float*, i32 }*
                                          null, i64 0, i32 2) to i64))<br>
                                            %15 = icmp eq i32 %14, 0<br>
                                            br i1 %15, label %16, label
                                          %22</p>
                                        <p dir="ltr">; <label>:16
                                                                       
                                                 ; preds = %12<br>
                                            %17 = bitcast i32* %4 to i8*<br>
                                            %18 = call i32
                                          @cudaSetupArgument(i8* %17,
                                          i64 ptrtoint (i32*
                                          getelementptr (i32* null, i32
                                          1) to i64), i64 ptrtoint (i32*
                                          getelementptr ({ float*,
                                          float*, float*, i32 }* null,
                                          i64 0, i32 3) to i64))<br>
                                            %19 = icmp eq i32 %18, 0<br>
                                            br i1 %19, label %20, label
                                          %22</p>
                                        <p dir="ltr">; <label>:20
                                                                       
                                                 ; preds = %16<br>
                                            %21 = call i32
                                          @cudaLaunch(i8* bitcast (void
                                          (float*, float*, float*, i32)*
                                          @_Z9vectorAddPKfS0_Pfi to
                                          i8*))<br>
                                            br label %22</p>
                                        <p dir="ltr">; <label>:22
                                                                       
                                                 ; preds = %20, %16,
                                          %12, %8, %0<br>
                                            ret void<br>
                                          }</p>
                                        <p dir="ltr">//=======================
                                          code end =====================<br>
                                        </p>
                                        <p dir="ltr">From the emitted IR
                                          code, I can see NO addrspace
                                          declarations, which should be
                                          specific to<br>
                                          NVVM IR. So this may be the
                                          problem.</p>
                                        <p dir="ltr">Could anyone tell
                                          me how to fix this?</p>
                                        <p dir="ltr">Any help is
                                          appreciated!</p>
                                        <p dir="ltr">                   
                                                         suxing<br>
                                        </p>
                                      </blockquote>
                                    </div>
                                    Hi Suxing,</div>
                                  <div class="gmail_extra"><br>
                                  </div>
                                  <div class="gmail_extra">Clang cannot
                                    yet compile CUDA out-of-the-box like
                                    that. Definitely not mixed CUDA code
                                    (where host and device code are in
                                    the same file). Clang can be made,
                                    with some effort, to compile
                                    stand-alone device code, but some
                                    critical steps are missing. For
                                    example, you have to map threadIdx
                                    and other similar special globals to
                                    appropriate intrinsic calls
                                    (@llvm.nvvm.read.ptx.sreg.tid.*),
                                    and not just declare them. </div>
                                  <div class="gmail_extra"><br>
                                  </div>
                                  <div class="gmail_extra">Eli</div>
                                  <div class="gmail_extra"><br>
                                  </div>
                                  <div class="gmail_extra"><br>
                                  </div>
                                  <div class="gmail_extra"><br>
                                  </div>
                                </div>
                              </blockquote>
                              <br>
                            </div>
                          </div>
                        </div>
                      </blockquote>
                    </div>
                  </div>
                </div>
                <br>
              </div>
            </div>
            <br>
            _______________________________________________<br>
            cfe-dev mailing list<br>
            <a moz-do-not-send="true" href="mailto:cfe-dev@cs.uiuc.edu">cfe-dev@cs.uiuc.edu</a><br>
            <a moz-do-not-send="true"
              href="http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev"
              target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/cfe-dev</a><br>
            <br>
          </blockquote>
        </div>
        <br>
        <br clear="all">
        <br>
        -- <br>
        Thank You.<br>
        Madhur D. Amilkanthwar<br>
        RISE lab,<br>
        IIT Madras.<br>
      </div>
    </blockquote>
    <br>
  </body>
</html>