<div dir="ltr"><div>Now THAT answers my question. Thank you very much, Simeon! (I was hoping there would be a simpler answer, of course...)<br></div><div><br></div><div>Geoff</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Nov 23, 2020 at 2:26 PM Simeon Ehrig <<a href="mailto:s.ehrig@hzdr.de">s.ehrig@hzdr.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
  
    
  
  <div>
    <p>Hi,</p>
    Let me give you a little overview of how the CUDA mode works in
    Cling.<br>
    <br>
    The workflow of the compiler pipeline is:
    <p>1. compile the device code to nvptx<br>
      2. wrap the nvptx code in fatbin<br>
      3. write the fatbin code to a specific file where the host's LLVM
      IR generator can find it<br>
        - it is more of a workaround because the API of the LLVM IR
      generator does not allow you to pass the fatbin code directly as a
      string or use a virtual file<br>
      4. generate the LLVM IR code of the host<br>
        - During generation, the fatbin code is integrated as a text
      segment into the LLVM IR code of the host<br>
        - the LLVM IR generator emits some CUDA library calls depending
      on the device code, e.g. the registration of a kernel<br>
        - this is often done via global init and deinit functions, which
      are executed before and after the main function - attention, I had
      some problems here in the past because I forgot to call them<br>
        - there was also a problem with non-unique init functions, see
      here:
<a href="https://github.com/root-project/cling/pull/233/commits/818faeff0ed86c9730334634f6b58241eae1bf12" target="_blank">https://github.com/root-project/cling/pull/233/commits/818faeff0ed86c9730334634f6b58241eae1bf12</a><br>
      5. generate the x86 machine code<br>
      6. execute</p>
    <p>Here is a list of interesting code sections where I solved these
      problems. The full compiler pipeline for the device code is fully
      integrated into Cling and does not depend on the CUDA SDK.</p>
    <p>1. We have two Cling::Interpreter instances. One for the host
      side and one for the device side. The Cling::Interpreter class is
      a central concept in Cling. [1]<br>
       - a Cling::Interpreter instance contains different components for
      compiling the source code to machine code<br>
       - in our case it is interesting that <br>
         - the Cling::Interpreter has a Cling::IncrementalParser, which
      contains the Clang::CompilerInstance that generates the LLVM IR
      module [2]<br>
         - the Clang::CompilerInstance is set up with the same arguments
      as the Clang driver [3][4]<br>
         - Cling::Interpreter does the machine code generation for the
      host side<br>
         - the host interpreter has an IncrementalCUDADeviceCompiler
      object, which contains the Cling::Interpreter instance for the
      device code [5]<br>
         - the parse function first calls the device-side parse before
      the host parses the code [6]<br>
      2. The IncrementalCUDADeviceCompiler performs all the steps to
      compile the source code into a fatbin and inject it into the LLVM
      IR generator of the host code.<br>
         - first, the LLVM IR code is generated with the
      Clang::CompilerInstance [7]<br>
         - after that, it is compiled to NVPTX <br>
           - I added my own machine code generator instead of changing
      the code generator of Cling::Interpreter - it is a workaround and
      needs to be refactored [8]<br>
         - then the code is wrapped in a fatbin [9]<br>
          - originally I used the tool from Nvidia, but then Hal Finkel
      reimplemented the tool in llvm-project-cxxjit [10]<br>
         - the fatbin code is written to a file where the CodeGen of the
      host can find it [11][12][13]<br>
    </p>
    <p>[1]
<a href="https://github.com/root-project/cling/blob/master/include/cling/Interpreter/Interpreter.h" target="_blank">https://github.com/root-project/cling/blob/master/include/cling/Interpreter/Interpreter.h</a><br>
      [2]
<a href="https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/Interpreter.cpp#L732" target="_blank">https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/Interpreter.cpp#L732</a><br>
      [3]
<a href="https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/CIFactory.cpp#L1302" target="_blank">https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/CIFactory.cpp#L1302</a><br>
      [4]
<a href="https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L55" target="_blank">https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L55</a><br>
      [5]
<a href="https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/include/cling/Interpreter/Interpreter.h#L214" target="_blank">https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/include/cling/Interpreter/Interpreter.h#L214</a><br>
      [6]
<a href="https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/Interpreter.cpp#L833" target="_blank">https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/Interpreter.cpp#L833</a><br>
      [7]
<a href="https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L267" target="_blank">https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L267</a><br>
      [8]
<a href="https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L286" target="_blank">https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L286</a><br>
      [9]
<a href="https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L321" target="_blank">https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L321</a><br>
      [10] <a href="https://github.com/hfinkel/llvm-project-cxxjit" target="_blank">https://github.com/hfinkel/llvm-project-cxxjit</a><br>
      [11]
<a href="https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L402" target="_blank">https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L402</a><br>
      [12]
<a href="https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/Interpreter.cpp#L240" target="_blank">https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/Interpreter.cpp#L240</a><br>
      [13]
<a href="https://github.com/llvm/llvm-project/blob/101309fe048e66873cfd972c47c4b7e7f2b99f41/clang/lib/CodeGen/CGCUDANV.cpp#L534" target="_blank">https://github.com/llvm/llvm-project/blob/101309fe048e66873cfd972c47c4b7e7f2b99f41/clang/lib/CodeGen/CGCUDANV.cpp#L534</a><br>
    </p>
    <div>On 11/23/20 11:34 AM, Stefan Gränitz
      wrote:<br>
    </div>
    <blockquote type="cite">
      
      <blockquote type="cite">My impression is that he actually uses
        nvcc to compile the CUDA kernels, not clang</blockquote>
      The constructor here looks very much like the CUDA command line
      options are added to a clang::CompilerInstance. I might be wrong,
      but you could try to follow the trace and see where it ends up:<br>
      <br>
      <a href="https://github.com/root-project/cling/blob/master/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp" target="_blank">https://github.com/root-project/cling/blob/master/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp</a><br>
      <br>
      Disclaimer: I am not familiar with the details of Simeon's work or
      cling or even with JITing CUDA :) Maybe Simeon can confirm or deny
      my guess.<br>
    </blockquote>
    That is correct. I create a clang::CompilerInstance to compile the
    device code to nvptx.<br>
    <blockquote type="cite"> <br>
      <br>
      On 22/11/2020 09:09, Vassil Vassilev wrote:<br>
      <blockquote type="cite">
        Adding Simeon in the loop for Cling and CUDA. </blockquote>
      Thanks, hi Simeon!<br>
      <br>
      <br>
      <div>On 22/11/2020 09:22, Geoff Levner
        wrote:<br>
      </div>
      <blockquote type="cite">
        
        <div dir="auto">
          <div>Hi, Stefan.
            <div dir="auto"><br>
            </div>
            <div dir="auto">Yes, when compiling from the command line,
              clang does all the work for you transparently. But behind
              the scenes it performs two passes: one to compile source
              code for the host, and one to compile CUDA kernels. </div>
            <div dir="auto"><br>
            </div>
            <div dir="auto">When compiling in memory, as far as I can
              tell, you have to perform those two passes yourself. And
              the CUDA pass produces a Module that is incompatible with
              the host Module. You cannot simply add it to the JIT. I
              don't know what to do with it. </div>
            <div dir="auto"><br>
            </div>
            <div dir="auto">And yes, I did watch Simeon's presentation,
              but he didn't get into that level of detail (or if he did,
              I missed it). My impression is that he actually uses nvcc
              to compile the CUDA kernels, not clang, using his own
              parser to separate and adapt the source code... </div>
            <div dir="auto"><br>
            </div>
            <div dir="auto">Thanks, </div>
            <div dir="auto">Geoff </div>
            <br>
            <br>
            <div class="gmail_quote">
              <div dir="ltr" class="gmail_attr">Le dim. 22 nov. 2020 à
                01:03, Stefan Gränitz <<a href="mailto:stefan.graenitz@gmail.com" rel="noreferrer" target="_blank">stefan.graenitz@gmail.com</a>>
                a écrit :<br>
              </div>
              <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                <div> Hi Geoff<br>
                  <br>
                  It looks like clang does that altogether: <a href="https://llvm.org/docs/CompileCudaWithLLVM.html" rel="noreferrer noreferrer" target="_blank">https://llvm.org/docs/CompileCudaWithLLVM.html</a><br>
                  <br>
                  And, probably related: CUDA support has been added to
                  Cling and there was a presentation for it at the last
                  Dev Meeting <a href="https://www.youtube.com/watch?v=XjjZRhiFDVs" rel="noreferrer noreferrer" target="_blank">https://www.youtube.com/watch?v=XjjZRhiFDVs</a><br>
                  <br>
                  Best,<br>
                  Stefan<br>
                  <br>
                  <div>On 20/11/2020 12:09, Geoff Levner via llvm-dev
                    wrote:<br>
                  </div>
                  <blockquote type="cite">
                    <div dir="ltr">
                      <div>Thanks for that, Valentin.</div>
                      <div><br>
                      </div>
                      <div>To be sure I understand what you are
                        saying... Assume we are talking about a single
                        .cu file containing both a C++ function and a
                        CUDA kernel that it invokes, using
                        <<<>>> syntax. Are you
                        suggesting that we bypass clang altogether and
                        use the Nvidia API to compile and install the
                        CUDA kernel? If we do that, how will the
                        JIT-compiled C++ function find the kernel?</div>
                      <div><br>
                      </div>
                      <div>Geoff<br>
                      </div>
                    </div>
                    <br>
                    <div class="gmail_quote">
                      <div dir="ltr" class="gmail_attr">On Thu, Nov 19,
                        2020 at 6:34 PM Valentin Churavy <<a href="mailto:v.churavy@gmail.com" rel="noreferrer noreferrer" target="_blank">v.churavy@gmail.com</a>>
                        wrote:<br>
                      </div>
                      <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                        <div dir="ltr">
                          <div>Sounds right now like you are emitting an
                            LLVM module?<br>
                          </div>
                          <div>The best strategy is probably to emit a
                            PTX module and then pass that to the CUDA
                            driver. This is what we do on the Julia
                            side in CUDA.jl.</div>
                          <div><br>
                          </div>
                          <div>Nvidia has a somewhat helpful tutorial on
                            this at <a href="https://github.com/NVIDIA/cuda-samples/blob/c4e2869a2becb4b6d9ce5f64914406bf5e239662/Samples/vectorAdd_nvrtc/vectorAdd.cpp" rel="noreferrer noreferrer" target="_blank">https://github.com/NVIDIA/cuda-samples/blob/c4e2869a2becb4b6d9ce5f64914406bf5e239662/Samples/vectorAdd_nvrtc/vectorAdd.cpp</a></div>
                          <div>and <a href="https://github.com/NVIDIA/cuda-samples/blob/c4e2869a2becb4b6d9ce5f64914406bf5e239662/Samples/simpleDrvRuntime/simpleDrvRuntime.cpp" rel="noreferrer noreferrer" target="_blank">https://github.com/NVIDIA/cuda-samples/blob/c4e2869a2becb4b6d9ce5f64914406bf5e239662/Samples/simpleDrvRuntime/simpleDrvRuntime.cpp</a></div>
                          <div><br>
                          </div>
                          <div>Hope that helps.</div>
                          <div>-V<br>
                          </div>
                          <div><br>
                          </div>
                        </div>
                        <br>
                        <div class="gmail_quote">
                          <div dir="ltr" class="gmail_attr">On Thu, Nov
                            19, 2020 at 12:11 PM Geoff Levner via
                            llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" rel="noreferrer noreferrer" target="_blank">llvm-dev@lists.llvm.org</a>>
                            wrote:<br>
                          </div>
                          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                            <div dir="ltr">
                              <div>I have made a bit of progress... When
                                compiling CUDA source code in memory,
                                the Compilation instance returned by
                                Driver::BuildCompilation() contains two
                                clang Commands: one for the host and one
                                for the CUDA device. I can execute both
                                commands using EmitLLVMOnlyActions. I
                                add the Module from the host compilation
                                to my JIT as usual, but... what to do
                                with the Module from the device
                                compilation? If I just add it to the
                                JIT, I get an error message like this:</div>
                              <div><br>
                              </div>
                              <div>    Added modules have incompatible
                                data layouts:
                                e-i64:64-i128:128-v16:16-v32:32-n16:32:64
                                (module) vs
e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128
                                (jit)</div>
                              <div><br>
                              </div>
                              <div>Any suggestions as to what to do with
                                the Module containing CUDA kernel code,
                                so that the host Module can invoke it?</div>
                              <div><br>
                              </div>
                              <div>Geoff<br>
                              </div>
                              <br>
                              <div class="gmail_quote">
                                <div dir="ltr" class="gmail_attr">On
                                  Tue, Nov 17, 2020 at 6:39 PM Geoff
                                  Levner <<a href="mailto:glevner@gmail.com" rel="noreferrer noreferrer" target="_blank">glevner@gmail.com</a>>
                                  wrote:<br>
                                </div>
                                <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                  <div dir="ltr">
                                    <div>We have an application that
                                      allows the user to compile and
                                      execute C++ code on the fly, using
                                      Orc JIT v2, via the LLJIT class.
                                      And we would like to extend it to
                                      allow the user to provide CUDA
                                      source code as well, for GPU
                                      programming. But I am having a
                                      hard time figuring out how to do
                                      it.</div>
                                    <div><br>
                                    </div>
                                    <div>To JIT compile C++ code, we do
                                      basically as follows:</div>
                                    <div><br>
                                    </div>
                                    <div>1. call
                                      Driver::BuildCompilation(), which
                                      returns a clang Command to execute</div>
                                    <div>2. create a CompilerInvocation
                                      using the arguments from the
                                      Command</div>
                                    <div>3. create a CompilerInstance
                                      around the CompilerInvocation</div>
                                    <div>4. use the CompilerInstance to
                                      execute an EmitLLVMOnlyAction</div>
                                    <div>5. retrieve the resulting
                                      Module from the action and add it
                                      to the JIT</div>
                                    <div><br>
                                    </div>
                                    <div>But to compile C++ requires
                                      only a single clang command. When
                                      you add CUDA to the equation, you
                                      add several other steps. If you
                                      use the clang front end to
                                      compile, clang does the following:</div>
                                    <div><br>
                                    </div>
                                    <div>1. compiles the device source
                                      code to PTX<br>
                                    </div>
                                    <div>2. compiles the resulting PTX
                                      code using the CUDA ptxas command<br>
                                    </div>
                                    <div>3. builds a "fat binary" using
                                      the CUDA fatbinary command</div>
                                    <div>4. compiles the host source
                                      code and links in the fat binary</div>
                                    <div><br>
                                    </div>
                                    <div>So my question is: how do we
                                      replicate that process in memory,
                                      to generate modules that we can
                                      add to our JIT?</div>
                                    <div><br>
                                    </div>
                                    <div>I am no CUDA expert, and not
                                      much of a clang expert either, so
                                      if anyone out there can point me
                                      in the right direction, I would be
                                      grateful.</div>
                                    <div><br>
                                    </div>
                                    <div>Geoff</div>
                                    <div><br>
                                    </div>
                                  </div>
                                </blockquote>
                              </div>
                            </div>
_______________________________________________<br>
                            LLVM Developers mailing list<br>
                            <a href="mailto:llvm-dev@lists.llvm.org" rel="noreferrer noreferrer" target="_blank">llvm-dev@lists.llvm.org</a><br>
                            <a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer noreferrer noreferrer" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>
                          </blockquote>
                        </div>
                      </blockquote>
                    </div>
                    <br>
                  </blockquote>
                  <pre cols="72">-- 
<a href="https://flowcrypt.com/pub/stefan.graenitz@gmail.com" rel="noreferrer noreferrer" target="_blank">https://flowcrypt.com/pub/stefan.graenitz@gmail.com</a></pre>
                </div>
              </blockquote>
            </div>
          </div>
        </div>
      </blockquote>
    </blockquote>
    Cheers,<br>
    Simeon<br>
    <pre cols="72">-- 
Simeon Ehrig
Institute of Radiation Physics 
Helmholtz-Zentrum Dresden - Rossendorf (HZDR)
Bautzner Landstr. 400 | 01328 Dresden | Germany
Phone: +49 351 260 2974
<a href="http://www.hzdr.de" target="_blank">http://www.hzdr.de</a>
Board of Directors: Prof. Dr. Sebastian M. Schmidt, Dr. Heike Wolke
Company Registration Number VR 1693, Amtsgericht Dresden</pre>
  </div>

</blockquote></div>