<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>You're welcome. Unfortunately, the CUDA support is not easy. It
took me months to figure out how to compile CUDA code, and months
more to figure out how to make this workflow interactive. If you
have any further questions, just ask. I'm sure I've forgotten
some details that can be showstoppers ;-)</p>
<p>Cheers,<br>
Simeon<br>
</p>
<div class="moz-cite-prefix">On 11/23/20 2:38 PM, Geoff Levner
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAHMBa1tKcshaHu_-XX-=kwCX-oLue6JCMoZ1vSZ=L=TSHZ=rSw@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">
<div>Now THAT answers my question. Thank you very much, Simeon!
(I was hoping there would be a simpler answer, of course...)<br>
</div>
<div><br>
</div>
<div>Geoff</div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Mon, Nov 23, 2020 at 2:26
PM Simeon Ehrig <<a href="mailto:s.ehrig@hzdr.de"
moz-do-not-send="true">s.ehrig@hzdr.de</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Hi,</p>
Let me give you a little overview of how the CUDA mode works
in Cling.<br>
<br>
The workflow of the compiler pipeline is:
<p>1. compile the device code to NVPTX<br>
2. wrap the NVPTX code in a fatbin<br>
3. write the fatbin code to a specific file where the
host's LLVM IR generator can find it (see the sketch
after this list)<br>
- this is more of a workaround, because the API of the
LLVM IR generator does not allow you to pass the fatbin
code directly as a string or to use a virtual file<br>
4. generate the LLVM IR code of the host<br>
- during generation, the fatbin code is embedded as a
text segment in the LLVM IR code of the host<br>
- the LLVM IR generator emits some CUDA library calls
depending on the device code, e.g. the registration of a
kernel<br>
- often this is done via the global init and deinit
functions, which are executed before and after the main
function - attention, I had some problems here in the past
because I forgot to call them<br>
- there was also a problem with non-unique init
functions, see here:
<a
href="https://github.com/root-project/cling/pull/233/commits/818faeff0ed86c9730334634f6b58241eae1bf12"
target="_blank" moz-do-not-send="true">https://github.com/root-project/cling/pull/233/commits/818faeff0ed86c9730334634f6b58241eae1bf12</a><br>
5. generate the x86 machine code<br>
6. execute</p>
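<p>To make steps 3 and 4 a bit more concrete, here is a minimal
sketch of the mechanism: the host-side CodeGen is pointed at the
fatbin file via a CodeGen option, and CodeGen (see [13] below)
embeds the file and emits the registration calls. The function
name and path are made up for illustration; this is not the
exact Cling code.</p>
<pre>#include "clang/Frontend/CompilerInstance.h"

// Sketch only, not the actual Cling code: point the host-side CodeGen at
// the fatbin file written in step 3. CodeGen then embeds the file as a
// text segment and emits the kernel registration calls (step 4).
void pointHostCodeGenAtFatbin(clang::CompilerInstance &amp;CI) {
  // Made-up path for illustration; Cling uses its own temporary file.
  CI.getCodeGenOpts().CudaGpuBinaryFileName = "/tmp/cling-demo.fatbin";
  // Equivalent cc1 flag: -fcuda-include-gpubinary /tmp/cling-demo.fatbin
}</pre>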
<p>Here is a list of interesting code sections where I have
solved these problems. The full compiler pipeline for the
device code is fully integrated into Cling and does not
depend on the CUDA SDK.</p>
<p>1. We have two Cling::Interpreter instances: one for the
host side and one for the device side. The
Cling::Interpreter class is a central concept in Cling
[1].<br>
- a Cling::Interpreter instance contains different
components for compiling the source code to machine code<br>
- in our case it is interesting that <br>
- the Cling::Interpreter has a
Cling::IncrementalParser, which contains the
Clang::CompilerInstance that generates the LLVM IR module
[2]<br>
- the Clang::CompilerInstance is set up with the same
arguments as the Clang driver [3][4]<br>
- Cling::Interpreter does the machine code generation
for the host side<br>
- has an IncrementalCUDADeviceCompiler object which
contains the Cling::Interpreter instance for the device
code [5]<br>
- the parse function first calls the parse function of
the device side before the host parses the code [6] (see
the sketch after this list)<br>
2. The IncrementalCUDADeviceCompiler performs all the
steps to compile the source code to a fatbin and to
inject it into the LLVM IR generator of the host.<br>
- first, the LLVM IR code is generated with the
Clang::CompilerInstance [7]<br>
- after that, it is compiled to NVPTX<br>
- I added my own machine code generator instead of
changing the code generator of Cling::Interpreter - it is
a workaround and needs to be refactored [8]<br>
- then the code is wrapped in a fatbin [9]<br>
- originally I used the tool from Nvidia, but then Hal
Finkel reimplemented the tool in llvm-project-cxxjit [10]<br>
- the fatbin code is written to a file where the
CodeGen of the host can find it [11][12][13]<br>
</p>
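<p>A simplified sketch of that delegation - the class and member
names here are stand-ins for illustration; the real code lives in
Interpreter.h and IncrementalCUDADeviceCompiler.h, see [1] and
[5]:</p>
<pre>#include &lt;memory&gt;
#include &lt;string&gt;

// Stand-in for IncrementalCUDADeviceCompiler, see [7]-[13].
struct DeviceCompiler {
  void process(const std::string &amp;input) { /* PTX -&gt; fatbin -&gt; file */ }
};
// Stand-in for the host-side IncrementalParser, see [2].
struct HostParser {
  void compile(const std::string &amp;input) { /* host IR with embedded fatbin */ }
};

struct Interpreter {
  std::unique_ptr&lt;DeviceCompiler&gt; cudaCompiler; // only set in CUDA mode [5]
  std::unique_ptr&lt;HostParser&gt; incrParser;

  void process(const std::string &amp;input) {
    if (cudaCompiler)
      cudaCompiler-&gt;process(input); // device side runs first [6]
    incrParser-&gt;compile(input);     // then the host side parses the code
  }
};</pre>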
<p>[1]
<a
href="https://github.com/root-project/cling/blob/master/include/cling/Interpreter/Interpreter.h"
target="_blank" moz-do-not-send="true">https://github.com/root-project/cling/blob/master/include/cling/Interpreter/Interpreter.h</a><br>
[2]
<a
href="https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/Interpreter.cpp#L732"
target="_blank" moz-do-not-send="true">https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/Interpreter.cpp#L732</a><br>
[3]
<a
href="https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/CIFactory.cpp#L1302"
target="_blank" moz-do-not-send="true">https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/CIFactory.cpp#L1302</a><br>
[4]
<a
href="https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L55"
target="_blank" moz-do-not-send="true">https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L55</a><br>
[5]
<a
href="https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/include/cling/Interpreter/Interpreter.h#L214"
target="_blank" moz-do-not-send="true">https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/include/cling/Interpreter/Interpreter.h#L214</a><br>
[6]
<a
href="https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/Interpreter.cpp#L833"
target="_blank" moz-do-not-send="true">https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/Interpreter.cpp#L833</a><br>
[7]
<a
href="https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L267"
target="_blank" moz-do-not-send="true">https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L267</a><br>
[8]
<a
href="https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L286"
target="_blank" moz-do-not-send="true">https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L286</a><br>
[9]
<a
href="https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L321"
target="_blank" moz-do-not-send="true">https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L321</a><br>
[10] <a
href="https://github.com/hfinkel/llvm-project-cxxjit"
target="_blank" moz-do-not-send="true">https://github.com/hfinkel/llvm-project-cxxjit</a><br>
[11]
<a
href="https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L402"
target="_blank" moz-do-not-send="true">https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L402</a><br>
[12]
<a
href="https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/Interpreter.cpp#L240"
target="_blank" moz-do-not-send="true">https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/Interpreter.cpp#L240</a><br>
[13]
<a
href="https://github.com/llvm/llvm-project/blob/101309fe048e66873cfd972c47c4b7e7f2b99f41/clang/lib/CodeGen/CGCUDANV.cpp#L534"
target="_blank" moz-do-not-send="true">https://github.com/llvm/llvm-project/blob/101309fe048e66873cfd972c47c4b7e7f2b99f41/clang/lib/CodeGen/CGCUDANV.cpp#L534</a><br>
</p>
<div>On 11/23/20 11:34 AM, Stefan Gränitz wrote:<br>
</div>
<blockquote type="cite">
<blockquote type="cite">My impression is that he actually
uses nvcc to compile the CUDA kernels, not clang</blockquote>
The constructor here looks very much like the CUDA command
line options are added to a clang::CompilerInstance. I
might be wrong, but you could try to follow the trace and
see where it ends up:<br>
<br>
<a
href="https://github.com/root-project/cling/blob/master/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp"
target="_blank" moz-do-not-send="true">https://github.com/root-project/cling/blob/master/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp</a><br>
<br>
Disclaimer: I am not familiar with the details of Simeon's
work or Cling or even with JITing CUDA :) Maybe Simeon can
confirm or deny my guess.<br>
</blockquote>
That is correct. I create a clang::CompilerInstance to
compile to NVPTX.<br>
<blockquote type="cite"> <br>
<br>
On 22/11/2020 09:09, Vassil Vassilev wrote:<br>
<blockquote type="cite"> Adding Simeon in the loop for
Cling and CUDA. </blockquote>
Thanks, hi Simeon!<br>
<br>
<br>
<div>On 22/11/2020 09:22, Geoff Levner wrote:<br>
</div>
<blockquote type="cite">
<div dir="auto">
<div>Hi, Stefan.
<div dir="auto"><br>
</div>
<div dir="auto">Yes, when compiling from the command
line, clang does all the work for you
transparently. But behind the scenes it performs
two passes: one to compile source code for the
host, and one to compile CUDA kernels. </div>
<div dir="auto"><br>
</div>
<div dir="auto">When compiling in memory, as far as
I can tell, you have to perform those two passes
yourself. And the CUDA pass produces a Module that
is incompatible with the host Module. You cannot
simply add it to the JIT. I don't know what to do
with it. </div>
<div dir="auto"><br>
</div>
<div dir="auto">And yes, I did watch Simeon's
presentation, but he didn't get into that level of
detail (or if he did, I missed it). My impression
is that he actually uses nvcc to compile the CUDA
kernels, not clang, using his own parser to
separate and adapt the source code... </div>
<div dir="auto"><br>
</div>
<div dir="auto">Thanks, </div>
<div dir="auto">Geoff </div>
<br>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">Le dim. 22 nov.
2020 à 01:03, Stefan Gränitz <<a
href="mailto:stefan.graenitz@gmail.com"
rel="noreferrer" target="_blank"
moz-do-not-send="true">stefan.graenitz@gmail.com</a>>
a écrit :<br>
</div>
<blockquote class="gmail_quote" style="margin:0px
0px 0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div> Hi Geoff<br>
<br>
It looks like clang does that altogether: <a
href="https://llvm.org/docs/CompileCudaWithLLVM.html" rel="noreferrer
noreferrer" target="_blank"
moz-do-not-send="true">https://llvm.org/docs/CompileCudaWithLLVM.html</a><br>
<br>
And, probably related: CUDA support has been
added to Cling and there was a presentation
for it at the last Dev Meeting <a
href="https://www.youtube.com/watch?v=XjjZRhiFDVs"
rel="noreferrer noreferrer" target="_blank"
moz-do-not-send="true">https://www.youtube.com/watch?v=XjjZRhiFDVs</a><br>
<br>
Best,<br>
Stefan<br>
<br>
<div>On 20/11/2020 12:09, Geoff Levner via
llvm-dev wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>Thanks for that, Valentin.</div>
<div><br>
</div>
<div>To be sure I understand what you are
saying... Assume we are talking about a
single .cu file containing both a C++
function and a CUDA kernel that it
invokes, using <<<>>>
syntax. Are you suggesting that we
bypass clang altogether and use the
Nvidia API to compile and install the
CUDA kernel? If we do that, how will the
JIT-compiled C++ function find the
kernel?</div>
<div><br>
</div>
<div>Geoff<br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu,
Nov 19, 2020 at 6:34 PM Valentin Churavy
<<a href="mailto:v.churavy@gmail.com"
rel="noreferrer noreferrer"
target="_blank" moz-do-not-send="true">v.churavy@gmail.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div>Sounds right now like you are
emitting an LLVM module?<br>
</div>
<div>The best strategy is probably to
emit a PTX module and then
pass that to the CUDA driver. This
is what we do on the Julia side in
CUDA.jl.</div>
<div><br>
</div>
<div>Nvidia has a somewhat helpful
tutorial on this at <a
href="https://github.com/NVIDIA/cuda-samples/blob/c4e2869a2becb4b6d9ce5f64914406bf5e239662/Samples/vectorAdd_nvrtc/vectorAdd.cpp"
rel="noreferrer noreferrer"
target="_blank"
moz-do-not-send="true">https://github.com/NVIDIA/cuda-samples/blob/c4e2869a2becb4b6d9ce5f64914406bf5e239662/Samples/vectorAdd_nvrtc/vectorAdd.cpp</a></div>
<div>and <a
href="https://github.com/NVIDIA/cuda-samples/blob/c4e2869a2becb4b6d9ce5f64914406bf5e239662/Samples/simpleDrvRuntime/simpleDrvRuntime.cpp"
rel="noreferrer noreferrer"
target="_blank"
moz-do-not-send="true">https://github.com/NVIDIA/cuda-samples/blob/c4e2869a2becb4b6d9ce5f64914406bf5e239662/Samples/simpleDrvRuntime/simpleDrvRuntime.cpp</a></div>
<div><br>
</div>
<div>Hope that helps.</div>
<div>-V<br>
</div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On
Thu, Nov 19, 2020 at 12:11 PM Geoff
Levner via llvm-dev <<a
href="mailto:llvm-dev@lists.llvm.org"
rel="noreferrer noreferrer"
target="_blank"
moz-do-not-send="true">llvm-dev@lists.llvm.org</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div>I have made a bit of
progress... When compiling CUDA
source code in memory, the
Compilation instance returned by
Driver::BuildCompilation()
contains two clang Commands: one
for the host and one for the
CUDA device. I can execute both
commands using
EmitLLVMOnlyActions. I add the
Module from the host compilation
to my JIT as usual, but... what
to do with the Module from the
device compilation? If I just
add it to the JIT, I get an
error message like this:</div>
<div><br>
</div>
<div> Added modules have
incompatible data layouts:
e-i64:64-i128:128-v16:16-v32:32-n16:32:64
(module) vs
e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128
(jit)</div>
<div><br>
</div>
<div>Any suggestions as to what to
do with the Module containing
CUDA kernel code, so that the
host Module can invoke it?</div>
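<div><br>
</div>
<div>The layout in that message is the NVPTX data layout,
i.e. the device Module targets nvptx64 rather than the host.
One option, in line with Valentin's suggestion above, is to
skip the JIT for that Module and run it through the NVPTX
backend to get PTX for the CUDA driver. A rough sketch against
LLVM 11-era APIs (error handling omitted; sm_70 is an
arbitrary choice):</div>
<pre>#include "llvm/ADT/SmallString.h"
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/TargetRegistry.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/Target/TargetMachine.h"

// Compile the device Module to PTX text instead of adding it to the JIT.
std::string emitPtx(llvm::Module &amp;M) {
  // Assumes the NVPTX target was registered, e.g. via InitializeAllTargets(),
  // InitializeAllTargetMCs() and InitializeAllAsmPrinters().
  std::string err;
  const std::string triple = "nvptx64-nvidia-cuda";
  const llvm::Target *T = llvm::TargetRegistry::lookupTarget(triple, err);
  std::unique_ptr&lt;llvm::TargetMachine&gt; TM(
      T-&gt;createTargetMachine(triple, "sm_70", "", llvm::TargetOptions(),
                             llvm::Reloc::PIC_));
  M.setDataLayout(TM-&gt;createDataLayout()); // the device layout, by design
  llvm::SmallString&lt;0&gt; ptx;
  llvm::raw_svector_ostream os(ptx);
  llvm::legacy::PassManager pm;
  TM-&gt;addPassesToEmitFile(pm, os, nullptr, llvm::CGFT_AssemblyFile);
  pm.run(M);
  return ptx.str().str();
}</pre>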
<div><br>
</div>
<div>Geoff<br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr"
class="gmail_attr">On Tue, Nov
17, 2020 at 6:39 PM Geoff
Levner <<a
href="mailto:glevner@gmail.com"
rel="noreferrer noreferrer"
target="_blank"
moz-do-not-send="true">glevner@gmail.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div>We have an application
that allows the user to
compile and execute C++
code on the fly, using Orc
JIT v2, via the LLJIT
class. And we would like
to extend it to allow the
user to provide CUDA
source code as well, for
GPU programming. But I am
having a hard time
figuring out how to do it.</div>
<div><br>
</div>
<div>To JIT compile C++
code, we do basically as
follows:</div>
<div><br>
</div>
<div>1. call
Driver::BuildCompilation(),
which returns a clang
Command to execute</div>
<div>2. create a
CompilerInvocation using
the arguments from the
Command</div>
<div>3. create a
CompilerInstance around
the CompilerInvocation</div>
<div>4. use the
CompilerInstance to
execute an
EmitLLVMOnlyAction</div>
<div>5. retrieve the
resulting Module from the
action and add it to the
JIT</div>
<div><br>
</div>
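<div>A condensed sketch of those five steps for the plain C++
case (the function and variable names are illustrative, and
error handling is omitted):</div>
<pre>#include &lt;memory&gt;

#include "clang/CodeGen/CodeGenAction.h"
#include "clang/Driver/Compilation.h"
#include "clang/Driver/Driver.h"
#include "clang/Frontend/CompilerInstance.h"
#include "llvm/IR/Module.h"

// 1.-5. from above: driver -&gt; Command -&gt; CompilerInvocation -&gt;
// CompilerInstance -&gt; EmitLLVMOnlyAction -&gt; Module for the JIT.
std::unique_ptr&lt;llvm::Module&gt;
compileToModule(clang::driver::Driver &amp;D,
                llvm::ArrayRef&lt;const char *&gt; args,
                clang::DiagnosticsEngine &amp;diags, llvm::LLVMContext &amp;ctx) {
  std::unique_ptr&lt;clang::driver::Compilation&gt; C(D.BuildCompilation(args));
  const clang::driver::Command &amp;cmd = *C-&gt;getJobs().begin(); // 1.
  auto inv = std::make_shared&lt;clang::CompilerInvocation&gt;();
  clang::CompilerInvocation::CreateFromArgs(*inv, cmd.getArguments(),
                                            diags);                 // 2.
  clang::CompilerInstance CI;                                       // 3.
  CI.setInvocation(inv);
  CI.createDiagnostics();
  clang::EmitLLVMOnlyAction action(&amp;ctx);                           // 4.
  CI.ExecuteAction(action);
  return action.takeModule(); // 5. then LLJIT::addIRModule(...)
}</pre>
<div><br>
</div>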
<div>But compiling C++
requires only a single
clang command. When you
add CUDA to the equation,
you add several other
steps. If you use the
clang front end to
compile, clang does the
following:</div>
<div><br>
</div>
<div>1. compiles the device
source code to PTX<br>
</div>
<div>2. compiles the
resulting PTX code using
the CUDA ptxas command<br>
</div>
<div>3. builds a "fat
binary" using the CUDA
fatbinary command</div>
<div>4. compiles the host
source code and links in
the fat binary</div>
<div><br>
</div>
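<div>One way to see those four steps is to ask the driver for
its job list (the file name here is made up):</div>
<pre>#include &lt;memory&gt;

#include "clang/Driver/Compilation.h"
#include "clang/Driver/Driver.h"
#include "llvm/Support/Host.h"
#include "llvm/Support/raw_ostream.h"

// Print the sub-commands clang would run for a .cu file:
// clang (device), ptxas, fatbinary, clang (host).
void printCudaJobs(clang::DiagnosticsEngine &amp;diags) {
  clang::driver::Driver D("clang++", llvm::sys::getDefaultTargetTriple(),
                          diags);
  std::unique_ptr&lt;clang::driver::Compilation&gt; C(D.BuildCompilation(
      {"clang++", "-x", "cuda", "--cuda-gpu-arch=sm_70", "-c", "kernel.cu"}));
  for (const clang::driver::Command &amp;job : C-&gt;getJobs())
    llvm::errs() &lt;&lt; job.getCreator().getName() &lt;&lt; "\n";
}</pre>
<div><br>
</div>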
<div>So my question is: how
do we replicate that
process in memory, to
generate modules that we
can add to our JIT?</div>
<div><br>
</div>
<div>I am no CUDA expert,
and not much of a clang
expert either, so if
anyone out there can point
me in the right direction,
I would be grateful.</div>
<div><br>
</div>
<div>Geoff</div>
<div><br>
</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
<br>
</blockquote>
<pre cols="72">--
<a href="https://flowcrypt.com/pub/stefan.graenitz@gmail.com" rel="noreferrer noreferrer" target="_blank" moz-do-not-send="true">https://flowcrypt.com/pub/stefan.graenitz@gmail.com</a></pre>
</div>
</blockquote>
</div>
</div>
</div>
</blockquote>
<pre cols="72">--
<a href="https://flowcrypt.com/pub/stefan.graenitz@gmail.com" target="_blank" moz-do-not-send="true">https://flowcrypt.com/pub/stefan.graenitz@gmail.com</a></pre>
</blockquote>
Cheers,<br>
Simeon<br>
<pre cols="72">--
Simeon Ehrig
Institut für Strahlenphysik
Helmholtz-Zentrum Dresden - Rossendorf e.V. (HZDR)
Bautzner Landstr. 400 | 01328 Dresden | Deutschland
Tel: +49 (0) 351 260 2974
<a href="http://www.hzdr.de" target="_blank" moz-do-not-send="true">http://www.hzdr.de</a>
Vorstand: Prof. Dr. Sebastian M. Schmidt, Dr. Heike Wolke
Vereinsregister: VR 1693 beim Amtsgericht Dresden
Simeon Ehrig
Institute of Radiation Physics
Helmholtz-Zentrum Dresden - Rossendorf (HZDR)
Bautzner Landstr. 400 | 01328 Dresden | Germany
Phone: +49 351 260 2974
<a href="http://www.hzdr.de" target="_blank" moz-do-not-send="true">http://www.hzdr.de</a>
Board of Directors: Prof. Dr. Sebastian M. Schmidt, Dr. Heike Wolke
Company Registration Number VR 1693, Amtsgericht Dresden</pre>
</div>
</blockquote>
</div>
</blockquote>
<pre class="moz-signature" cols="72">--
Simeon Ehrig
Institut für Strahlenphysik
Helmholtz-Zentrum Dresden - Rossendorf e.V. (HZDR)
Bautzner Landstr. 400 | 01328 Dresden | Deutschland
Tel: +49 (0) 351 260 2974
<a class="moz-txt-link-freetext" href="http://www.hzdr.de">http://www.hzdr.de</a>
Vorstand: Prof. Dr. Sebastian M. Schmidt, Dr. Heike Wolke
Vereinsregister: VR 1693 beim Amtsgericht Dresden
Simeon Ehrig
Institute of Radiation Physics
Helmholtz-Zentrum Dresden - Rossendorf (HZDR)
Bautzner Landstr. 400 | 01328 Dresden | Germany
Phone: +49 351 260 2974
<a class="moz-txt-link-freetext" href="http://www.hzdr.de">http://www.hzdr.de</a>
Board of Directors: Prof. Dr. Sebastian M. Schmidt, Dr. Heike Wolke
Company Registration Number VR 1693, Amtsgericht Dresden</pre>
</body>
</html>