[llvm-dev] JIT compiling CUDA source code

Simeon Ehrig via llvm-dev llvm-dev at lists.llvm.org
Mon Nov 23 05:45:18 PST 2020


You're welcome. Unfortunately, CUDA support is not easy. It took me 
months to figure out how to compile CUDA code and months more to figure 
out how to make this workflow interactive. If you have any further 
questions, just ask. I'm sure I've forgotten some details that can 
be showstoppers ;-)

Cheers,
Simeon

On 11/23/20 2:38 PM, Geoff Levner wrote:
> Now THAT answers my question. Thank you very much, Simeon! (I was 
> hoping there would be a simpler answer, of course...)
>
> Geoff
>
>
> On Mon, Nov 23, 2020 at 2:26 PM Simeon Ehrig <s.ehrig at hzdr.de 
> <mailto:s.ehrig at hzdr.de>> wrote:
>
>     Hi,
>
>     Let me give you a little overview of how the CUDA mode works in Cling.
>
>     The workflow of the compiler pipeline is:
>
>     1. compile the device code to PTX with the NVPTX backend (see the
>     sketch after this list)
>     2. wrap the PTX code in a fatbin
>     3. write the fatbin code to a specific file where the host's LLVM
>     IR generator can find it
>       - it is more of a workaround because the API of the LLVM IR
>     generator does not allow you to pass the fatbin code directly as a
>     string or use a virtual file
>     4. generate the LLVM IR code of the host
>       - during generation, the fatbin code is embedded as a text
>     segment in the LLVM IR code of the host
>       - the LLVM IR generator also emits some CUDA runtime library
>     calls depending on the device code, e.g. the registration of a kernel
>       - this is usually done via the global init and deinit functions,
>     which are executed before and after the main function - attention:
>     I had problems in the past because I forgot to call them
>       - there was also a problem with non-unique init functions, see
>     here:
>     https://github.com/root-project/cling/pull/233/commits/818faeff0ed86c9730334634f6b58241eae1bf12
>     5. generate the x86 machine code
>     6. execute
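>
>     A very rough sketch of the PTX emission in step 1: lowering an
>     already generated device-side LLVM module to PTX text with a
>     separate TargetMachine. It needs an LLVM build with the NVPTX
>     backend; the triple, "sm_60" and the missing error handling are
>     placeholders, not the exact values Cling uses.
>
>     #include "llvm/ADT/SmallString.h"
>     #include "llvm/IR/LegacyPassManager.h"
>     #include "llvm/IR/Module.h"
>     #include "llvm/Support/TargetRegistry.h"
>     #include "llvm/Support/TargetSelect.h"
>     #include "llvm/Support/raw_ostream.h"
>     #include "llvm/Target/TargetMachine.h"
>     #include <memory>
>     #include <string>
>
>     std::string emitPTX(llvm::Module &DeviceM) {
>       llvm::InitializeAllTargets();
>       llvm::InitializeAllTargetMCs();
>       llvm::InitializeAllAsmPrinters();
>
>       const std::string Triple = "nvptx64-nvidia-cuda";
>       std::string Err;
>       const llvm::Target *T =
>           llvm::TargetRegistry::lookupTarget(Triple, Err);
>       std::unique_ptr<llvm::TargetMachine> TM(T->createTargetMachine(
>           Triple, /*CPU=*/"sm_60", /*Features=*/"",
>           llvm::TargetOptions(), llvm::Reloc::PIC_));
>       DeviceM.setTargetTriple(Triple);
>       DeviceM.setDataLayout(TM->createDataLayout());
>
>       // PTX is emitted as "assembly" for the NVPTX target.
>       llvm::SmallString<0> PTX;
>       llvm::raw_svector_ostream OS(PTX);
>       llvm::legacy::PassManager PM;
>       TM->addPassesToEmitFile(PM, OS, /*DwoOut=*/nullptr,
>                               llvm::CGFT_AssemblyFile);
>       PM.run(DeviceM);
>       return std::string(PTX.str());
>     }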
>
>     Here is a list of interesting code sections where I solved these
>     problems. The full compiler pipeline for the device code is fully
>     integrated into Cling and does not depend on the CUDA SDK.
>
>     1. We have two cling::Interpreter instances, one for the host side
>     and one for the device side. The cling::Interpreter class is a
>     central concept in Cling. [1]
>      - a cling::Interpreter instance contains the different components
>     for compiling the source code to machine code
>      - in our case it is interesting that
>        - the cling::Interpreter has a cling::IncrementalParser, which
>     contains the clang::CompilerInstance that generates the LLVM IR
>     module [2]
>        - the clang::CompilerInstance is set up with the same arguments
>     as the clang driver [3][4]
>        - the cling::Interpreter does the machine code generation for
>     the host side
>        - the host interpreter has an IncrementalCUDADeviceCompiler
>     object which contains the cling::Interpreter instance for the
>     device code [5]
>        - its parse function first calls the parse function of the
>     device interpreter before the host parses the code [6]
>     2. The IncrementalCUDADeviceCompiler runs all the steps that
>     compile the source code into the fatbin and inject it into the
>     LLVM IR generator of the host code.
>        - first, the LLVM IR code is generated with the
>     clang::CompilerInstance [7]
>        - after that, it is compiled to NVPTX
>          - I added my own machine code generator instead of changing
>     the code generator of cling::Interpreter - it is a workaround and
>     needs to be refactored [8]
>        - then the code is wrapped in a fatbin [9]
>          - originally I used the tool from Nvidia, but then Hal Finkel
>     reimplemented the tool in llvm-project-cxxjit [10]
>        - the fatbin code is written to a file where the CodeGen of the
>     host can find it [11][12][13] (see the sketch below)
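>
>     In plain clang, this handover is what the -fcuda-include-gpubinary
>     option does. In code it boils down to something like the sketch
>     below; if I remember correctly, CudaGpuBinaryFileName is the option
>     that the CodeGen in [13] reads, and the path is just an example:
>
>     #include "clang/Frontend/CompilerInstance.h"
>     #include <string>
>
>     // Point the host-side CompilerInstance at the fatbin file before
>     // running its IR generation.
>     void attachFatbin(clang::CompilerInstance &HostCI,
>                       const std::string &FatbinPath) {
>       HostCI.getCodeGenOpts().CudaGpuBinaryFileName = FatbinPath;
>       // During host IR generation, CodeGen embeds this file and emits
>       // the module ctor that registers the kernels with the CUDA
>       // runtime.
>     }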
>
>     [1]
>     https://github.com/root-project/cling/blob/master/include/cling/Interpreter/Interpreter.h
>     [2]
>     https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/Interpreter.cpp#L732
>     [3]
>     https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/CIFactory.cpp#L1302
>     [4]
>     https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L55
>     [5]
>     https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/include/cling/Interpreter/Interpreter.h#L214
>     [6]
>     https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/Interpreter.cpp#L833
>     [7]
>     https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L267
>     [8]
>     https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L286
>     [9]
>     https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L321
>     [10] https://github.com/hfinkel/llvm-project-cxxjit
>     [11]
>     https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L402
>     [12]
>     https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/Interpreter.cpp#L240
>     [13]
>     https://github.com/llvm/llvm-project/blob/101309fe048e66873cfd972c47c4b7e7f2b99f41/clang/lib/CodeGen/CGCUDANV.cpp#L534
>
>     On 11/23/20 11:34 AM, Stefan Gränitz wrote:
>>>     My impression is that he actually uses nvcc to compile the CUDA
>>>     kernels, not clang
>>     The constructor here looks very much like the CUDA command-line
>>     options are added to a clang::CompilerInstance. I might be wrong,
>>     but you could try to follow the trace and see where it ends up:
>>
>>     https://github.com/root-project/cling/blob/master/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp
>>
>>     Disclaimer: I am not familiar with the details of Simeon's work,
>>     or with cling, or even with JITing CUDA :) Maybe Simeon can
>>     confirm or deny my guess.
>     That is correct. I create a clang::CompilerInstance to compile the
>     device code to NVPTX.
>>
>>
>>     On 22/11/2020 09:09, Vassil Vassilev wrote:
>>>     Adding Simeon in the loop for Cling and CUDA. 
>>     Thanks, hi Simeon!
>>
>>
>>     On 22/11/2020 09:22, Geoff Levner wrote:
>>>     Hi, Stefan.
>>>
>>>     Yes, when compiling from the command line, clang does all the
>>>     work for you transparently. But behind the scenes it performs
>>>     two passes: one to compile source code for the host, and one to
>>>     compile CUDA kernels.
>>>
>>>     When compiling in memory, as far as I can tell, you have to
>>>     perform those two passes yourself. And the CUDA pass produces a
>>>     Module that is incompatible with the host Module. You cannot
>>>     simply add it to the JIT. I don't know what to do with it.
>>>
>>>     And yes, I did watch Simeon's presentation, but he didn't get
>>>     into that level of detail (or if he did, I missed it). My
>>>     impression is that he actually uses nvcc to compile the CUDA
>>>     kernels, not clang, using his own parser to separate and adapt
>>>     the source code...
>>>
>>>     Thanks,
>>>     Geoff
>>>
>>>
>>>     On Sun, Nov 22, 2020 at 1:03 AM, Stefan Gränitz
>>>     <stefan.graenitz at gmail.com <mailto:stefan.graenitz at gmail.com>>
>>>     wrote:
>>>
>>>         Hi Geoff
>>>
>>>         It looks like clang handles all of that for you:
>>>         https://llvm.org/docs/CompileCudaWithLLVM.html
>>>
>>>         And, probably related: CUDA support has been added to Cling
>>>         and there was a presentation for it at the last Dev Meeting
>>>         https://www.youtube.com/watch?v=XjjZRhiFDVs
>>>
>>>         Best,
>>>         Stefan
>>>
>>>         On 20/11/2020 12:09, Geoff Levner via llvm-dev wrote:
>>>>         Thanks for that, Valentin.
>>>>
>>>>         To be sure I understand what you are saying... Assume we
>>>>         are talking about a single .cu file containing both a C++
>>>>         function and a CUDA kernel that it invokes, using <<<>>>
>>>>         syntax. Are you suggesting that we bypass clang altogether
>>>>         and use the Nvidia API to compile and install the CUDA
>>>>         kernel? If we do that, how will the JIT-compiled C++
>>>>         function find the kernel?
>>>>
>>>>         Geoff
>>>>
>>>>         On Thu, Nov 19, 2020 at 6:34 PM Valentin Churavy
>>>>         <v.churavy at gmail.com <mailto:v.churavy at gmail.com>> wrote:
>>>>
>>>>             It sounds like you are emitting an LLVM module right
>>>>             now? The best strategy is probably to emit a PTX module
>>>>             and then pass that to the CUDA driver. This is what we
>>>>             do on the Julia side in CUDA.jl.
>>>>
>>>>             Nvidia has a somewhat helpful tutorial on this at
>>>>             https://github.com/NVIDIA/cuda-samples/blob/c4e2869a2becb4b6d9ce5f64914406bf5e239662/Samples/vectorAdd_nvrtc/vectorAdd.cpp
>>>>             and
>>>>             https://github.com/NVIDIA/cuda-samples/blob/c4e2869a2becb4b6d9ce5f64914406bf5e239662/Samples/simpleDrvRuntime/simpleDrvRuntime.cpp
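>>>>
>>>>             For the "pass it to the driver" part, the core of it
>>>>             looks roughly like the sketch below. The kernel name,
>>>>             launch geometry and missing error checks are
>>>>             placeholders; for a C++ kernel the name has to match
>>>>             the mangled symbol in the PTX (or use extern "C"):
>>>>
>>>>             #include <cuda.h>
>>>>
>>>>             // Load a PTX string with the CUDA driver API and launch
>>>>             // a kernel from it. A, B, C are device buffers
>>>>             // allocated by the caller.
>>>>             void launchVecAdd(const char *PTX, CUdeviceptr A,
>>>>                               CUdeviceptr B, CUdeviceptr C, int N) {
>>>>               cuInit(0);
>>>>               CUdevice Dev;
>>>>               cuDeviceGet(&Dev, 0);
>>>>               CUcontext Ctx;
>>>>               cuCtxCreate(&Ctx, 0, Dev);
>>>>
>>>>               CUmodule Mod;
>>>>               cuModuleLoadData(&Mod, PTX); // driver JITs the PTX
>>>>               CUfunction Fn;
>>>>               cuModuleGetFunction(&Fn, Mod, "vecAdd");
>>>>
>>>>               void *Args[] = {&A, &B, &C, &N};
>>>>               cuLaunchKernel(Fn, (N + 255) / 256, 1, 1, // grid
>>>>                              256, 1, 1,                 // block
>>>>                              0, nullptr, Args, nullptr);
>>>>               cuCtxSynchronize();
>>>>             }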
>>>>
>>>>             Hope that helps.
>>>>             -V
>>>>
>>>>
>>>>             On Thu, Nov 19, 2020 at 12:11 PM Geoff Levner via
>>>>             llvm-dev <llvm-dev at lists.llvm.org
>>>>             <mailto:llvm-dev at lists.llvm.org>> wrote:
>>>>
>>>>                 I have made a bit of progress... When compiling
>>>>                 CUDA source code in memory, the Compilation
>>>>                 instance returned by Driver::BuildCompilation()
>>>>                 contains two clang Commands: one for the host and
>>>>                 one for the CUDA device. I can execute both
>>>>                 commands using EmitLLVMOnlyActions. I add the
>>>>                 Module from the host compilation to my JIT as
>>>>                 usual, but... what to do with the Module from the
>>>>                 device compilation? If I just add it to the JIT, I
>>>>                 get an error message like this:
>>>>
>>>>                     Added modules have incompatible data layouts:
>>>>                 e-i64:64-i128:128-v16:16-v32:32-n16:32:64 (module)
>>>>                 vs
>>>>                 e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128
>>>>                 (jit)
>>>>
>>>>                 Any suggestions as to what to do with the Module
>>>>                 containing CUDA kernel code, so that the host
>>>>                 Module can invoke it?
>>>>
>>>>                 Geoff
>>>>
>>>>                 On Tue, Nov 17, 2020 at 6:39 PM Geoff Levner
>>>>                 <glevner at gmail.com <mailto:glevner at gmail.com>> wrote:
>>>>
>>>>                     We have an application that allows the user to
>>>>                     compile and execute C++ code on the fly, using
>>>>                     Orc JIT v2, via the LLJIT class. And we would
>>>>                     like to extend it to allow the user to provide
>>>>                     CUDA source code as well, for GPU programming.
>>>>                     But I am having a hard time figuring out how to
>>>>                     do it.
>>>>
>>>>                     To JIT compile C++ code, we basically do the
>>>>                     following (a rough sketch in code follows the
>>>>                     list):
>>>>
>>>>                     1. call Driver::BuildCompilation(), which
>>>>                     returns a clang Command to execute
>>>>                     2. create a CompilerInvocation using the
>>>>                     arguments from the Command
>>>>                     3. create a CompilerInstance around the
>>>>                     CompilerInvocation
>>>>                     4. use the CompilerInstance to execute an
>>>>                     EmitLLVMOnlyAction
>>>>                     5. retrieve the resulting Module from the
>>>>                     action and add it to the JIT
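>>>>
>>>>                     Roughly, against LLJIT, those five steps look
>>>>                     like the sketch below (diagnostics setup and
>>>>                     error handling are simplified; "code.cpp" and
>>>>                     the clang path are placeholders, and
>>>>                     LLJIT::initialize assumes a recent ORC):
>>>>
>>>>                     #include "clang/Basic/Diagnostic.h"
>>>>                     #include "clang/CodeGen/CodeGenAction.h"
>>>>                     #include "clang/Driver/Compilation.h"
>>>>                     #include "clang/Driver/Driver.h"
>>>>                     #include "clang/Frontend/CompilerInstance.h"
>>>>                     #include "clang/Frontend/CompilerInvocation.h"
>>>>                     #include "llvm/ExecutionEngine/Orc/LLJIT.h"
>>>>                     #include "llvm/ExecutionEngine/Orc/ThreadSafeModule.h"
>>>>                     #include "llvm/Support/Error.h"
>>>>                     #include "llvm/Support/Host.h"
>>>>                     #include <memory>
>>>>
>>>>                     void jitCppFile(llvm::orc::LLJIT &JIT,
>>>>                                     clang::DiagnosticsEngine &Diags) {
>>>>                       // 1. let the driver build the clang Command
>>>>                       clang::driver::Driver D(
>>>>                           "/usr/bin/clang++",
>>>>                           llvm::sys::getDefaultTargetTriple(), Diags);
>>>>                       std::unique_ptr<clang::driver::Compilation> C(
>>>>                           D.BuildCompilation(
>>>>                               {"clang++", "-fsyntax-only", "code.cpp"}));
>>>>                       const clang::driver::Command &Cmd =
>>>>                           *C->getJobs().begin();
>>>>
>>>>                       // 2. + 3. invocation / instance from cc1 args
>>>>                       auto Inv =
>>>>                           std::make_shared<clang::CompilerInvocation>();
>>>>                       clang::CompilerInvocation::CreateFromArgs(
>>>>                           *Inv, Cmd.getArguments(), Diags);
>>>>                       clang::CompilerInstance CI;
>>>>                       CI.setInvocation(std::move(Inv));
>>>>                       CI.createDiagnostics();
>>>>
>>>>                       // 4. run the action in a JIT-owned context
>>>>                       llvm::orc::ThreadSafeContext TSC(
>>>>                           std::make_unique<llvm::LLVMContext>());
>>>>                       clang::EmitLLVMOnlyAction Act(TSC.getContext());
>>>>                       if (!CI.ExecuteAction(Act))
>>>>                         return;
>>>>
>>>>                       // 5. add the Module and run static initializers
>>>>                       //    (where CUDA registration ctors would live)
>>>>                       llvm::cantFail(JIT.addIRModule(
>>>>                           llvm::orc::ThreadSafeModule(Act.takeModule(),
>>>>                                                       TSC)));
>>>>                       llvm::cantFail(
>>>>                           JIT.initialize(JIT.getMainJITDylib()));
>>>>                     }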
>>>>
>>>>                     But compiling C++ requires only a single clang
>>>>                     command. When you add CUDA to the equation, you
>>>>                     add several other steps. If you use the clang
>>>>                     front end to compile, clang does the following
>>>>                     (a sketch for listing these jobs follows):
>>>>
>>>>                     1. compiles the device source code (producing
>>>>                     PTX)
>>>>                     2. compiles the resulting PTX code using the
>>>>                     CUDA ptxas command
>>>>                     3. builds a "fat binary" using the CUDA
>>>>                     fatbinary command
>>>>                     4. compiles the host source code and links in
>>>>                     the fat binary
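>>>>
>>>>                     For reference, one way to see those extra steps
>>>>                     programmatically is to dump the jobs the driver
>>>>                     plans for a .cu input (a sketch; the clang path
>>>>                     and GPU arch are just examples):
>>>>
>>>>                     #include "clang/Basic/Diagnostic.h"
>>>>                     #include "clang/Driver/Compilation.h"
>>>>                     #include "clang/Driver/Driver.h"
>>>>                     #include "llvm/Support/Host.h"
>>>>                     #include "llvm/Support/raw_ostream.h"
>>>>                     #include <memory>
>>>>
>>>>                     // With a CUDA input the job list holds the
>>>>                     // device cc1, ptxas, fatbinary and host cc1
>>>>                     // commands.
>>>>                     void dumpCudaJobs(clang::DiagnosticsEngine &Diags) {
>>>>                       clang::driver::Driver D(
>>>>                           "/usr/bin/clang++",
>>>>                           llvm::sys::getDefaultTargetTriple(), Diags);
>>>>                       std::unique_ptr<clang::driver::Compilation> C(
>>>>                           D.BuildCompilation({"clang++", "-c",
>>>>                                               "--cuda-gpu-arch=sm_60",
>>>>                                               "kernel.cu"}));
>>>>                       for (const auto &Job : C->getJobs()) {
>>>>                         llvm::errs() << Job.getExecutable() << "\n";
>>>>                         for (const char *Arg : Job.getArguments())
>>>>                           llvm::errs() << "  " << Arg << "\n";
>>>>                       }
>>>>                     }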
>>>>
>>>>                     So my question is: how do we replicate that
>>>>                     process in memory, to generate modules that we
>>>>                     can add to our JIT?
>>>>
>>>>                     I am no CUDA expert, and not much of a clang
>>>>                     expert either, so if anyone out there can point
>>>>                     me in the right direction, I would be grateful.
>>>>
>>>>                     Geoff
>>>>
>>>>                 _______________________________________________
>>>>                 LLVM Developers mailing list
>>>>                 llvm-dev at lists.llvm.org
>>>>                 <mailto:llvm-dev at lists.llvm.org>
>>>>                 https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>
>>>>
>>>>         _______________________________________________
>>>>         LLVM Developers mailing list
>>>>         llvm-dev at lists.llvm.org  <mailto:llvm-dev at lists.llvm.org>
>>>>         https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>>         -- 
>>>         https://flowcrypt.com/pub/stefan.graenitz@gmail.com
>>>
>>     -- 
>>     https://flowcrypt.com/pub/stefan.graenitz@gmail.com
>     Cheers,
>     Simeon
>
>     -- 
>     Simeon Ehrig
>     Institut für Strahlenphysik
>     Helmholtz-Zentrum Dresden - Rossendorf e.V. (HZDR)
>     Bautzner Landstr. 400 | 01328 Dresden | Deutschland
>     Tel: +49 (0) 351 260 2974
>     http://www.hzdr.de
>     Vorstand: Prof. Dr. Sebastian M. Schmidt, Dr. Heike Wolke
>     Vereinsregister: VR 1693 beim Amtsgericht Dresden
>
>     Simeon Ehrig
>     Institute of Radiation Physics
>     Helmholtz-Zentrum Dresden - Rossendorf (HZDR)
>     Bautzner Landstr. 400 | 01328 Dresden | Germany
>     Phone: +49 351 260 2974
>     http://www.hzdr.de
>     Board of Directors: Prof. Dr. Sebastian M. Schmidt, Dr. Heike Wolke
>     Company Registration Number VR 1693, Amtsgericht Dresden
>
-- 
Simeon Ehrig
Institut für Strahlenphysik
Helmholtz-Zentrum Dresden - Rossendorf e.V. (HZDR)
Bautzner Landstr. 400 | 01328 Dresden | Deutschland
Tel: +49 (0) 351 260 2974
http://www.hzdr.de
Vorstand: Prof. Dr. Sebastian M. Schmidt, Dr. Heike Wolke
Vereinsregister: VR 1693 beim Amtsgericht Dresden

Simeon Ehrig
Institute of Radiation Physics
Helmholtz-Zentrum Dresden - Rossendorf (HZDR)
Bautzner Landstr. 400 | 01328 Dresden | Germany
Phone: +49 351 260 2974
http://www.hzdr.de
Board of Directors: Prof. Dr. Sebastian M. Schmidt, Dr. Heike Wolke
Company Registration Number VR 1693, Amtsgericht Dresden
