[llvm-dev] JIT compiling CUDA source code

Simeon Ehrig via llvm-dev llvm-dev at lists.llvm.org
Mon Nov 23 05:26:15 PST 2020


Hi,

Let me give you a little overview of how the CUDA mode works in Cling.

The workflow of the compiler pipeline is:

1. compile the device code to PTX
2. wrap the PTX code in a fatbin
3. write the fatbin code to a specific file where the host's LLVM IR 
generator can find it
   - this is more of a workaround, because the API of the LLVM IR generator 
does not allow passing the fatbin code directly as a string or using a 
virtual file (see the sketch after this list)
4. generate the LLVM IR code of the host
   - during generation, the fatbin code is embedded as a text segment in 
the LLVM IR code of the host
   - the LLVM IR generator also emits some CUDA library calls, depending 
on the device code, e.g. the registration of a kernel
   - this is often done via global init and deinit functions, which are 
executed before and after the main function - be careful: I ran into 
problems in the past because I forgot to call them
   - there was also a problem with non-unique init functions, see here: 
https://github.com/root-project/cling/pull/233/commits/818faeff0ed86c9730334634f6b58241eae1bf12
5. generate the x86 machine code
6. execute
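
Regarding steps 3 and 4: the hook on the host side is essentially clang's 
cc1 option -fcuda-include-gpubinary, i.e. the CudaGpuBinaryFileName codegen 
option that CGCUDANV reads (see [13]). A minimal sketch, assuming a 
clang::CompilerInstance that is already set up for the host half of a CUDA 
compilation (the helper name is mine, not Cling's):

#include "clang/Frontend/CompilerInstance.h"
#include <string>

// Point the host-side CodeGen at the fatbin file. This mirrors the cc1
// flag -fcuda-include-gpubinary <file>: CodeGen embeds the file's contents
// into the host module and emits the __cuda_register_* machinery around it.
void attachDeviceFatbin(clang::CompilerInstance &HostCI,
                        const std::string &FatbinPath) {
  HostCI.getCodeGenOpts().CudaGpuBinaryFileName = FatbinPath;
}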

Here is a list of interesting code sections where I solved these 
problems. The full compiler pipeline for the device code is fully 
integrated into Cling and does not depend on the CUDA SDK.

1. We have two cling::Interpreter instances, one for the host side and 
one for the device side. The cling::Interpreter class is a central 
concept in Cling. [1]
  - a cling::Interpreter instance contains the different components for 
compiling source code to machine code
  - the interesting points for our case are:
    - the cling::Interpreter has a cling::IncrementalParser, which 
contains the clang::CompilerInstance that generates the LLVM IR module [2]
    - the clang::CompilerInstance is set up with the same arguments as 
the clang driver uses [3][4]
    - the cling::Interpreter does the machine code generation for the host side
    - it has an IncrementalCUDADeviceCompiler object, which contains the 
cling::Interpreter instance for the device code [5]
    - its parse function first calls the parse function of the device 
interpreter before the host parses the code [6]
2. The IncrementalCUDADeviceCompiler performs all the steps needed to compile 
the source code to a fatbin and to inject it into the LLVM IR generator 
of the host code.
    - first, the LLVM IR code is generated with the 
clang::CompilerInstance [7]
    - after that, it is compiled to PTX
      - I added my own machine code generator instead of changing the 
code generator of cling::Interpreter - it is a workaround and needs to 
be refactored [8] (see the PTX-emission sketch after this list)
    - then the code is wrapped in a fatbin [9]
      - originally I used the tool from Nvidia, but then Hal Finkel 
reimplemented the tool in llvm-project-cxxjit [10]
    - the fatbin code is written to a file where the CodeGen of the host 
can find it [11][12][13]
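
For [8], the device-side "machine code generation" boils down to lowering 
the device module with LLVM's NVPTX backend. A minimal sketch of that step, 
not Cling's actual code (LLVM 11-era API; sm_50 is an arbitrary example 
architecture and error handling is reduced to returning an empty string):

#include "llvm/ADT/SmallString.h"
#include "llvm/IR/LegacyPassManager.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/TargetRegistry.h"  // llvm/MC/TargetRegistry.h in newer LLVM
#include "llvm/Support/TargetSelect.h"
#include "llvm/Support/raw_ostream.h"
#include "llvm/Target/TargetMachine.h"
#include <memory>
#include <string>

// Lower a device-side llvm::Module to PTX text with the NVPTX backend.
std::string emitPTX(llvm::Module &DeviceM) {
  LLVMInitializeNVPTXTargetInfo();
  LLVMInitializeNVPTXTarget();
  LLVMInitializeNVPTXTargetMC();
  LLVMInitializeNVPTXAsmPrinter();

  const std::string Triple = "nvptx64-nvidia-cuda";
  std::string Err;
  const llvm::Target *T = llvm::TargetRegistry::lookupTarget(Triple, Err);
  if (!T)
    return "";

  std::unique_ptr<llvm::TargetMachine> TM(T->createTargetMachine(
      Triple, /*CPU=*/"sm_50", /*Features=*/"", llvm::TargetOptions(),
      llvm::Reloc::PIC_));
  DeviceM.setDataLayout(TM->createDataLayout());

  llvm::SmallString<0> PTX;
  llvm::raw_svector_ostream OS(PTX);
  llvm::legacy::PassManager PM;
  // For NVPTX, the "assembly" output of the backend is the PTX text itself.
  if (TM->addPassesToEmitFile(PM, OS, /*DwoOut=*/nullptr,
                              llvm::CGFT_AssemblyFile))
    return "";
  PM.run(DeviceM);
  return std::string(PTX.str());
}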

[1] https://github.com/root-project/cling/blob/master/include/cling/Interpreter/Interpreter.h
[2] https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/Interpreter.cpp#L732
[3] https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/CIFactory.cpp#L1302
[4] https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L55
[5] https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/include/cling/Interpreter/Interpreter.h#L214
[6] https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/Interpreter.cpp#L833
[7] https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L267
[8] https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L286
[9] https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L321
[10] https://github.com/hfinkel/llvm-project-cxxjit
[11] https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L402
[12] https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/Interpreter.cpp#L240
[13] https://github.com/llvm/llvm-project/blob/101309fe048e66873cfd972c47c4b7e7f2b99f41/clang/lib/CodeGen/CGCUDANV.cpp#L534

On 11/23/20 11:34 AM, Stefan Gränitz wrote:
>> My impression is that he actually uses nvcc to compile the CUDA 
>> kernels, not clang
> The constructor here looks very much like the CUDA command line 
> options are added to a clang::CompilerInstance. I might be wrong, but 
> you could try to follow the trace and see where it ends up:
>
> https://github.com/root-project/cling/blob/master/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp
>
> Disclaimer: I am not familiar with the details of Simeons work or 
> cling or even with JITing CUDA :) Maybe Simeon can confirm or deny my 
> guess.
That is correct. I create a clang::CompilerInstance to compile the device 
code to PTX.
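
For reference, a rough sketch of such a device-only setup outside of Cling. 
This is not Cling's actual code; it assumes the clang driver can find a CUDA 
installation (otherwise pass --cuda-path), and the input file name and GPU 
architecture are placeholders:

#include "clang/Basic/DiagnosticOptions.h"
#include "clang/CodeGen/CodeGenAction.h"
#include "clang/Driver/Compilation.h"
#include "clang/Driver/Driver.h"
#include "clang/Frontend/CompilerInstance.h"
#include "clang/Frontend/CompilerInvocation.h"
#include "clang/Frontend/TextDiagnosticPrinter.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/Host.h"
#include <memory>
#include <vector>

// Build a device-only CUDA compilation through the clang driver and run an
// EmitLLVMOnlyAction on its cc1 job to get the device-side LLVM module.
std::unique_ptr<llvm::Module> compileDeviceSide(llvm::LLVMContext &Ctx) {
  llvm::IntrusiveRefCntPtr<clang::DiagnosticOptions> DiagOpts(
      new clang::DiagnosticOptions());
  clang::DiagnosticsEngine Diags(
      new clang::DiagnosticIDs(), DiagOpts,
      new clang::TextDiagnosticPrinter(llvm::errs(), DiagOpts.get()));

  clang::driver::Driver D("clang++", llvm::sys::getDefaultTargetTriple(),
                          Diags);
  std::vector<const char *> Args = {"clang++", "-x", "cuda",
                                    "--cuda-device-only",
                                    "--cuda-gpu-arch=sm_50", "-S",
                                    "kernel.cu"};
  std::unique_ptr<clang::driver::Compilation> C(D.BuildCompilation(Args));
  if (!C || C->getJobs().empty())
    return nullptr;

  // Reuse the cc1 argument list of the (single) device job.
  const auto &CC1Args = C->getJobs().begin()->getArguments();
  auto Inv = std::make_shared<clang::CompilerInvocation>();
  clang::CompilerInvocation::CreateFromArgs(*Inv, CC1Args, Diags);

  clang::CompilerInstance CI;
  CI.setInvocation(std::move(Inv));
  CI.createDiagnostics();

  clang::EmitLLVMOnlyAction Act(&Ctx);
  if (!CI.ExecuteAction(Act))
    return nullptr;
  return Act.takeModule();  // module targeting nvptx64-nvidia-cuda
}
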
>
>
> On 22/11/2020 09:09, Vassil Vassilev wrote:
>> Adding Simeon in the loop for Cling and CUDA. 
> Thanks, hi Simeon!
>
>
> On 22/11/2020 09:22, Geoff Levner wrote:
>> Hi, Stefan.
>>
>> Yes, when compiling from the command line, clang does all the work 
>> for you transparently. But behind the scenes it performs two passes: 
>> one to compile source code for the host, and one to compile CUDA 
>> kernels.
>>
>> When compiling in memory, as far as I can tell, you have to perform 
>> those two passes yourself. And the CUDA pass produces a Module that 
>> is incompatible with the host Module. You cannot simply add it to the 
>> JIT. I don't know what to do with it.
>>
>> And yes, I did watch Simeon's presentation, but he didn't get into 
>> that level of detail (or if he did, I missed it). My impression is 
>> that he actually uses nvcc to compile the CUDA kernels, not clang, 
>> using his own parser to separate and adapt the source code...
>>
>> Thanks,
>> Geoff
>>
>>
>> On Sun, Nov 22, 2020 at 01:03, Stefan Gränitz 
>> <stefan.graenitz at gmail.com <mailto:stefan.graenitz at gmail.com>> wrote:
>>
>>     Hi Geoff
>>
>>     It looks like clang does that altogether:
>>     https://llvm.org/docs/CompileCudaWithLLVM.html
>>
>>     And, probably related: CUDA support has been added to Cling and
>>     there was a presentation for it at the last Dev Meeting
>>     https://www.youtube.com/watch?v=XjjZRhiFDVs
>>
>>     Best,
>>     Stefan
>>
>>     On 20/11/2020 12:09, Geoff Levner via llvm-dev wrote:
>>>     Thanks for that, Valentin.
>>>
>>>     To be sure I understand what you are saying... Assume we are
>>>     talking about a single .cu file containing both a C++ function
>>>     and a CUDA kernel that it invokes, using <<<>>> syntax. Are you
>>>     suggesting that we bypass clang altogether and use the Nvidia
>>>     API to compile and install the CUDA kernel? If we do that, how
>>>     will the JIT-compiled C++ function find the kernel?
>>>
>>>     Geoff
>>>
>>>     On Thu, Nov 19, 2020 at 6:34 PM Valentin Churavy
>>>     <v.churavy at gmail.com <mailto:v.churavy at gmail.com>> wrote:
>>>
>>>         Sounds right now like you are emitting an LLVM module?
>>>         The best strategy is probably to emit a PTX module
>>>         and then pass that to the CUDA driver. This is what we do on
>>>         the Julia side in CUDA.jl.
>>>
>>>         Nvidia has a somewhat helpful tutorial on this at
>>>         https://github.com/NVIDIA/cuda-samples/blob/c4e2869a2becb4b6d9ce5f64914406bf5e239662/Samples/vectorAdd_nvrtc/vectorAdd.cpp
>>>         and
>>>         https://github.com/NVIDIA/cuda-samples/blob/c4e2869a2becb4b6d9ce5f64914406bf5e239662/Samples/simpleDrvRuntime/simpleDrvRuntime.cpp
>>>
>>>         Hope that helps.
>>>         -V
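
A minimal sketch of that route - loading a PTX string through the CUDA 
driver API and launching a kernel from it. The kernel name, parameter list, 
and launch configuration below are illustrative assumptions, and error 
checking is omitted:

#include <cuda.h>

// JIT-load a PTX string with the CUDA driver API and launch a kernel from
// it. Assumes the kernel is declared extern "C" (otherwise use its mangled
// name). All names and the launch geometry are examples.
void launchFromPTX(const char *PTX, CUdeviceptr A, CUdeviceptr B,
                   CUdeviceptr C, int N) {
  cuInit(0);
  CUdevice Dev;
  cuDeviceGet(&Dev, 0);
  CUcontext Ctx;
  cuCtxCreate(&Ctx, 0, Dev);

  CUmodule Mod;
  cuModuleLoadData(&Mod, PTX);             // the driver JIT-compiles the PTX
  CUfunction Kernel;
  cuModuleGetFunction(&Kernel, Mod, "vecAdd");

  void *Args[] = {&A, &B, &C, &N};
  cuLaunchKernel(Kernel, (N + 255) / 256, 1, 1,   // grid
                 256, 1, 1,                       // block
                 0, nullptr, Args, nullptr);      // shared mem, stream
  cuCtxSynchronize();
}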
>>>
>>>
>>>         On Thu, Nov 19, 2020 at 12:11 PM Geoff Levner via llvm-dev
>>>         <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>>
>>>         wrote:
>>>
>>>             I have made a bit of progress... When compiling CUDA
>>>             source code in memory, the Compilation instance returned
>>>             by Driver::BuildCompilation() contains two clang
>>>             Commands: one for the host and one for the CUDA device.
>>>             I can execute both commands using EmitLLVMOnlyActions. I
>>>             add the Module from the host compilation to my JIT as
>>>             usual, but... what to do with the Module from the device
>>>             compilation? If I just add it to the JIT, I get an error
>>>             message like this:
>>>
>>>                 Added modules have incompatible data layouts:
>>>             e-i64:64-i128:128-v16:16-v32:32-n16:32:64 (module) vs
>>>             e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128
>>>             (jit)
>>>
>>>             Any suggestions as to what to do with the Module
>>>             containing CUDA kernel code, so that the host Module can
>>>             invoke it?
>>>
>>>             Geoff
>>>
>>>             On Tue, Nov 17, 2020 at 6:39 PM Geoff Levner
>>>             <glevner at gmail.com <mailto:glevner at gmail.com>> wrote:
>>>
>>>                 We have an application that allows the user to
>>>                 compile and execute C++ code on the fly, using Orc
>>>                 JIT v2, via the LLJIT class. And we would like to
>>>                 extend it to allow the user to provide CUDA source
>>>                 code as well, for GPU programming. But I am having a
>>>                 hard time figuring out how to do it.
>>>
>>>                 To JIT compile C++ code, we do basically as follows:
>>>
>>>                 1. call Driver::BuildCompilation(), which returns a
>>>                 clang Command to execute
>>>                 2. create a CompilerInvocation using the arguments
>>>                 from the Command
>>>                 3. create a CompilerInstance around the
>>>                 CompilerInvocation
>>>                 4. use the CompilerInstance to execute an
>>>                 EmitLLVMOnlyAction
>>>                 5. retrieve the resulting Module from the action and
>>>                 add it to the JIT
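
A hedged sketch of step 5 with Orc's LLJIT (the class mentioned above; 
LLVM 11-era lookup API, the "entry" symbol and its signature are 
illustrative, and error handling is compressed into ExitOnError):

#include "llvm/ExecutionEngine/Orc/LLJIT.h"
#include "llvm/ExecutionEngine/Orc/ThreadSafeModule.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/Error.h"
#include "llvm/Support/TargetSelect.h"

// Hand the module from EmitLLVMOnlyAction to LLJIT and call a function.
int runEntry(std::unique_ptr<llvm::Module> M,
             std::unique_ptr<llvm::LLVMContext> Ctx) {
  llvm::InitializeNativeTarget();
  llvm::InitializeNativeTargetAsmPrinter();
  llvm::ExitOnError ExitOnErr;

  auto JIT = ExitOnErr(llvm::orc::LLJITBuilder().create());
  ExitOnErr(JIT->addIRModule(
      llvm::orc::ThreadSafeModule(std::move(M), std::move(Ctx))));

  auto Sym = ExitOnErr(JIT->lookup("entry"));
  auto *Entry = reinterpret_cast<int (*)()>(Sym.getAddress());
  return Entry();  // the JIT-compiled code runs here
}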
>>>
>>>                 But to compile C++ requires only a single clang
>>>                 command. When you add CUDA to the equation, you add
>>>                 several other steps. If you use the clang front end
>>>                 to compile, clang does the following:
>>>
>>>                 1. compiles the device source code to PTX
>>>                 2. compiles the resulting PTX code using the CUDA
>>>                 ptxas command
>>>                 3. builds a "fat binary" using the CUDA fatbinary
>>>                 command
>>>                 4. compiles the host source code and links in the
>>>                 fat binary
>>>
>>>                 So my question is: how do we replicate that process
>>>                 in memory, to generate modules that we can add to
>>>                 our JIT?
>>>
>>>                 I am no CUDA expert, and not much of a clang expert
>>>                 either, so if anyone out there can point me in the
>>>                 right direction, I would be grateful.
>>>
>>>                 Geoff
>>>
>>>             _______________________________________________
>>>             LLVM Developers mailing list
>>>             llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
>>>             https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>>
>>>     _______________________________________________
>>>     LLVM Developers mailing list
>>>     llvm-dev at lists.llvm.org  <mailto:llvm-dev at lists.llvm.org>
>>>     https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>     -- 
>>     https://flowcrypt.com/pub/stefan.graenitz@gmail.com
>>
> -- 
> https://flowcrypt.com/pub/stefan.graenitz@gmail.com
Cheers,
Simeon

-- 
Simeon Ehrig
Institute of Radiation Physics
Helmholtz-Zentrum Dresden - Rossendorf (HZDR)
Bautzner Landstr. 400 | 01328 Dresden | Germany
Phone: +49 351 260 2974
http://www.hzdr.de
Board of Directors: Prof. Dr. Sebastian M. Schmidt, Dr. Heike Wolke
Company Registration Number VR 1693, Amtsgericht Dresden
