[llvm-dev] JIT compiling CUDA source code
Simeon Ehrig via llvm-dev
llvm-dev at lists.llvm.org
Mon Nov 23 05:26:15 PST 2020
Hi,
Let me give you a little overview of how the CUDA mode works in Cling.
The workflow of the compiler pipeline is:
1. compile the device code to NVPTX (see the sketch after this list)
2. wrap the NVPTX code in a fatbin
3. write the fatbin code to a specific file where the host's LLVM IR
generator can find it
- this is a workaround, because the API of the LLVM IR generator does
not allow passing the fatbin code directly as a string or through a
virtual file
4. generate the LLVM IR code of the host
- during generation, the fatbin code is embedded as a text segment in
the LLVM IR code of the host
- depending on the device code, the LLVM IR generator also emits some
CUDA library calls, e.g. the registration of a kernel
- these calls are often placed in the global init and deinit
functions, which run before and after the main function - careful, I
had problems in the past because I forgot to call them
- there was also a problem with non-unique init functions, see here:
https://github.com/root-project/cling/pull/233/commits/818faeff0ed86c9730334634f6b58241eae1bf12
5. generate the x86 machine code
6. execute
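To make step 1 concrete, here is a minimal sketch (not the Cling code
itself) of lowering a device-side llvm::Module to PTX with the NVPTX
backend. API names match the LLVM 11 era (e.g. llvm::CGFT_AssemblyFile
and the Support/TargetRegistry.h header), error handling is elided, and
sm_61 is just an example architecture:

  #include "llvm/ADT/SmallString.h"
  #include "llvm/IR/LegacyPassManager.h"
  #include "llvm/IR/Module.h"
  #include "llvm/Support/TargetRegistry.h"
  #include "llvm/Support/TargetSelect.h"
  #include "llvm/Support/raw_ostream.h"
  #include "llvm/Target/TargetMachine.h"

  // Lower a device-side module to PTX text (illustration only).
  std::string emitPTX(llvm::Module &M) {
    LLVMInitializeNVPTXTargetInfo();
    LLVMInitializeNVPTXTarget();
    LLVMInitializeNVPTXTargetMC();
    LLVMInitializeNVPTXAsmPrinter();

    std::string Err;
    const llvm::Target *T =
        llvm::TargetRegistry::lookupTarget("nvptx64-nvidia-cuda", Err);
    llvm::TargetMachine *TM = T->createTargetMachine(
        "nvptx64-nvidia-cuda", "sm_61", "", llvm::TargetOptions(),
        llvm::None);
    M.setDataLayout(TM->createDataLayout());

    llvm::SmallString<0> PTX;
    llvm::raw_svector_ostream OS(PTX);
    llvm::legacy::PassManager PM;
    // The NVPTX backend's "assembly file" output is the PTX text.
    TM->addPassesToEmitFile(PM, OS, nullptr, llvm::CGFT_AssemblyFile);
    PM.run(M);
    return std::string(PTX);
  }
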
Here is a list of interesting code sections where I solved these
problems. The full compiler pipeline for the device code is fully
integrated into Cling and does not depend on the CUDA SDK.
1. We have two cling::Interpreter instances, one for the host side and
one for the device side (a rough sketch of this setup follows this
item). The cling::Interpreter class is a central concept in Cling. [1]
- a cling::Interpreter instance contains the different components for
compiling source code to machine code
- interesting for our case:
- the cling::Interpreter has a cling::IncrementalParser, which
contains the clang::CompilerInstance that generates the LLVM IR module [2]
- the clang::CompilerInstance is set up with the same arguments as
the clang driver [3][4]
- the cling::Interpreter does the machine code generation for the host side
- the host interpreter has an IncrementalCUDADeviceCompiler object,
which contains the cling::Interpreter instance for the device code [5]
- the host's parse function first calls the device side's parse
function before the host itself parses the code [6]
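As a rough illustration of this two-interpreter idea (this is not how
Cling wires it up internally; the flags and the LLVM path below are
placeholders I chose for the example), one could instantiate the class
twice with different clang arguments:

  #include "cling/Interpreter/Interpreter.h"

  int main() {
    // Placeholder arguments; Cling's CUDA mode sets this up internally.
    const char *hostArgs[] = {"cling", "-x", "cuda", "--cuda-host-only"};
    const char *devArgs[] = {"cling", "-x", "cuda", "--cuda-device-only"};
    cling::Interpreter host(4, hostArgs, "/path/to/llvm");
    cling::Interpreter device(4, devArgs, "/path/to/llvm");

    // Mirrors the parse order described above: device first, then host.
    const char *src = "__global__ void k() {}";
    device.declare(src);
    host.declare(src);
    host.process("k<<<1, 1>>>();");
    return 0;
  }
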
2. The IncrementalCUDADeviceCompiler runs all the steps that compile
the source code to a fatbin and inject it into the LLVM IR generator
of the host code.
- first, the LLVM IR code is generated with the
clang::CompilerInstance [7] (a small sketch of this step follows this
list)
- after that, it is compiled to NVPTX
- I added my own machine code generator instead of changing the
code generator of cling::Interpreter - it is a workaround and needs to
be refactored [8]
- then the code is wrapped in a fatbin [9]
- originally I used the tool from Nvidia, but then Hal Finkel
reimplemented the tool in llvm-project-cxxjit [10]
- the fatbin code is written to a file where the CodeGen of the host
can find it [11][12][13]
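For the first sub-step, here is a minimal sketch of pulling an LLVM
module out of a clang::CompilerInstance with EmitLLVMOnlyAction. It
mirrors the idea in [7], not its exact code; setting up the invocation
and diagnostics is elided:

  #include "clang/CodeGen/CodeGenAction.h"
  #include "clang/Frontend/CompilerInstance.h"
  #include "llvm/IR/LLVMContext.h"
  #include "llvm/IR/Module.h"
  #include <memory>

  // Run the frontend and take ownership of the generated module.
  std::unique_ptr<llvm::Module> emitLLVM(clang::CompilerInstance &CI,
                                         llvm::LLVMContext &Ctx) {
    clang::EmitLLVMOnlyAction Action(&Ctx);
    if (!CI.ExecuteAction(Action))
      return nullptr;
    return Action.takeModule();
  }
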
[1] https://github.com/root-project/cling/blob/master/include/cling/Interpreter/Interpreter.h
[2] https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/Interpreter.cpp#L732
[3] https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/CIFactory.cpp#L1302
[4] https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L55
[5] https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/include/cling/Interpreter/Interpreter.h#L214
[6] https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/Interpreter.cpp#L833
[7] https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L267
[8] https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L286
[9] https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L321
[10] https://github.com/hfinkel/llvm-project-cxxjit
[11] https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L402
[12] https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/Interpreter.cpp#L240
[13] https://github.com/llvm/llvm-project/blob/101309fe048e66873cfd972c47c4b7e7f2b99f41/clang/lib/CodeGen/CGCUDANV.cpp#L534
On 11/23/20 11:34 AM, Stefan Gränitz wrote:
>> My impression is that he actually uses nvcc to compile the CUDA
>> kernels, not clang
> The constructor here looks very much like the CUDA command line
> options are added to a clang::CompilerInstance. I might be wrong, but
> you could try to follow the trace and see where it ends up:
>
> https://github.com/root-project/cling/blob/master/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp
>
> Disclaimer: I am not familiar with the details of Simeon's work or
> cling or even with JITing CUDA :) Maybe Simeon can confirm or deny my
> guess.
That is correct. I create a clang::CompilerInstance to compile the
device code to NVPTX.
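For reference, a hedged sketch of how such a device-only
clang::CompilerInstance can be built outside of Cling. The flags are
real clang driver flags, but the helper name, the file name and the
GPU architecture are just assumptions for the example (clang 11-era
API):

  #include "clang/Basic/DiagnosticOptions.h"
  #include "clang/Frontend/CompilerInstance.h"
  #include "clang/Frontend/Utils.h"
  #include <memory>

  // Build a CompilerInstance that compiles only the CUDA device side.
  std::unique_ptr<clang::CompilerInstance> makeDeviceCI() {
    const char *Args[] = {"clang", "-x", "cuda", "--cuda-device-only",
                          "--cuda-gpu-arch=sm_61", "kernel.cu"};
    auto Diags = clang::CompilerInstance::createDiagnostics(
        new clang::DiagnosticOptions());
    auto Inv = clang::createInvocationFromCommandLine(Args, Diags);
    if (!Inv)
      return nullptr;
    auto CI = std::make_unique<clang::CompilerInstance>();
    CI->setInvocation(std::move(Inv));
    CI->createDiagnostics();
    return CI;
  }
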
>
>
> On 22/11/2020 09:09, Vassil Vassilev wrote:
>> Adding Simeon in the loop for Cling and CUDA.
> Thanks, hi Simeon!
>
>
> On 22/11/2020 09:22, Geoff Levner wrote:
>> Hi, Stefan.
>>
>> Yes, when compiling from the command line, clang does all the work
>> for you transparently. But behind the scenes it performs two passes:
>> one to compile source code for the host, and one to compile CUDA
>> kernels.
>>
>> When compiling in memory, as far as I can tell, you have to perform
>> those two passes yourself. And the CUDA pass produces a Module that
>> is incompatible with the host Module. You cannot simply add it to the
>> JIT. I don't know what to do with it.
>>
>> And yes, I did watch Simeon's presentation, but he didn't get into
>> that level of detail (or if he did, I missed it). My impression is
>> that he actually uses nvcc to compile the CUDA kernels, not clang,
>> using his own parser to separate and adapt the source code...
>>
>> Thanks,
>> Geoff
>>
>>
>> On Sun, Nov 22, 2020 at 01:03, Stefan Gränitz
>> <stefan.graenitz at gmail.com> wrote:
>>
>> Hi Geoff
>>
>> It looks like clang handles all of that:
>> https://llvm.org/docs/CompileCudaWithLLVM.html
>>
>> And, probably related: CUDA support has been added to Cling and
>> there was a presentation for it at the last Dev Meeting
>> https://www.youtube.com/watch?v=XjjZRhiFDVs
>>
>> Best,
>> Stefan
>>
>> On 20/11/2020 12:09, Geoff Levner via llvm-dev wrote:
>>> Thanks for that, Valentin.
>>>
>>> To be sure I understand what you are saying... Assume we are
>>> talking about a single .cu file containing both a C++ function
>>> and a CUDA kernel that it invokes, using <<<>>> syntax. Are you
>>> suggesting that we bypass clang altogether and use the Nvidia
>>> API to compile and install the CUDA kernel? If we do that, how
>>> will the JIT-compiled C++ function find the kernel?
>>>
>>> Geoff
>>>
>>> On Thu, Nov 19, 2020 at 6:34 PM Valentin Churavy
>>> <v.churavy at gmail.com> wrote:
>>>
>>> Sounds like you are emitting an LLVM module right now?
>>> The best strategy is probably to emit a PTX module
>>> and then pass that to the CUDA driver. This is what we do on
>>> the Julia side in CUDA.jl (a minimal sketch follows below).
>>>
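>>> A minimal sketch of that approach, assuming the PTX text is
>>> already in hand; the entry name "kernel" is a placeholder and
>>> error checks are collapsed into a macro:
>>>
>>>   #include <cuda.h>
>>>   #include <cstdio>
>>>
>>>   #define CK(x) do { CUresult r = (x); if (r != CUDA_SUCCESS) { \
>>>     std::fprintf(stderr, "CUDA error %d\n", r); return 1; } } while (0)
>>>
>>>   int launchFromPTX(const char *ptx) {
>>>     CK(cuInit(0));
>>>     CUdevice dev;
>>>     CK(cuDeviceGet(&dev, 0));
>>>     CUcontext ctx;
>>>     CK(cuCtxCreate(&ctx, 0, dev));
>>>     CUmodule mod;  // the driver JIT-compiles the PTX here
>>>     CK(cuModuleLoadData(&mod, ptx));
>>>     CUfunction fn;
>>>     CK(cuModuleGetFunction(&fn, mod, "kernel"));
>>>     // No kernel parameters here, so kernelParams can be null.
>>>     CK(cuLaunchKernel(fn, 1, 1, 1, 1, 1, 1, 0, nullptr,
>>>                       nullptr, nullptr));
>>>     CK(cuCtxSynchronize());
>>>     return 0;
>>>   }
>>>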
>>> Nvidia has a somewhat helpful tutorial on this at
>>> https://github.com/NVIDIA/cuda-samples/blob/c4e2869a2becb4b6d9ce5f64914406bf5e239662/Samples/vectorAdd_nvrtc/vectorAdd.cpp
>>> and
>>> https://github.com/NVIDIA/cuda-samples/blob/c4e2869a2becb4b6d9ce5f64914406bf5e239662/Samples/simpleDrvRuntime/simpleDrvRuntime.cpp
>>>
>>> Hope that helps.
>>> -V
>>>
>>>
>>> On Thu, Nov 19, 2020 at 12:11 PM Geoff Levner via llvm-dev
>>> <llvm-dev at lists.llvm.org>
>>> wrote:
>>>
>>> I have made a bit of progress... When compiling CUDA
>>> source code in memory, the Compilation instance returned
>>> by Driver::BuildCompilation() contains two clang
>>> Commands: one for the host and one for the CUDA device.
>>> I can execute both commands using EmitLLVMOnlyActions. I
>>> add the Module from the host compilation to my JIT as
>>> usual, but... what to do with the Module from the device
>>> compilation? If I just add it to the JIT, I get an error
>>> message like this:
>>>
>>> Added modules have incompatible data layouts:
>>> e-i64:64-i128:128-v16:16-v32:32-n16:32:64 (module) vs
>>> e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128
>>> (jit)
>>>
>>> Any suggestions as to what to do with the Module
>>> containing CUDA kernel code, so that the host Module can
>>> invoke it?
>>>
>>> Geoff
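>>>
>>> A small guard can at least detect that mismatch before the JIT
>>> errors out (hypothetical, assuming an llvm::orc::LLJIT named JIT):
>>>
>>>   // NVPTX module: don't add it to the host JIT; lower it to PTX
>>>   // and hand it to the CUDA driver instead, as sketched above.
>>>   if (M->getDataLayout() != JIT->getDataLayout()) { /* ... */ }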
>>>
>>> On Tue, Nov 17, 2020 at 6:39 PM Geoff Levner
>>> <glevner at gmail.com> wrote:
>>>
>>> We have an application that allows the user to
>>> compile and execute C++ code on the fly, using Orc
>>> JIT v2, via the LLJIT class. And we would like to
>>> extend it to allow the user to provide CUDA source
>>> code as well, for GPU programming. But I am having a
>>> hard time figuring out how to do it.
>>>
>>> To JIT compile C++ code, we do basically as follows
>>> (sketched in code after this list):
>>>
>>> 1. call Driver::BuildCompilation(), which returns a
>>> clang Command to execute
>>> 2. create a CompilerInvocation using the arguments
>>> from the Command
>>> 3. create a CompilerInstance around the
>>> CompilerInvocation
>>> 4. use the CompilerInstance to execute an
>>> EmitLLVMOnlyAction
>>> 5. retrieve the resulting Module from the action and
>>> add it to the JIT
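>>>
>>> A condensed sketch of those five steps (clang 11-era APIs,
>>> modeled on clang's clang-interpreter example; diagnostics
>>> setup and error handling are elided, and the names are mine):
>>>
>>>   #include "clang/Basic/Diagnostic.h"
>>>   #include "clang/CodeGen/CodeGenAction.h"
>>>   #include "clang/Driver/Compilation.h"
>>>   #include "clang/Driver/Driver.h"
>>>   #include "clang/Frontend/CompilerInstance.h"
>>>   #include "llvm/ExecutionEngine/Orc/LLJIT.h"
>>>   #include "llvm/ExecutionEngine/Orc/ThreadSafeModule.h"
>>>   #include "llvm/IR/LLVMContext.h"
>>>
>>>   void jitSource(clang::driver::Driver &D,
>>>                  llvm::ArrayRef<const char *> Args,
>>>                  clang::DiagnosticsEngine &Diags,
>>>                  llvm::orc::LLJIT &JIT) {
>>>     // 1. let the driver plan the compilation
>>>     std::unique_ptr<clang::driver::Compilation> C(
>>>         D.BuildCompilation(Args));
>>>     const clang::driver::Command &Cmd =
>>>         llvm::cast<clang::driver::Command>(*C->getJobs().begin());
>>>     // 2. build a CompilerInvocation from the -cc1 arguments
>>>     auto Inv = std::make_shared<clang::CompilerInvocation>();
>>>     clang::CompilerInvocation::CreateFromArgs(
>>>         *Inv, Cmd.getArguments(), Diags);
>>>     // 3. wrap the invocation in a CompilerInstance
>>>     clang::CompilerInstance CI;
>>>     CI.setInvocation(std::move(Inv));
>>>     CI.createDiagnostics();
>>>     // 4. run the EmitLLVMOnlyAction
>>>     auto Ctx = std::make_unique<llvm::LLVMContext>();
>>>     clang::EmitLLVMOnlyAction Act(Ctx.get());
>>>     if (!CI.ExecuteAction(Act))
>>>       return;
>>>     // 5. hand the resulting module to the JIT
>>>     llvm::cantFail(JIT.addIRModule(llvm::orc::ThreadSafeModule(
>>>         Act.takeModule(), std::move(Ctx))));
>>>   }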
>>>
>>> But compiling C++ requires only a single clang
>>> command. When you add CUDA to the equation, you add
>>> several other steps. If you use the clang front end
>>> to compile, clang does the following (see the job
>>> listing sketch after this list):
>>>
>>> 1. compiles the device source code to PTX
>>> 2. compiles the resulting PTX code using the CUDA
>>> ptxas command
>>> 3. builds a "fat binary" using the CUDA fatbinary
>>> command
>>> 4. compiles the host source code and links in the
>>> fat binary
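>>>
>>> To see those steps, one can ask the driver for its job
>>> list - a tiny sketch, assuming a clang::driver::Driver D
>>> set up for the host triple:
>>>
>>>   // Print the commands clang would run for a CUDA input.
>>>   std::unique_ptr<clang::driver::Compilation> C(
>>>       D.BuildCompilation({"clang", "-c", "kernel.cu"}));
>>>   for (const auto &Job : C->getJobs())
>>>     llvm::errs() << Job.getExecutable() << "\n";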
>>>
>>> So my question is: how do we replicate that process
>>> in memory, to generate modules that we can add to
>>> our JIT?
>>>
>>> I am no CUDA expert, and not much of a clang expert
>>> either, so if anyone out there can point me in the
>>> right direction, I would be grateful.
>>>
>>> Geoff
>>>
Cheers,
Simeon
--
Simeon Ehrig
Institute of Radiation Physics
Helmholtz-Zentrum Dresden - Rossendorf (HZDR)
Bautzner Landstr. 400 | 01328 Dresden | Germany
Phone: +49 351 260 2974
http://www.hzdr.de
Board of Directors: Prof. Dr. Sebastian M. Schmidt, Dr. Heike Wolke
Company Registration Number VR 1693, Amtsgericht Dresden