[llvm-dev] JIT compiling CUDA source code

Geoff Levner via llvm-dev llvm-dev at lists.llvm.org
Mon Nov 23 05:38:58 PST 2020


Now THAT answers my question. Thank you very much, Simeon! (I was hoping
there would be a simpler answer, of course...)

Geoff


On Mon, Nov 23, 2020 at 2:26 PM Simeon Ehrig <s.ehrig at hzdr.de> wrote:

> Hi,
> Let me give you a little overview of how the CUDA mode works in Cling.
>
> The workflow of the compiler pipeline is:
>
> 1. compile the device code to nvptx
> 2. wrap the nvptx code in fatbin
> 3. write the fatbin code to a specific file where the host's LLVM IR
> generator can find it (see the sketch after this list)
>   - this is a workaround, because the API of the LLVM IR generator does
> not allow you to pass the fatbin code directly as a string or via a
> virtual file
> 4. generate the LLVM IR code of the host
>   - during generation, the fatbin code is integrated as a text segment
> into the LLVM IR code of the host
>   - the LLVM IR generator also emits some CUDA library calls depending on
> the device code, e.g. the registration of a kernel
>   - this is often done via the global init and deinit functions, which are
> executed before and after the main function - be careful, I had some
> problems in the past because I forgot to call them
>   - there was also a problem with non-unique init functions, see here:
> https://github.com/root-project/cling/pull/233/commits/818faeff0ed86c9730334634f6b58241eae1bf12
> 5. generate the x86 machine code
> 6. execute
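>
> A minimal sketch of what the hand-off in steps 3 and 4 can look like from
> the clang API side (an illustration, not the actual Cling code;
> "device.fatbin" is a placeholder path, and setting up the rest of the host
> invocation is omitted):
>
>   // The host-side CodeGen is told where the fatbin file lives. This is the
>   // field that the cc1 flag -fcuda-include-gpubinary fills in, and it is
>   // what clang's CUDA CodeGen (CGCUDANV.cpp) reads to embed the device
>   // code and emit the registration calls mentioned in step 4.
>   clang::CompilerInstance HostCI;
>   // ... configure HostCI's invocation for the host-side CUDA compile ...
>   HostCI.getCodeGenOpts().CudaGpuBinaryFileName = "device.fatbin";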
>
> Here is a list of interesting code sections where I have solved these
> problems. The full compiler pipeline for the device code is fully
> integrated into Cling and does not depend on the CUDA SDK.
>
> 1. We have two Cling::Interpreter instances. One for the host side and one
> for the device side. The Cling::Interpreter class is a central concept in
> Cling. [1]
>  - a Cling::Interpreter instance contains different components for
> compiling the source code to machine code
>  - in our case it is interesting that
>    - the Cling::Interpreter has a Cling::IncrementalParser, which contains
> the Clang::CompilerInstance that generates the LLVM IR module [2]
>    - the Clang::CompilerInstance is set up with the same arguments as the
> Clang driver [3][4]
>    - Cling::Interpreter does the machine code generation for the host side
>    - has an IncrementalCUDADeviceCompiler object which contains the
> Cling::Interpreter instance for the device code [5]
>    - the parse function first calls the device-side parse function
> before the host parses the code [6]
> 2. The IncrementalCUDADeviceCompiler performs all the steps needed to
> compile the source code into the fatbin and inject it into the LLVM IR
> generator of the host code (a minimal sketch follows below this list).
>    - first, the LLVM IR code is generated with the
> Clang::CompilerInstance [7]
>    - after that, it is compiled to NVPTX
>      - I added my own machine code generator instead of changing the code
> generator of Cling::Interpreter - it is a workaround and needs to be
> refactored [8]
>    - then the code is wrapped in a fatbin [9]
>      - originally I used the tool from Nvidia, but then Hal Finkel
> reimplemented the tool in llvm-project-cxxjit [10]
>    - the fatbin code is written to a file where the CodeGen of the host
> can find it [11][12][13]
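>
> For illustration, a minimal sketch of a device-side compile along the lines
> of [4] and [7], done directly with the clang APIs (the cc1 arguments,
> "kernel.cu" and the Diags object are placeholder assumptions, not the exact
> Cling code; headers roughly: clang/Frontend/CompilerInstance.h,
> clang/Frontend/CompilerInvocation.h, clang/CodeGen/CodeGenAction.h):
>
>   // cc1-style arguments for a CUDA device-only compilation targeting NVPTX.
>   const char *DeviceArgs[] = {
>       "-triple", "nvptx64-nvidia-cuda",
>       "-aux-triple", "x86_64-unknown-linux-gnu",
>       "-fcuda-is-device", "-x", "cuda", "kernel.cu"};
>
>   auto DevInvocation = std::make_shared<clang::CompilerInvocation>();
>   clang::CompilerInvocation::CreateFromArgs(*DevInvocation, DeviceArgs, Diags);
>
>   clang::CompilerInstance DevCI;
>   DevCI.setInvocation(DevInvocation);
>   DevCI.createDiagnostics();
>
>   // Generate the device-side LLVM IR module (cf. [7]). Lowering it to PTX
>   // is then a separate run of the NVPTX backend (cf. [8]).
>   auto DevCtx = std::make_unique<llvm::LLVMContext>();
>   clang::EmitLLVMOnlyAction DevAction(DevCtx.get());
>   DevCI.ExecuteAction(DevAction);
>   std::unique_ptr<llvm::Module> DeviceModule = DevAction.takeModule();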
>
> [1]
> https://github.com/root-project/cling/blob/master/include/cling/Interpreter/Interpreter.h
> [2]
> https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/Interpreter.cpp#L732
> [3]
> https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/CIFactory.cpp#L1302
> [4]
> https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L55
> [5]
> https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/include/cling/Interpreter/Interpreter.h#L214
> [6]
> https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/Interpreter.cpp#L833
> [7]
> https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L267
> [8]
> https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L286
> [9]
> https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L321
> [10] https://github.com/hfinkel/llvm-project-cxxjit
> [11]
> https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp#L402
> [12]
> https://github.com/root-project/cling/blob/3d789b131ae6cb41686fb799f35f8f4760eb2cea/lib/Interpreter/Interpreter.cpp#L240
> [13]
> https://github.com/llvm/llvm-project/blob/101309fe048e66873cfd972c47c4b7e7f2b99f41/clang/lib/CodeGen/CGCUDANV.cpp#L534
> On 11/23/20 11:34 AM, Stefan Gränitz wrote:
>
> My impression is that he actually uses nvcc to compile the CUDA kernels,
> not clang
>
> The constructor here looks very much like the CUDA command line options
> are added to a clang::CompilerInstance. I might be wrong, but you could try
> to follow the trace and see where it ends up:
>
>
> https://github.com/root-project/cling/blob/master/lib/Interpreter/IncrementalCUDADeviceCompiler.cpp
>
> Disclaimer: I am not familiar with the details of Simeon's work or Cling or
> even with JITing CUDA :) Maybe Simeon can confirm or deny my guess.
>
> That is correct. I create a clang::CompilerInstance to compile to NVPTX.
>
>
>
> On 22/11/2020 09:09, Vassil Vassilev wrote:
>
> Adding Simeon in the loop for Cling and CUDA.
>
> Thanks, hi Simeon!
>
>
> On 22/11/2020 09:22, Geoff Levner wrote:
>
> Hi, Stefan.
>
> Yes, when compiling from the command line, clang does all the work for you
> transparently. But behind the scenes it performs two passes: one to compile
> source code for the host, and one to compile CUDA kernels.
>
> When compiling in memory, as far as I can tell, you have to perform those
> two passes yourself. And the CUDA pass produces a Module that is
> incompatible with the host Module. You cannot simply add it to the JIT. I
> don't know what to do with it.
>
> And yes, I did watch Simeon's presentation, but he didn't get into that
> level of detail (or if he did, I missed it). My impression is that he
> actually uses nvcc to compile the CUDA kernels, not clang, using his own
> parser to separate and adapt the source code...
>
> Thanks,
> Geoff
>
>
> Le dim. 22 nov. 2020 à 01:03, Stefan Gränitz <stefan.graenitz at gmail.com>
> a écrit :
>
>> Hi Geoff
>>
>> It looks like clang does that altogether:
>> https://llvm.org/docs/CompileCudaWithLLVM.html
>>
>> And, probably related: CUDA support has been added to Cling and there was
>> a presentation for it at the last Dev Meeting
>> https://www.youtube.com/watch?v=XjjZRhiFDVs
>>
>> Best,
>> Stefan
>>
>> On 20/11/2020 12:09, Geoff Levner via llvm-dev wrote:
>>
>> Thanks for that, Valentin.
>>
>> To be sure I understand what you are saying... Assume we are talking
>> about a single .cu file containing both a C++ function and a CUDA kernel
>> that it invokes, using <<<>>> syntax. Are you suggesting that we bypass
>> clang altogether and use the Nvidia API to compile and install the CUDA
>> kernel? If we do that, how will the JIT-compiled C++ function find the
>> kernel?
>>
>> Geoff
>>
>> On Thu, Nov 19, 2020 at 6:34 PM Valentin Churavy <v.churavy at gmail.com>
>> wrote:
>>
>>> Sounds right now like you are emitting an LLVM module?
>>> The best strategy is probably to emit a PTX module and then pass
>>> that to the CUDA driver. This is what we do on the Julia side in CUDA.jl.
>>>
>>> Nvidia has a somewhat helpful tutorial on this at
>>> https://github.com/NVIDIA/cuda-samples/blob/c4e2869a2becb4b6d9ce5f64914406bf5e239662/Samples/vectorAdd_nvrtc/vectorAdd.cpp
>>> and
>>> https://github.com/NVIDIA/cuda-samples/blob/c4e2869a2becb4b6d9ce5f64914406bf5e239662/Samples/simpleDrvRuntime/simpleDrvRuntime.cpp
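>>>
>>> A minimal sketch of that strategy, assuming the PTX text is already in a
>>> std::string named ptx and the kernel arguments (dA, dB, dC, n) and launch
>>> sizes are set up elsewhere; the mangled kernel name is a placeholder and
>>> error checking is omitted:
>>>
>>>   #include <cuda.h>  // CUDA driver API
>>>
>>>   cuInit(0);
>>>   CUdevice dev;   cuDeviceGet(&dev, 0);
>>>   CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
>>>
>>>   // Load the PTX text and look the kernel up by its mangled name.
>>>   CUmodule mod;
>>>   cuModuleLoadData(&mod, ptx.c_str());
>>>   CUfunction kernel;
>>>   cuModuleGetFunction(&kernel, mod, "_Z9vectorAddPKfS0_Pfi");
>>>
>>>   // Launch with the grid/block sizes and argument pointers from the host code.
>>>   void *args[] = { &dA, &dB, &dC, &n };
>>>   cuLaunchKernel(kernel, blocksPerGrid, 1, 1, threadsPerBlock, 1, 1,
>>>                  /*sharedMemBytes=*/0, /*stream=*/nullptr, args, /*extra=*/nullptr);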
>>>
>>> Hope that helps.
>>> -V
>>>
>>>
>>> On Thu, Nov 19, 2020 at 12:11 PM Geoff Levner via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> I have made a bit of progress... When compiling CUDA source code in
>>>> memory, the Compilation instance returned by Driver::BuildCompilation()
>>>> contains two clang Commands: one for the host and one for the CUDA device.
>>>> I can execute both commands using EmitLLVMOnlyActions. I add the Module
>>>> from the host compilation to my JIT as usual, but... what to do with the
>>>> Module from the device compilation? If I just add it to the JIT, I get an
>>>> error message like this:
>>>>
>>>>     Added modules have incompatible data layouts:
>>>> e-i64:64-i128:128-v16:16-v32:32-n16:32:64 (module) vs
>>>> e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128 (jit)
>>>>
>>>> Any suggestions as to what to do with the Module containing CUDA kernel
>>>> code, so that the host Module can invoke it?
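>>>>
>>>> (One possible direction, sketched here only as an assumption: instead of
>>>> adding the device Module to the JIT, lower it to PTX with the NVPTX
>>>> backend and hand the PTX to the CUDA driver API. "sm_70" and DeviceModule
>>>> are placeholders; headers roughly: llvm/Support/TargetRegistry.h,
>>>> llvm/Target/TargetMachine.h, llvm/IR/LegacyPassManager.h.)
>>>>
>>>>   // The NVPTX target must have been registered first, e.g. via
>>>>   // LLVMInitializeNVPTXTargetInfo(), LLVMInitializeNVPTXTarget(),
>>>>   // LLVMInitializeNVPTXTargetMC() and LLVMInitializeNVPTXAsmPrinter().
>>>>   std::string Err;
>>>>   const llvm::Target *T =
>>>>       llvm::TargetRegistry::lookupTarget("nvptx64-nvidia-cuda", Err);
>>>>   llvm::TargetMachine *TM = T->createTargetMachine(
>>>>       "nvptx64-nvidia-cuda", /*CPU=*/"sm_70", /*Features=*/"",
>>>>       llvm::TargetOptions(), llvm::None);
>>>>
>>>>   llvm::SmallString<0> PTX;
>>>>   llvm::raw_svector_ostream OS(PTX);
>>>>   llvm::legacy::PassManager PM;
>>>>   TM->addPassesToEmitFile(PM, OS, /*DwoOut=*/nullptr, llvm::CGFT_AssemblyFile);
>>>>   PM.run(*DeviceModule);  // the Module from the device compilation
>>>>   // PTX now holds text that the CUDA driver (cuModuleLoadData) can consume.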
>>>>
>>>> Geoff
>>>>
>>>> On Tue, Nov 17, 2020 at 6:39 PM Geoff Levner <glevner at gmail.com> wrote:
>>>>
>>>>> We have an application that allows the user to compile and execute C++
>>>>> code on the fly, using Orc JIT v2, via the LLJIT class. And we would like
>>>>> to extend it to allow the user to provide CUDA source code as well, for GPU
>>>>> programming. But I am having a hard time figuring out how to do it.
>>>>>
>>>>> To JIT compile C++ code, we do basically as follows:
>>>>>
>>>>> 1. call Driver::BuildCompilation(), which returns a clang Command to
>>>>> execute
>>>>> 2. create a CompilerInvocation using the arguments from the Command
>>>>> 3. create a CompilerInstance around the CompilerInvocation
>>>>> 4. use the CompilerInstance to execute an EmitLLVMOnlyAction
>>>>> 5. retrieve the resulting Module from the action and add it to the JIT
>>>>>
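>>>>> A minimal sketch of those five steps ("code.cpp" and the clang path are
>>>>> placeholders, Diags is an already-constructed clang::DiagnosticsEngine,
>>>>> and error handling is omitted):
>>>>>
>>>>>   // Headers (roughly): clang/Driver/Driver.h, clang/Driver/Compilation.h,
>>>>>   // clang/Frontend/CompilerInstance.h, clang/CodeGen/CodeGenAction.h,
>>>>>   // llvm/ExecutionEngine/Orc/LLJIT.h, llvm/Support/Host.h.
>>>>>
>>>>>   // 1. Ask the driver what it would run for this input.
>>>>>   clang::driver::Driver TheDriver("/usr/bin/clang++",
>>>>>                                   llvm::sys::getDefaultTargetTriple(), Diags);
>>>>>   std::unique_ptr<clang::driver::Compilation> C(
>>>>>       TheDriver.BuildCompilation({"clang++", "-fsyntax-only", "code.cpp"}));
>>>>>   const clang::driver::Command &Cmd = *C->getJobs().begin();
>>>>>
>>>>>   // 2. + 3. Build a CompilerInvocation from the cc1 arguments of that
>>>>>   // Command and wrap it in a CompilerInstance.
>>>>>   auto Invocation = std::make_shared<clang::CompilerInvocation>();
>>>>>   clang::CompilerInvocation::CreateFromArgs(*Invocation, Cmd.getArguments(),
>>>>>                                             Diags);
>>>>>   clang::CompilerInstance CI;
>>>>>   CI.setInvocation(Invocation);
>>>>>   CI.createDiagnostics();
>>>>>
>>>>>   // 4. Run the IR-generating frontend action.
>>>>>   auto Ctx = std::make_unique<llvm::LLVMContext>();
>>>>>   clang::EmitLLVMOnlyAction Action(Ctx.get());
>>>>>   CI.ExecuteAction(Action);
>>>>>
>>>>>   // 5. Hand the resulting module to the ORC JIT.
>>>>>   auto JIT = llvm::cantFail(llvm::orc::LLJITBuilder().create());
>>>>>   llvm::cantFail(JIT->addIRModule(
>>>>>       llvm::orc::ThreadSafeModule(Action.takeModule(), std::move(Ctx))));
>>>>>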
>>>>> But to compile C++ requires only a single clang command. When you add
>>>>> CUDA to the equation, you add several other steps. If you use the clang
>>>>> front end to compile, clang does the following:
>>>>>
>>>>> 1. compiles the device source code
>>>>> 2. compiles the resulting PTX code using the CUDA ptxas command
>>>>> 3. builds a "fat binary" using the CUDA fatbinary command
>>>>> 4. compiles the host source code and links in the fat binary
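>>>>>
>>>>> As a hedged sketch of how to see those steps from the API side (the file
>>>>> name and GPU architecture are placeholders; Diags is an existing
>>>>> clang::DiagnosticsEngine): the Compilation returned by the driver contains
>>>>> one job per step, and printing them typically shows the device-side cc1,
>>>>> ptxas, fatbinary, and host-side cc1 invocations when a CUDA installation
>>>>> is detected:
>>>>>
>>>>>   clang::driver::Driver D("/usr/bin/clang++",
>>>>>                           llvm::sys::getDefaultTargetTriple(), Diags);
>>>>>   std::unique_ptr<clang::driver::Compilation> C(D.BuildCompilation(
>>>>>       {"clang++", "-x", "cuda", "--cuda-gpu-arch=sm_70", "-c", "kernel.cu"}));
>>>>>   for (const clang::driver::Command &Job : C->getJobs())
>>>>>     Job.Print(llvm::errs(), "\n", /*Quote=*/true);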
>>>>>
>>>>> So my question is: how do we replicate that process in memory, to
>>>>> generate modules that we can add to our JIT?
>>>>>
>>>>> I am no CUDA expert, and not much of a clang expert either, so if
>>>>> anyone out there can point me in the right direction, I would be grateful.
>>>>>
>>>>> Geoff
>>>>>
>>>>
>>>
>>
>> -- https://flowcrypt.com/pub/stefan.graenitz@gmail.com
>>
>
> Cheers,
> Simeon
>
> --
> Simeon Ehrig
> Institut für Strahlenphysik
> Helmholtz-Zentrum Dresden - Rossendorf e.V. (HZDR)
> Bautzner Landstr. 400 | 01328 Dresden | Deutschland
> Tel: +49 (0) 351 260 2974 | http://www.hzdr.de
> Vorstand: Prof. Dr. Sebastian M. Schmidt, Dr. Heike Wolke
> Vereinsregister: VR 1693 beim Amtsgericht Dresden
>
> Simeon Ehrig
> Institute of Radiation Physics
> Helmholtz-Zentrum Dresden - Rossendorf (HZDR)
> Bautzner Landstr. 400 | 01328 Dresden | Germany
> Phone: +49 351 260 2974 | http://www.hzdr.de
> Board of Directors: Prof. Dr. Sebastian M. Schmidt, Dr. Heike Wolke
> Company Registration Number VR 1693, Amtsgericht Dresden
>
>