I have made a bit of progress... When compiling CUDA source code in memory, the Compilation instance returned by Driver::BuildCompilation() contains two clang Commands: one for the host and one for the CUDA device. I can execute both Commands in process, using an EmitLLVMOnlyAction for each. I add the Module from the host compilation to my JIT as usual, but... what to do with the Module from the device compilation? If I just add it to the JIT, I get an error message like this:

    Added modules have incompatible data layouts: e-i64:64-i128:128-v16:16-v32:32-n16:32:64 (module) vs e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128 (jit)

(The first data layout is the device Module's NVPTX layout; the second is the x86-64 host layout the LLJIT instance was built for.)

Any suggestions as to what to do with the Module containing CUDA kernel code, so that the host Module can invoke it?
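For reference, this is roughly what I am doing now, following the pattern from the clang-interpreter example. A simplified sketch: the helper name compileCommand() and the overall structure are mine, and error handling is mostly elided.

```cpp
#include "clang/Basic/Diagnostic.h"
#include "clang/CodeGen/CodeGenAction.h"
#include "clang/Driver/Compilation.h"
#include "clang/Driver/Driver.h"
#include "clang/Driver/Job.h"
#include "clang/Frontend/CompilerInstance.h"
#include "clang/Frontend/CompilerInvocation.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/IR/Module.h"
#include <memory>

// Run one clang Command in process as an EmitLLVMOnlyAction and return the
// Module it produces (or nullptr on failure).
static std::unique_ptr<llvm::Module>
compileCommand(const clang::driver::Command &Cmd, llvm::LLVMContext &Ctx,
               clang::DiagnosticsEngine &Diags) {
  auto Invocation = std::make_shared<clang::CompilerInvocation>();
  clang::CompilerInvocation::CreateFromArgs(*Invocation, Cmd.getArguments(),
                                            Diags);
  clang::CompilerInstance Instance;
  Instance.setInvocation(std::move(Invocation));
  Instance.createDiagnostics();

  clang::EmitLLVMOnlyAction Action(&Ctx);
  if (!Instance.ExecuteAction(Action))
    return nullptr;
  return Action.takeModule();
}

// For a CUDA input, the Compilation holds (at least) two clang Commands:
// the device-side compile and the host-side compile.
void compileCuda(clang::driver::Driver &TheDriver,
                 llvm::ArrayRef<const char *> Args, llvm::LLVMContext &Ctx,
                 clang::DiagnosticsEngine &Diags) {
  std::unique_ptr<clang::driver::Compilation> C(
      TheDriver.BuildCompilation(Args));
  for (const auto &Job : C->getJobs()) {
    // Skip any non-clang jobs (ptxas, fatbinary, linker...).
    if (llvm::StringRef(Job.getCreator().getName()) != "clang")
      continue;
    std::unique_ptr<llvm::Module> M = compileCommand(Job, Ctx, Diags);
    if (!M)
      continue;
    // The device Module carries an NVPTX triple and data layout -- this is
    // the one that triggers the "incompatible data layouts" error above.
    if (llvm::StringRef(M->getTargetTriple()).startswith("nvptx")) {
      // device Module: cannot go into the host JIT as-is
    } else {
      // host Module: add to the JIT as usual
    }
  }
}
```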
Geoff

On Tue, Nov 17, 2020 at 6:39 PM Geoff Levner <glevner@gmail.com> wrote:
> We have an application that allows the user to compile and execute C++
> code on the fly, using Orc JIT v2, via the LLJIT class. We would like to
> extend it to allow the user to provide CUDA source code as well, for GPU
> programming. But I am having a hard time figuring out how to do it.
>
> To JIT compile C++ code, we do basically as follows:
>
> 1. call Driver::BuildCompilation(), which returns a clang Command to execute
> 2. create a CompilerInvocation using the arguments from the Command
> 3. create a CompilerInstance around the CompilerInvocation
> 4. use the CompilerInstance to execute an EmitLLVMOnlyAction
> 5. retrieve the resulting Module from the action and add it to the JIT
>    (see the sketch after this quoted message)
>
> Compiling C++ requires only a single clang Command, but CUDA adds several
> other steps. If you compile with the clang front end, clang does the
> following:
>
> 1. compiles the device source code to PTX
> 2. assembles the resulting PTX code using the CUDA ptxas command
> 3. builds a "fat binary" using the CUDA fatbinary command
> 4. compiles the host source code and links in the fat binary
>
> So my question is: how do we replicate that process in memory, to generate
> modules that we can add to our JIT?
>
> I am no CUDA expert, and not much of a clang expert either, so if anyone
> out there can point me in the right direction, I would be grateful.
>
> Geoff
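P.S. For anyone following along, here is a minimal sketch of step 5 from the quoted list, assuming the Module and its LLVMContext came out of the EmitLLVMOnlyAction; the entry-point name "entry" is just an example.

```cpp
#include "llvm/ExecutionEngine/Orc/LLJIT.h"
#include "llvm/ExecutionEngine/Orc/ThreadSafeModule.h"
#include "llvm/IR/Module.h"
#include "llvm/Support/Error.h"
#include <cstdint>
#include <memory>

llvm::Error runModule(std::unique_ptr<llvm::Module> M,
                      std::unique_ptr<llvm::LLVMContext> Ctx) {
  // Build an LLJIT instance for the host target.
  auto JIT = llvm::orc::LLJITBuilder().create();
  if (!JIT)
    return JIT.takeError();

  // Hand ownership of the Module (and its context) to the JIT.
  if (auto Err = (*JIT)->addIRModule(
          llvm::orc::ThreadSafeModule(std::move(M), std::move(Ctx))))
    return Err;

  // Look up a symbol defined by the Module and call it.
  auto Sym = (*JIT)->lookup("entry");
  if (!Sym)
    return Sym.takeError();
  auto *Entry = (void (*)())(intptr_t)Sym->getAddress();
  Entry();
  return llvm::Error::success();
}
```

Note that ThreadSafeModule takes ownership of both the Module and its LLVMContext, which is why the context is passed along rather than kept by the caller.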