[llvm-dev] OrcJIT + CUDA Prototype for Cling

Simeon Ehrig via llvm-dev llvm-dev at lists.llvm.org
Tue Jan 16 05:41:46 PST 2018


Hi LLVM-Developers and Lang,

I solved the relocation problem and another problem, so I have a working
cuda-runtime-code-interpreter [1].

The solution to the relocation problem was to change the relocation
model from the default (dynamic relocation) to PIC (position independent code).
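
For reference, a minimal sketch of that change, assuming the
TargetMachine is created through an EngineBuilder as in Lang's
Kaleidoscope example quoted below (variable names are illustrative,
not the actual cling code):

  // Illustrative sketch, not the actual cling code.
  llvm::EngineBuilder EB;
  EB.setRelocationModel(llvm::Reloc::PIC_); // was the default model before
  std::unique_ptr<llvm::TargetMachine> TM(EB.selectTarget());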

The second problem was that I got a cudaErrorInvalidDeviceFunction
error (error code 8) when I tried to run a kernel. After some research, I
found out that the kernel code is registered
(__cudaRegisterFatBinary(...) ) in a global constructor, which is
generated by the CUDA backend (lib/CodeGen/CGCUDANV.cpp). This ctor
should run before the main function, but I was calling main directly. So
the error was caused by running main directly and skipping the global
cuda ctor and dtor. I wrote a fix which runs the ctor and dtor before
and after main, and everything works fine.
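
Sketched against the Orc APIs of this LLVM version, the fix looks
roughly like this, assuming a CompileLayer and symbol Resolver set up
as in the Kaleidoscope example below; mangle() is a hypothetical
helper that applies the DataLayout name mangling:

  #include "llvm/ExecutionEngine/Orc/ExecutionUtils.h"

  // Collect the ctor/dtor names before moving the module into the JIT.
  // mangle() is a hypothetical helper (DataLayout name mangling).
  std::vector<std::string> CtorNames, DtorNames;
  for (auto Ctor : orc::getConstructors(*M))
    CtorNames.push_back(mangle(Ctor.Func->getName()));
  for (auto Dtor : orc::getDestructors(*M))
    DtorNames.push_back(mangle(Dtor.Func->getName()));

  auto H = cantFail(CompileLayer.addModule(std::move(M), Resolver));

  // Run the ctors first; this executes the generated global constructor
  // that calls __cudaRegisterFatBinary(...).
  orc::CtorDtorRunner<decltype(CompileLayer)> CtorRunner(std::move(CtorNames), H);
  cantFail(CtorRunner.runViaLayer(CompileLayer));

  // ... run main() here ...

  // Then run the dtors (__cudaUnregisterFatBinary and friends).
  orc::CtorDtorRunner<decltype(CompileLayer)> DtorRunner(std::move(DtorNames), H);
  cantFail(DtorRunner.runViaLayer(CompileLayer));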

Cheers,
Simeon


On 14.11.2017 at 22:15, Simeon Ehrig wrote:
>
> Hi Lang,
>
> thank you very much. I used your code, and creating the object file
> works. I think the problem occurs after the object file is created:
> when I link the object file with ld, I get an executable that works
> correctly.
>
> After switching the clang and llvm libraries from the packaged
> version (.deb) to a self-compiled version with debug options, I get
> an assertion failure in
> void RuntimeDyldELF::resolveX86_64Relocation(), in the case for
> ELF::R_X86_64_PC32.
> You can find the code in the file
> llvm/lib/ExecutionEngine/RuntimeDyld/RuntimeDyldELF.cpp . I don't know
> exactly what this function does, but after a first look, I think it
> has something to do with the linking. Maybe you know more about the
> function?
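>
> For context, the failing case looks roughly like this (paraphrased
> from RuntimeDyldELF.cpp; the assert fires when the PC-relative
> offset between the patch site and the target does not fit into 32
> bits):
>
>   case ELF::R_X86_64_PC32: {
>     uint64_t FinalAddress = Section.getLoadAddressWithOffset(Offset);
>     int64_t RealOffset = Value + Addend - FinalAddress;
>     assert(isInt<32>(RealOffset)); // <- the failing assertion
>     int32_t TruncOffset = (RealOffset & 0xFFFFFFFF);
>     support::ulittle32_t::ref(Section.getAddressWithOffset(Offset)) =
>         TruncOffset;
>     break;
>   }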
>
> Your code also helps me understand better how the interpreter
> library works. I also have some new ideas for how I could pin down
> the concrete problem and solve it.
>
> Cheers,
> Simeon
>
> On 09.11.2017 at 00:59, Lang Hames wrote:
>> Hi Simeon,
>>
>> I think the best thing would be to add an ObjectTransformLayer
>> between your CompileLayer and LinkingLayer so that you can capture
>> the object files as they're generated. Then you can inspect the
>> object files being generated by the compiler to see what might be
>> wrong with them.
>>
>> Something like this:
>>
>> class KaleidoscopeJIT {
>> private:
>>
>>   using ObjectPtr =
>> std::shared_ptr<object::OwningBinary<object::ObjectFile>>;
>>
>>   static ObjectPtr dumpObject(ObjectPtr Obj) {
>>     SmallVector<char, 256> UniqueObjFileName;
>>     sys::fs::createUniqueFile("jit-object-%%%.o", UniqueObjFileName);
>>     std::error_code EC;
>>     raw_fd_ostream ObjFileStream(UniqueObjFileName.data(), EC,
>>                                  sys::fs::F_RW);
>>     ObjFileStream.write(Obj->getBinary()->getData().data(),
>>                         Obj->getBinary()->getData().size());
>>     return Obj;
>>   }
>>
>>   std::unique_ptr<TargetMachine> TM;
>>   const DataLayout DL;
>>   RTDyldObjectLinkingLayer ObjectLayer;
>>   ObjectTransformLayer<decltype(ObjectLayer),
>>                        decltype(&KaleidoscopeJIT::dumpObject)>
>>       DumpObjectsLayer;
>>   IRCompileLayer<decltype(DumpObjectsLayer), SimpleCompiler>
>>       CompileLayer;
>>
>> public:
>>   using ModuleHandle = decltype(CompileLayer)::ModuleHandleT;
>>
>>   KaleidoscopeJIT()
>>       : TM(EngineBuilder().selectTarget()), DL(TM->createDataLayout()),
>>         ObjectLayer([]() {
>>           return std::make_shared<SectionMemoryManager>();
>>         }),
>>         DumpObjectsLayer(ObjectLayer, &KaleidoscopeJIT::dumpObject),
>>         CompileLayer(DumpObjectsLayer, SimpleCompiler(*TM)) {
>>     llvm::sys::DynamicLibrary::LoadLibraryPermanently(nullptr);
>>   }
>>
>> Hope this helps!
>>
>> Cheers,
>> Lang.
>>
>>
>> On Wed, Sep 27, 2017 at 10:32 AM, Simeon Ehrig via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>
>>     Dear LLVM-Developers and Vinod Grover,
>>
>>     we are trying to extend the cling C++ interpreter
>>     (https://github.com/root-project/cling) with CUDA functionality
>>     for Nvidia GPUs.
>>
>>     I have already developed a prototype based on OrcJIT and am
>>     seeking feedback. I am currently stuck on a runtime issue: my
>>     interpreter prototype fails to execute kernels, aborting with a
>>     CUDA runtime error.
>>
>>
>>     === How to use the prototype
>>
>>     This application interprets CUDA runtime code. The program
>>     needs the whole CUDA program (.cu file) and its pre-compiled
>>     device code (as a fatbin) as input:
>>
>>         command: cuda-interpreter [source].cu [kernels].fatbin
>>
>>     I also implemented an alternative mode, which generates an
>>     object file. The object file can be linked (with ld) into an
>>     executable. This mode exists only to check whether the LLVM
>>     module generation works as expected. Activate it by changing
>>     the define INTERPRET from 1 to 0.
>>
>>     === Implementation
>>
>>     The prototype is based on the clang example in
>>
>>     https://github.com/llvm-mirror/clang/tree/master/examples/clang-interpreter
>>
>>     I also pushed the source code to github with the install
>>     instructions and examples:
>>       https://github.com/SimeonEhrig/CUDA-Runtime-Interpreter
>>
>>     The device code can be compiled to ptx with either clang's
>>     CUDA frontend or NVCC.
>>
>>     Here is the workflow in five stages:
>>
>>      1. generate ptx device code (a kind of Nvidia assembly)
>>      2. translate the ptx to sass (the actual device machine code)
>>      3. generate a fatbinary (a kind of wrapper for the device code)
>>      4. generate the host code object file (using the fatbinary as
>>         input)
>>      5. link to an executable
>>
>>     (The exact commands are stored in commands.txt in the github
>>     repo; see the sketch right below.)
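>>
>>     A rough sketch of these commands, assuming sm_30 as the target
>>     architecture (the flags here are illustrative only; the
>>     authoritative versions are in commands.txt):
>>
>>       # 1. device code to ptx (clang's CUDA frontend)
>>       clang++ -S --cuda-device-only --cuda-gpu-arch=sm_30 \
>>           runtime.cu -o kernel.ptx
>>       # 2. ptx to sass (cubin)
>>       ptxas -arch=sm_30 kernel.ptx -o kernel.cubin
>>       # 3. wrap the device code in a fatbinary
>>       fatbinary --create=kernel.fatbin \
>>           --image=profile=sm_30,file=kernel.cubin
>>       # 4. host object file, embedding the fatbinary
>>       clang++ -c --cuda-host-only -Xclang -fcuda-include-gpubinary \
>>           -Xclang kernel.fatbin runtime.cu -o runtime.o
>>       # 5. link against the cuda runtime
>>       ld ... runtime.o -lcudart -o runtime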
>>
>>     The interpreter replaces the 4th and 5th steps: it interprets
>>     the host code, taking the pre-compiled device code as a
>>     fatbinary. The fatbinary (steps 1 to 3) is generated with the
>>     clang compiler and the Nvidia tools ptxas and fatbinary.
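>>
>>     Inside the interpreter, the host code is then executed directly
>>     from the JIT; conceptually something like this (a sketch only,
>>     assuming Kaleidoscope-style Orc layers and an illustrative
>>     mangle() helper for the DataLayout name mangling):
>>
>>       // mangle() is a hypothetical helper, as in the tutorials.
>>       auto MainSym = CompileLayer.findSymbol(mangle("main"), false);
>>       auto Main =
>>           (int (*)(int, char **))cantFail(MainSym.getAddress());
>>       int Result = Main(Argc, Argv);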
>>
>>     === Test Cases and Issues
>>
>>     You can find the test sources on GitHub in the directory "example_prog".
>>
>>     Run the tests with cuda-interpreter and the two arguments described above:
>>
>>      [1] path to the source code in "example_prog"
>>          - note: even for host-only code, use the file ending .cu
>>
>>      [2] path to the runtime .fatbin
>>          - note: needs the file ending .fatbin
>>          - a fatbin file is always required, but if the program
>>     doesn't launch a kernel, the content of the file is ignored
>>
>>     Note: as this is a prototype, the input handling is static and
>>     barely validated yet.
>>
>>     1. hello.cu: a simple C++ hello world program with the cmath
>>     library call sqrt() -> works without problems
>>
>>     2. pthread_test.cu: a C++ program which starts a second thread
>>     -> works without problems
>>
>>     3. fat_memory.cu: uses the cuda library and allocates about 191
>>     MB of VRAM. After the allocation, the program waits for 3
>>     seconds, so you can check the memory usage with nvidia-smi ->
>>     works without problems
>>
>>     4. runtime.cu: combines the cuda library with a simple cuda
>>     kernel -> generates an object file, which can be linked (see the
>>     5th call in the commands above -> ld ...) into a working
>>     executable.
>>
>>     The last example has the following issue: running the linked
>>     executable works fine, but interpreting the code instead does
>>     not. The CUDA runtime returns error 8
>>     (cudaErrorInvalidDeviceFunction) and the kernel fails.
>>
>>     Do you have any idea how to proceed?
>>
>>
>>     Best regards,
>>     Simeon Ehrig
>>
>>     _______________________________________________
>>     LLVM Developers mailing list
>>     llvm-dev at lists.llvm.org
>>     http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>
