[llvm-dev] [LLVMdev] Help with using LLVM to re-compile hot functions at run-time

Revital1 Eres via llvm-dev llvm-dev at lists.llvm.org
Tue Sep 8 00:36:36 PDT 2015


Hi Lang,

Apologies if you receive multiple copies of this email.

After spending some time debugging the Kaleidoscope ORC fully_lazy toy
example on x86, I want to start implementing the run-time optimizer as you
suggested, and again I highly appreciate your help.
For now I'll defer the target-specific implementation until I have the
non-target parts in place, since I can start by running on x86.
Given a simple example of a main function calling foo and bar, IIUC I
should start from the IR level of this module, which means ParseIRFile
will first be called on the IR of the program. Is that right?

I would like to make sure I understand your suggestion, which is to insert
a new layer on top of the CompileCallbackLayer so that trigger_condition
can be checked at the beginning of a function.
IIUC, until a function (foo or bar) is optimized, calls to it go through
the resolver (foo and bar are not compiled from scratch on every trip
through the resolver; after the first compile the cached non-optimized
version is executed). The resolver checks trigger_condition to decide
whether the cached non-optimized version should be executed or a new
optimized version should be compiled and executed.
Once trigger_condition is true, foo and bar are compiled to produce their
optimized versions, and from then on those versions are executed directly
(no longer going through the resolver). Is that right?
Should this layer on top of the CompileCallbackLayer look similar to
class KaleidoscopeJIT?
I saw that in the Kaleidoscope ORC example the lambda functions added in
createLambdaResolver are executed by the resolver before a call is
compiled, so I assume trigger_condition should also be added via
createLambdaResolver: before foo or bar is compiled, the lambdas added by
createLambdaResolver that contain trigger_condition will be executed. Is
that right?
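
To make the question concrete, here is a minimal sketch of how I picture
the resolver setup and exposing jit_recompile_hot to JIT'd code. It is
modeled on the resolver in the fully_lazy toy example; mangle, findSymbol
and jit_recompile_hot itself are placeholders for my own code:

  extern "C" void *jit_recompile_hot(void *FnBody); // my runtime hook

  auto Resolver = createLambdaResolver(
      // Called by the JIT to resolve symbols referenced from the module.
      [&](const std::string &Name) -> RuntimeDyld::SymbolInfo {
        // Expose my runtime hook to JIT'd code.
        if (Name == mangle("jit_recompile_hot"))
          return RuntimeDyld::SymbolInfo(
              reinterpret_cast<uint64_t>(&jit_recompile_hot),
              JITSymbolFlags::Exported);
        // Otherwise look inside the JIT first, then in the host process.
        if (auto Sym = findSymbol(Name))
          return RuntimeDyld::SymbolInfo(Sym.getAddress(), Sym.getFlags());
        if (auto Addr = RTDyldMemoryManager::getSymbolAddressInProcess(Name))
          return RuntimeDyld::SymbolInfo(Addr, JITSymbolFlags::Exported);
        return RuntimeDyld::SymbolInfo(nullptr);
      },
      [](const std::string &Name) {
        return RuntimeDyld::SymbolInfo(nullptr);
      });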

IIUC, in the Kaleidoscope ORC example the interpreter calls addModule upon
parsing a call expression in HandleTopLevelExpression. In my case I assume
addModule should be called on the module returned from ParseIRFile, right?
In that case, will calling getAddress on the whole module (the IR of all
the functions) trigger the lambda functions defined in
createLambdaResolver for foo and bar? Also, in the Kaleidoscope ORC
example the function is executed explicitly in HandleTopLevelExpression
after calling getAddress, and it's not clear to me where I should insert
this in my case.
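
Roughly, I imagine the driver looking like this; MyJIT is a placeholder
for a class shaped like the tutorial's JIT (plus the new layer), and the
rest uses the usual IR-reading APIs:

  #include "llvm/IR/LLVMContext.h"
  #include "llvm/IR/Module.h"
  #include "llvm/IRReader/IRReader.h"
  #include "llvm/Support/SourceMgr.h"
  #include "llvm/Support/raw_ostream.h"

  int main(int argc, char **argv) {
    llvm::LLVMContext Ctx;
    llvm::SMDiagnostic Err;
    auto M = llvm::parseIRFile(argv[1], Err, Ctx);
    if (!M) {
      Err.print(argv[0], llvm::errs());
      return 1;
    }

    MyJIT J;                   // assumed: KaleidoscopeJIT-like layer stack
    J.addModule(std::move(M)); // nothing compiled yet; callbacks installed

    // getAddress on main compiles only main; foo and bar are compiled on
    // their first call, via the resolver.
    auto MainSym = J.findSymbol("main");
    auto *MainFn = reinterpret_cast<int (*)()>(MainSym.getAddress());
    return MainFn();
  }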

Thanks again,
Revital


Lang Hames <lhames at gmail.com> wrote on 28/07/2015 05:58:41 AM:

> From: Lang Hames <lhames at gmail.com>
> To: Revital1 Eres/Haifa/IBM at IBMIL
> Cc: LLVM Developers Mailing List <llvmdev at cs.uiuc.edu>
> Date: 28/07/2015 05:58 AM
> Subject: Re: [LLVMdev] Help with using LLVM to re-compile hot 
> functions at run-time
> 
> Hi Revital,
> 
> What do you mean by "code cache"? Orc (and MCJIT) does have the 
> concept of an ObjectCache, which is a long-lived, potentially 
> persistent, compiled version of some IR. It's not a key component of
> the JIT though: Most clients run without a cache attached and just 
> JIT their code from scratch in each session.
> 
> Recompilation is orthogonal to caching. There is no in-tree support 
> for recompilation yet. There are several ways that it could be 
> supported, depending on what security / performance trade-offs 
> you're willing to make, and how deep in to the LLVM code you want to
> get. As things stand at the moment all function calls in the lazy 
> JIT are indirected via function pointers. We want to add support for
> patchable call-sites, but this hasn't been implemented yet. The 
> indirect calls make recompilation reasonably easy: You could add a 
> transform layer on top of the CompileCallbackLayer which would 
> modify each function like this:
> 
> void foo$impl() {          void foo$impl() {
>   // foo body        ->      if (trigger_condition) {
> }                              auto fooOpt = jit_recompile_hot(&foo);
>                                fooOpt();
>                              }
>                              // foo body
>                            }
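> 
> Here's a rough sketch of how that rewrite might look at the IR level.
> It's untested; it assumes trigger_condition is declared in the module
> as a function returning i1, and jit_recompile_hot as i8* (i8*):
> 
>   #include "llvm/ADT/SmallVector.h"
>   #include "llvm/IR/Function.h"
>   #include "llvm/IR/IRBuilder.h"
>   using namespace llvm;
> 
>   static void addRecompileCheck(Function &F, Function *TriggerCond,
>                                 Function *RecompileHot) {
>     LLVMContext &Ctx = F.getContext();
>     BasicBlock *Body = &F.getEntryBlock();
>     // New entry block: test the trigger before running the real body.
>     BasicBlock *Check = BasicBlock::Create(Ctx, "recompile.check", &F, Body);
>     BasicBlock *Recomp = BasicBlock::Create(Ctx, "recompile", &F, Body);
> 
>     IRBuilder<> B(Check);
>     Value *Hot = B.CreateCall(TriggerCond);
>     B.CreateCondBr(Hot, Recomp, Body);
> 
>     // Hot path: get the optimized version, call it, and return its
>     // result instead of falling through to the unoptimized body.
>     B.SetInsertPoint(Recomp);
>     Value *Raw = B.CreateCall(
>         RecompileHot, {B.CreateBitCast(&F, Type::getInt8PtrTy(Ctx))});
>     Value *OptFn = B.CreateBitCast(Raw, F.getType());
>     SmallVector<Value *, 8> Args;
>     for (auto &A : F.args())
>       Args.push_back(&A);
>     CallInst *Result = B.CreateCall(F.getFunctionType(), OptFn, Args);
>     if (F.getReturnType()->isVoidTy())
>       B.CreateRetVoid();
>     else
>       B.CreateRet(Result);
>   }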
> 
> You would implement the jit_recompile_hot function yourself in your 
> JIT and make it available to JIT'd code via the SymbolResolver. When
> the trigger condition is met you'll get a call to recompile foo, at 
> which point you: (1) Add the IR for foo to a 2nd IRCompileLayer that
> has been configured with a higher optimization level, (2) look up 
> the address of the optimized version of foo, and (3) update the 
> function pointer for foo to point at the optimized version. The 
> process for patchable callsites should be fairly similar once 
> they're available, except that you'll trigger a call-site update 
> rather than rewriting a function pointer.
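> 
> On the host side, jit_recompile_hot might look roughly like this. This 
> is very much pseudo-code: MyJIT, its OptimizeLayer, and the 
> pointer-table update are placeholders for however your JIT stores 
> things:
> 
>   extern "C" void *jit_recompile_hot(void *FnBody) {
>     MyJIT &J = getJIT();                 // assumed global accessor
>     std::string Name = J.getFunctionName(FnBody); // assumed reverse map
>     // (1) Re-add the function's IR via a second IRCompileLayer that
>     //     was constructed with a higher optimization level.
>     auto H = J.OptimizeLayer.addModule(J.cloneFunctionModule(Name));
>     // (2) Look up the address of the optimized definition.
>     auto Addr = J.OptimizeLayer.findSymbolIn(H, Name).getAddress();
>     // (3) Update the indirection pointer so subsequent calls go
>     //     straight to the optimized body.
>     J.updateFunctionPointer(Name, Addr);
>     return reinterpret_cast<void *>(Addr);
>   }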
> 
> This neglects all sorts of fun details (threading, garbage 
> collection of old function implementations), but hopefully it gives 
> you a place to start. 
> 
> Regarding laziness, as Hal mentioned you'll have to provide some 
> target support for PowerPC to support lazy compilation. For a rough 
> guide you can check out the X86_64 support code in llvm/include/
> llvm/ExecutionEngine/Orc/OrcTargetSupport.h and llvm/lib/
> ExecutionEngine/Orc/OrcTargetSupport.cpp.
> 
> There are two methods that you'll need to implement: 
> insertCompileCallbackTrampoline and insertResolverBlock. These work 
> together to enable lazy compilation. Both of these methods inject 
> blobs of target specific code in to the JIT process. To do this (at 
> least for now) I make use of a handy feature of LLVM IR: You can 
> write raw assembly code directly into a bitcode module ("module-
> level asm"). If you look at the X86 implementation of each of these 
> methods you'll see they're written in terms of string-streams 
> building up a string of assembly which will be handed off to the JIT
> to compile like any other code.
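> 
> The emission itself boils down to something like this (simplified; 
> appendModuleInlineAsm is the real Module API, the surrounding function 
> is schematic):
> 
>   #include "llvm/IR/Module.h"
>   #include "llvm/Support/raw_ostream.h"
>   using namespace llvm;
> 
>   void emitTrampolines(Module &M, uint64_t ResolverAddr, unsigned N) {
>     std::string AsmString;
>     raw_string_ostream AsmStream(AsmString);
>     // Pointer to the resolver block, shared by all trampolines.
>     AsmStream << "Lorc_resolve_block_addr:\n"
>               << "  .quad " << ResolverAddr << "\n";
>     // One tiny trampoline per function: the return address of each
>     // call uniquely identifies the function to be compiled.
>     for (unsigned I = 0; I != N; ++I)
>       AsmStream << "orc_jcc_" << I << ":\n"
>                 << "  callq *Lorc_resolve_block_addr(%rip)\n";
>     M.appendModuleInlineAsm(AsmStream.str());
>   }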
> 
> The first blob that you need to be able to output is the resolver 
> block. The purpose of the resolver block is to save program state 
> and call back in to the JIT to trigger lazy compilation of a 
> function. When the JIT is done compiling the function it returns the
> address of the compiled function to the resolver block, and the 
> resolver block returns to the compiled function (rather than its 
> original return address).
> 
> Because all functions share the same resolver block, the JIT needs 
> some way to distinguish them, which is where the trampolines come 
> in. The JIT emits one trampoline per function and each trampoline 
> just calls the resolver block. The return address of the call in 
> each trampoline provides the unique address that the JIT associates 
> with the to-be-compiled function. The CompileCallbackManager 
> manages this association between trampolines and functions for you; 
> you just need to provide the resolver/trampoline primitives.
> 
> In case it helps, here's what the output of all this looks like on 
> X86. Trampolines are trivial - they're emitted in blocks and 
> preceded by a pointer to the resolver block:
> 
> module asm "Lorc_resolve_block_addr:"
> module asm "  .quad 140439143575560"
> module asm "orc_jcc_0:"
> module asm "  callq *Lorc_resolve_block_addr(%rip)"
> module asm "orc_jcc_1:"
> module asm "  callq *Lorc_resolve_block_addr(%rip)"
> module asm "orc_jcc_2:"
> module asm "  callq *Lorc_resolve_block_addr(%rip)"
> ...
> 
> The resolver block is more complicated and I won't provide the full 
> code for it here. You can find it by running:
> 
> lli -jit-kind=orc-lazy -orc-lazy-debug=mods-to-stderr <hello_world.ll>
> 
> and looking at the initial output. In pseudo-asm though, it looks like this:
> 
> module asm "jit_callback_manager_addr:"
> module asm "  .quad 0x46fc190" // <- address of callback manager object
> module asm "orc_resolver_block:"
> module asm "  // save register state."
> module asm "  // load jit_callback_manager_addr into %rdi
> module asm "  // load the return address (from the trampoline call) into 
%rsi
> module asm "  // %rax = call jit(%rdi, %rsi)
> module asm "  // save %rax over the return address
> module asm "  //  restore register state
> module asm "  //  retq"
> 
> So, that's a whirlwind intro to implementing lazy JITing support for
> a new architecture in Orc. I'll try to answer any questions you have
> on the topic, though I'm not familiar with PowerPC at all. If you're
> comfortable with PowerPC assembly I think it should be possible to 
> implement once you grok the concepts.
> 
> Hope this helps!
> 
> Cheers,
> Lang.
> 
> On Jul 26, 2015, at 11:17 PM, Revital1 Eres <ERES at il.ibm.com> wrote: