[llvm-dev] [LLVMdev] Help with using LLVM to re-compile hot functions at run-time

Thu Sep 17 23:47:49 PDT 2015

Hi Revital,

Attached is a new version of the fully_lazy Orc Kaleidoscope demo that has
been extended to enable re-compilation at higher optimisation levels,
roughly following the scheme I outlined before.

In the compile action for the callback, the initial IR for each function is
transformed like this:

                           unsigned foo_counter = 0;
void foo$impl() {          void foo$impl() {
  // foo body        ->      if (++foo_counter > 1000) {
}                              auto fooOpt = $recompile(&foo);
                               fooOpt();
                             }
                             // foo body
                           }

The key changes to make this work (which you can see by diff'ing toy.cpp
against the original fully_lazy version):

1) New layers HotCompileLayer and HotIROptsLayer added. These perform IR
optimisation and code generation at higher optimisation levels than the
default layers.
2) The symbol resolver (not to be confused with the resolver block) has
been pulled out into its own function, createResolver, so that it can be
shared between optimised and non-optimised code. It also resolves the
"$recompile" function to a static method on the KaleidoscopeJIT class
itself.
3) The lazy compile action now calls 'instrumentFunctions' before adding
the IR for cold functions to the JIT.
4) The instrumentFunctions method injects the counter code and call to
recompile.
5) The recompileHot method re-IRGens functions, then adds them to the
HotIROpts layer to generate more optimized versions. It then updates the
function-body pointer so that subsequent calls go to the optimised version.
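
As a rough illustration of steps 3-5, here is a minimal, self-contained C++
sketch of the same control flow with no LLVM involved. Everything in it is a
hypothetical stand-in rather than code from the attached demo: fooCold and
fooHot play the roles of the unoptimised and optimised compilations of foo,
fooBodyPtr plays the function-body pointer, and recompile plays "$recompile"
(a real implementation would re-IRGen the function and compile it through the
hot layers instead of returning a pre-built function). Unlike the diagram
above, the sketch returns the optimised function's result directly rather
than falling through to the cold body.

```cpp
#include <cassert>

// All calls to foo go through a mutable function pointer, mirroring the
// function-pointer indirection used by the lazy JIT.
using FnPtr = int (*)(int);

int fooCold(int x);                   // "unoptimised" build, with counter
int fooHot(int x) { return x * 2; }   // stand-in for the "optimised" build

FnPtr fooBodyPtr = fooCold;           // hypothetical function-body pointer
unsigned fooCounter = 0;              // injected per-function call counter
unsigned recompileCount = 0;          // how often recompile() has run
constexpr unsigned kHotThreshold = 1000;

// Stand-in for "$recompile": a real JIT would compile a new version here.
// It updates the body pointer so later calls skip the cold version.
FnPtr recompile(FnPtr *bodyPtr) {
  ++recompileCount;
  *bodyPtr = fooHot;
  return fooHot;
}

// Instrumented cold version: counter check first, then the original body.
int fooCold(int x) {
  if (++fooCounter > kHotThreshold) {
    FnPtr fooOpt = recompile(&fooBodyPtr);
    return fooOpt(x);
  }
  return x + x;                       // original foo body
}

// What a call site does: an indirect call through the body pointer.
int callFoo(int x) { return fooBodyPtr(x); }
```

With this sketch, the first 1000 calls to callFoo run the cold body; the
1001st call triggers recompile once, and every later call goes straight to
fooHot without touching the counter.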

This is a bit quick-and-dirty, but does work. In the future I'll try to
tidy this up and turn it into a new tutorial chapter.

Hope this helps!

Cheers,
Lang.

On Wed, Sep 16, 2015 at 10:09 PM, Revital1 Eres <ERES at il.ibm.com> wrote:

> Hi Lang,
>
> Many thanks!!! I just wanted to make sure you did not miss it...
>
> Thanks again!
> Revital
>
>
>
> From:        Lang Hames <lhames at gmail.com>
> To:        Revital1 Eres/Haifa/IBM at IBMIL
> Cc:        LLVM Developers Mailing List <llvmdev at cs.uiuc.edu>
> Date:        17/09/2015 01:56 AM
> Subject:        Re: [LLVMdev] Help with using LLVM to re-compile hot
> functions at run-time
> ------------------------------
>
>
>
> Hi Revital,
>
> Apologies for the delayed reply.
>
> I'm working on some example code for how to do this. I'll try to post it
> tomorrow.
>
> Cheers,
> Lang.
>
> On Tue, Sep 8, 2015 at 12:23 AM, Revital1 Eres <ERES at il.ibm.com> wrote:
> Hi Lang,
>
> After spending some time debugging the Kaleidoscope Orc fully_lazy toy
> example on x86, I want to start implementing the run-time optimizer as you
> suggested, and again I highly appreciate your help.
> For now I'll defer the target-specific implementation to the end, after I
> have the non-target parts in place, as I can run on x86 as a start.
> Given a simple example of a main function calling foo and bar functions:
> IIUC I should start from the IR level of this module, which means that
> ParseIRFile will first be called on the IR of the program. Is that right?
>
> I would like to make sure I understand your suggestion, which is to insert
> a new layer on top of the CompileCallbackLayer in order to be able to call
> trigger_condition at the beginning of a function.
> IIUC, until the function (bar or foo) is optimized, the calls to foo and bar
> will go through the resolver (foo and bar will not be compiled from scratch
> every time we go through the resolver, but will rather execute the cached
> non-optimized version once it has been compiled). The resolver will check
> trigger_condition to see whether the cached non-optimized version should be
> executed or a new optimized version should be compiled and executed.
> Once trigger_condition is true, foo and bar will be compiled to generate
> their optimized versions, and these versions will be executed directly from
> then on (not going through the resolver any more). Is that right?
> Should this layer on top of the CompileCallbackLayer be similar to
> class KaleidoscopeJIT?
> I saw that in the Kaleidoscope Orc example the lambda functions that are
> added in createLambdaResolver are executed by the resolver before compiling
> a call, so I assume that trigger_condition should also be added via
> createLambdaResolver, so that before compiling foo or bar the lambda
> functions that are added by calling createLambdaResolver and contain
> trigger_condition will be executed. Is that right?
>
> IIUC in the Kaleidoscope Orc example the interpreter calls addModule upon
> parsing a call expression in HandleTopLevelExpression.
> In my case I assume addModule will be called for the module returned from
> ParseIRFile, right?
> In this case, will calling getAddress on the whole module (the IR of all
> functions) trigger calling the lambda functions defined in
> createLambdaResolver on the foo and bar functions? Also, in the Kaleidoscope
> Orc example the execution of the function is done explicitly in
> HandleTopLevelExpression after calling getAddress, and it's not clear to me
> where I should insert this in my case.
>
> Thanks again,
> Revital
>
>
>
>
> From:        Lang Hames <lhames at gmail.com>
> To:        Revital1 Eres/Haifa/IBM at IBMIL
> Cc:        LLVM Developers Mailing List <llvmdev at cs.uiuc.edu>
> Date:        28/07/2015 05:58 AM
> Subject:        Re: [LLVMdev] Help with using LLVM to re-compile hot
> functions at run-time
> ------------------------------
>
>
>
> Hi Revital,
>
> What do you mean by "code cache"? Orc (and MCJIT) does have the concept of
> an ObjectCache, which is a long-lived, potentially persistent, compiled
> version of some IR. It's not a key component of the JIT though: Most
> clients run without a cache attached and just JIT their code from scratch
> in each session.
>
> Recompilation is orthogonal to caching. There is no in-tree support for
> recompilation yet. There are several ways that it could be supported,
> depending on what security / performance trade-offs you're willing to make,
> and how deep into the LLVM code you want to get. As things stand at the
> moment all function calls in the lazy JIT are indirected via function
> pointers. We want to add support for patchable call-sites, but this hasn't
> been implemented yet. The indirect calls make recompilation reasonably
> easy: You could add a transform layer on top of the CompileCallbackLayer
> which would modify each function like this:
>
> void foo$impl() {          void foo$impl() {
>   // foo body        ->      if (trigger_condition) {
> }                              auto fooOpt = jit_recompile_hot(&foo);
>                                fooOpt();
>                              }
>                              // foo body
>                            }
>
> You would implement the jit_recompile_hot function yourself in your JIT
> and make it available to JIT'd code via the SymbolResolver. When the
> trigger condition is met you'll get a call to recompile foo, at which point
> you: (1) Add the IR for foo to a 2nd IRCompileLayer that has been
> configured with a higher optimization level, (2) look up the address of the
> optimized version of foo, and (3) update the function pointer for foo to
> point at the optimized version. The process for patchable callsites should
> be fairly similar once they're available, except that you'll trigger a
> call-site update rather than rewriting a function pointer.
>
> This neglects all sorts of fun details (threading, garbage collection of
> old function implementations), but hopefully it gives you a place to
> start.
>
>
> Regarding laziness, as Hal mentioned you'll have to provide some target
> support for PowerPC to support lazy compilation. For a rough guide you can
> check out the X86_64 support code in
> llvm/include/llvm/ExecutionEngine/Orc/OrcTargetSupport.h and
> llvm/lib/ExecutionEngine/Orc/OrcTargetSupport.cpp.
>
> There are two methods that you'll need to implement:
> insertCompileCallbackTrampoline and insertResolverBlock. These work
> together to enable lazy compilation. Both of these methods inject blobs of
> target-specific code into the JIT process. To do this (at least for now) I
> make use of a handy feature of LLVM IR: You can write raw assembly code
> directly into a bitcode module ("module-level asm"). If you look at the X86
> implementation of each of these methods you'll see they're written in terms
> of string-streams building up a string of assembly which will be handed off
> to the JIT to compile like any other code.
>
> The first blob that you need to be able to output is the resolver block.
> The purpose of the resolver block is to save program state and call back
> into the JIT to trigger lazy compilation of a function. When the JIT is done
> compiling the function it returns the address of the compiled function to
> the resolver block, and the resolver block returns to the compiled function
> (rather than its original return address).
>
> Because all functions share the same resolver block, the JIT needs some
> way to distinguish them, which is where the trampolines come in. The JIT
> emits one trampoline per function and each trampoline just calls the
> resolver block. The return address of the call in each trampoline provides
> the unique address that the JIT associates with the to-be-compiled
> functions. The CompileCallbackManager manages this association between
> trampolines and functions for you; you just need to provide the
> resolver/trampoline primitives.
>
> In case it helps, here's what the output of all this looks like on X86.
> Trampolines are trivial - they're emitted in blocks and preceded by a
> pointer to the resolver block:
>
> module asm "Lorc_resolve_block_addr:"
> module asm "  .quad 140439143575560"
> module asm "orc_jcc_0:"
> module asm "  callq *Lorc_resolve_block_addr(%rip)"
> module asm "orc_jcc_1:"
> module asm "  callq *Lorc_resolve_block_addr(%rip)"
> module asm "orc_jcc_2:"
> module asm "  callq *Lorc_resolve_block_addr(%rip)"
> ...
>
>
> The resolver block is more complicated and I won't provide the full code
> for it here. You can find it by running:
> lli -jit-kind=orc-lazy -orc-lazy-debug=mods-to-stderr <hello_world.ll>
>
>
>
> and looking at the initial output. In pseudo-asm though, it looks like
> this:
>
> module asm "jit_callback_manager_addr:"
> module asm "  .quad 0x46fc190" // <- address of callback manager object
> module asm "orc_resolver_block:"
> module asm "  // save register state"
> module asm "  // load jit_callback_manager_addr into %rdi"
> module asm "  // load the return address (from the trampoline call) into %rsi"
> module asm "  // %rax = call jit(%rdi, %rsi)"
> module asm "  // save %rax over the return address"
> module asm "  // restore register state"
> module asm "  // retq"
>
> So, that's a whirlwind intro to implementing lazy JITing support for a new
> architecture in Orc. I'll try to answer any questions you have on the
> topic, though I'm not familiar with PowerPC at all. If you're comfortable
> with PowerPC assembly I think it should be possible to implement once you
> grok the concepts.
>
> Hope this helps!
>
> Cheers,
> Lang.
>
>
> On Jul 26, 2015, at 11:17 PM, Revital1 Eres <ERES at il.ibm.com> wrote:
>
> Hi Again,
>
> I'm a little confused about which exact Orc functions I should use in
> order to save the functions' code in a code cache so that it can later be
> replaced with different versions, and I would appreciate your help.
>
> Just a reminder: I want to dynamically recompile the program based on a
> profile collected at run-time. I would like to start executing the program
> from the code cache and at some point be able to replace a function body
> with its newly compiled version; this can be done by replacing the entry of
> the function's code with a trampoline to its new version, so that future
> calls to it will call the new version's code.
>
> Does the CompileOnDemandLayer execute the program from a code cache
> and hold pointers to the code of the functions it executes? I am
> compiling for a Power machine.
> Are there target-specific pieces that I should implement to make Orc
> work on Power?
>
> Thanks again,
> Revital
>
>
>
>
> From:        Lang Hames <lhames at gmail.com>
> To:        Revital1 Eres/Haifa/IBM at IBMIL
> Cc:        LLVM Developers Mailing List <llvmdev at cs.uiuc.edu>
> Date:        20/07/2015 08:41 PM
> Subject:        Re: [LLVMdev] Help with using LLVM to re-compile hot
> functions at run-time
> ------------------------------
>
>
>
> Hi Revital,
>
> The CompileOnDemand layer is used by the lazy bitcode JIT in the lli tool.
> You can find the code in llvm/tools/lli/OrcLazyJIT.* .
>
> Cheers,
> Lang.
>
>
> On Mon, Jul 20, 2015 at 2:32 AM, Revital1 Eres <ERES at il.ibm.com> wrote:
> Hello Lang,
>
> Thanks for your answer.
>
> I am now looking for an example of the usage of CompileOnDemandLayer. Is
> there an example available for that (could not find one in llvm/examples)?
>
> Thanks,
> Revital
>
>
>
> From:        Lang Hames <lhames at gmail.com>
> To:        Revital1 Eres/Haifa/IBM at IBMIL
> Cc:        LLVM Developers Mailing List <llvmdev at cs.uiuc.edu>
> Date:        10/07/2015 12:10 AM
> Subject:        Re: [LLVMdev] Help with using LLVM to re-compile hot
> functions at run-time
> ------------------------------
>
>
>
> Hi Revital,
>
> LLVM does have an IR interpreter, but I don't think it's maintained well
> (or possibly at all). The interpreter is also not designed to interact with
> the LLVM JITs.
>
> We generally encourage people to just JIT LLVM IR, rather than
> interpreting it. For the use-case you have described, you could JIT IR with
> no optimizations to begin with, then re-JIT hot functions at a higher
> optimization level.
>
> The Orc JIT APIs (LLVM's newer JIT APIs) were written with this kind of
> use-case in mind, and are probably a better fit for this than MCJIT. There
> is no built-in hot-function detection or recompilation yet, but I think
> this would be *fairly* easy to write in terms of Orc's callback API.
>
> Cheers,
> Lang.
>
>
> On Thu, Jul 9, 2015 at 4:19 AM, Revital1 Eres <ERES at il.ibm.com> wrote:
> Hello,
>
> I am new to LLVM and I would appreciate your help with the following:
>
> I want to run LLVM IR through a virtual machine (the LLVM interpreter?) and
> JIT-compile the hot functions (using MCJIT).
>
> This task will require, among other things, identifying the hot functions
> and having a code cache that is patched with the native code of the
> functions after they are JITted.
>
> I've read about MCJIT and lli so far; however, I have not seen that the LLVM
> interpreter can be used as a VM in the way I was looking for, meaning:
> execute the code one instruction at a time, have a profiling mode to
> identify hot functions, and call the JIT to compile the hot functions.
>
> I appreciate any advice/starting points for this project.
>
> Thanks,
> Revital
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu
> http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fully_lazy_with_recompile.tgz
Type: application/x-gzip
Size: 27632 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150917/63f9f733/attachment-0001.bin>