<font size=3>Hi Lang,</font>

<br>

<br><font size=2 face="Calibri">Apologies if you receive multiple copies</font><font size=3>

of this email.</font>

<br>

<br><font size=3>After spending some time debugging the Kaleidoscope Orc fully_lazy

toy example on</font>

<br><font size=3>x86, I want to start implementing the run-time optimizer as

you suggested, and again</font>

<br><font size=3>I highly appreciate your help.</font>

<br><font size=3>For now I'll defer the target-specific implementation

to the end, after I have</font>

<br><font size=3>the non-target parts in place, since I can run on x86 as a

start.</font>

<br><font size=3>Given a simple example of a main function calling foo and

bar:</font>

<br><font size=3>IIUC I should start from the IR level of this module, which

means that</font>

<br><font size=3>ParseIRFile will first be called on the IR of the program.

Is that right?</font>

<br>

<br><font size=3>I would like to make sure I understand your suggestion

which is to insert a new</font>

<br><font size=3>layer that should be implemented on top of the CompileCallbackLayer

in order to</font>

<br><font size=3>be able to call trigger_condition at the beginning of

a function.</font>

<br><font size=3>IIUC, until a function (foo or bar) is optimized, calls

to it will</font>

<br><font size=3>go through the resolver (foo and bar will not be compiled

from scratch every</font>

<br><font size=3>time we go through the resolver; after the first compilation

the cached non-optimized</font>

<br><font size=3>version is executed). The resolver will check

trigger_condition</font>

<br><font size=3>to see whether the cached non-optimized version should be executed

or a new</font>

<br><font size=3>optimized version should be compiled and executed.</font>

<br><font size=3>Once trigger_condition is true, foo and bar will be

compiled to generate</font>

<br><font size=3>their optimized versions, and these versions will be executed

directly from then on</font>

<br><font size=3>(not going through the resolver any more). Is that right?</font>
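
<br><font size=3>To make sure I have the mechanics right, here is a small stand-alone sketch of how I picture it. It is plain C++ and uses no LLVM or Orc APIs. The names (FnPtr, FooPtr, fooImpl, fooOptimized, jitRecompileHot, HotThreshold) and the call-count trigger are placeholders I made up, and the "recompile" step just returns a second hand-written function instead of compiling IR at a higher optimization level. I also assume the function returns right after calling the optimized version rather than falling through to the old body.</font>

```cpp
#include <cassert>

// Hypothetical stand-ins: none of these names come from LLVM or Orc.
using FnPtr = int (*)(int);

int fooImpl(int X);       // non-optimized version, carries the trigger check
int fooOptimized(int X);  // stands in for the recompiled, optimized version

// Every call to foo is indirected through this pointer.
FnPtr FooPtr = fooImpl;

int FooCallCount = 0;
const int HotThreshold = 3;  // assumed trigger: a simple call counter
int RecompileCount = 0;

// Stand-in for jit_recompile_hot. A real JIT would add foo's IR to a second,
// more aggressively optimizing compile layer and look up the new address.
FnPtr jitRecompileHot(FnPtr *Slot) {
  assert(Slot != nullptr);
  ++RecompileCount;
  *Slot = fooOptimized;  // update the function pointer for later calls
  return fooOptimized;
}

int fooImpl(int X) {
  if (++FooCallCount >= HotThreshold) {  // trigger_condition
    FnPtr FooOpt = jitRecompileHot(&FooPtr);
    return FooOpt(X);  // run the optimized version for this call
  }
  return X + X;  // original foo body
}

// Same result as the original body; only the implementation differs.
int fooOptimized(int X) { return 2 * X; }

// What a call site in main would do: call through the pointer.
int callFoo(int X) { return FooPtr(X); }
```

<br><font size=3>With this sketch, the first two calls to callFoo run the non-optimized body, the third call hits the trigger and switches FooPtr, and every later call goes straight to fooOptimized without re-checking the trigger.</font>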

<br><font size=3>Should this layer on top of the CompileCallbackLayer

be similar to</font>

<br><font size=3>class KaleidoscopeJIT?</font>

<br><font size=3>I saw that in the Kaleidoscope Orc example the lambda functions

that are added in</font>

<br><font size=3>createLambdaResolver are executed by the resolver

before compiling a call,</font>

<br><font size=3>so I assume that trigger_condition should also be added

via</font>

<br><font size=3>createLambdaResolver, so that before compiling foo or bar the

lambda functions</font>

<br><font size=3>that are added by calling createLambdaResolver and contain

trigger_condition</font>

<br><font size=3>will be executed. Is that right?</font>

<br>

<br><font size=3>IIUC in the Kaleidoscope Orc example the interpreter calls

addModule upon</font>

<br><font size=3>parsing a call expression in HandleTopLevelExpression.</font>

<br><font size=3>In my case I assume addModule will be called for the module

returned from</font>

<br><font size=3>ParseIRFile, right?</font>

<br><font size=3>In this case, will calling getAddress on the whole module

(the IR of all</font>

<br><font size=3>functions) trigger calls to the lambda functions defined

in</font>

<br><font size=3>createLambdaResolver for foo and bar? Also,

in the Kaleidoscope Orc</font>

<br><font size=3>example the function is executed explicitly

in</font>

<br><font size=3>HandleTopLevelExpression after calling getAddress, and

it's not clear to me where</font>

<br><font size=3>I should insert this in my case.</font>

<br>

<br><font size=3>Thanks again,</font>

<br><font size=3>Revital</font>

<br>

<br>

<br><tt><font size=2>Lang Hames <lhames@gmail.com> wrote on 28/07/2015

05:58:41 AM:<br>

<br>

> From: Lang Hames <lhames@gmail.com></font></tt>

<br><tt><font size=2>> To: Revital1 Eres/Haifa/IBM@IBMIL</font></tt>

<br><tt><font size=2>> Cc: LLVM Developers Mailing List <llvmdev@cs.uiuc.edu></font></tt>

<br><tt><font size=2>> Date: 28/07/2015 05:58 AM</font></tt>

<br><tt><font size=2>> Subject: Re: [LLVMdev] Help with using LLVM to

re-compile hot <br>

> functions at run-time</font></tt>

<br><tt><font size=2>> <br>

> Hi Revital,</font></tt>

<br><tt><font size=2>> <br>

> What do you mean by "code cache"? Orc (and MCJIT) does have

the <br>

> concept of an ObjectCache, which is a long-lived, potentially <br>

> persistent, compiled version of some IR. It's not a key component

of<br>

> the JIT though: Most clients run without a cache attached and just

<br>

> JIT their code from scratch in each session.</font></tt>

<br><tt><font size=2>> <br>

> Recompilation is orthogonal to caching. There is no in-tree support

<br>

> for recompilation yet. There are several ways that it could be <br>

> supported, depending on what security / performance trade-offs <br>

> you're willing to make, and how deep into the LLVM code you want

to<br>

> get. As things stand at the moment all function calls in the lazy

<br>

> JIT are indirected via function pointers. We want to add support for<br>

> patchable call-sites, but this hasn't been implemented yet. The <br>

> indirect calls make recompilation reasonably easy: You could add a

<br>

> transform layer on top of the CompileCallbackLayer which would <br>

> modify each function like this:</font></tt>

<br><tt><font size=2>> <br>

> void foo$impl() {          void foo$impl() {</font></tt>

<br><tt><font size=2>>   // foo body        ->        if (trigger_condition) {</font></tt>

<br><tt><font size=2>> }                              auto fooOpt = jit_recompile_hot(&foo);</font></tt>

<br><tt><font size=2>>                                fooOpt();</font></tt>

<br><tt><font size=2>>                              }</font></tt>

<br><tt><font size=2>>                              // foo body</font></tt>

<br><tt><font size=2>>                            }</font></tt>

<br><tt><font size=2>> <br>

> You would implement the jit_recompile_hot function yourself in your

<br>

> JIT and make it available to JIT'd code via the SymbolResolver. When<br>

> the trigger condition is met you'll get a call to recompile foo, at

<br>

> which point you: (1) Add the IR for foo to a 2nd IRCompileLayer that<br>

> has been configured with a higher optimization level, (2) look up

<br>

> the address of the optimized version of foo, and (3) update the <br>

> function pointer for foo to point at the optimized version. The <br>

> process for patchable callsites should be fairly similar once <br>

> they're available, except that you'll trigger a call-site update <br>

> rather than rewriting a function pointer.</font></tt>

<br><tt><font size=2>> <br>

> This neglects all sorts of fun details (threading, garbage <br>

> collection of old function implementations), but hopefully it gives

<br>

> you a place to start. </font></tt>

<br><tt><font size=2>> <br>

> Regarding laziness, as Hal mentioned you'll have to provide some <br>

> target support for PowerPC to support lazy compilation. For a rough

<br>

> guide you can check out the X86_64 support code in llvm/include/<br>

> llvm/ExecutionEngine/Orc/OrcTargetSupport.h and llvm/lib/<br>

> ExecutionEngine/Orc/OrcTargetSupport.cpp.</font></tt>

<br><tt><font size=2>> <br>

> There are two methods that you'll need to implement: <br>

> insertCompileCallbackTrampoline and insertResolverBlock. These work

<br>

> together to enable lazy compilation. Both of these methods inject

<br>

> blobs of target-specific code into the JIT process. To do this (at

<br>

> least for now) I make use of a handy feature of LLVM IR: You can <br>

> write raw assembly code directly into a bitcode module ("module-<br>

> level asm"). If you look at the X86 implementation of each of

these <br>

> methods you'll see they're written in terms of string-streams <br>

> building up a string of assembly which will be handed off to the JIT<br>

> to compile like any other code.</font></tt>

<br><tt><font size=2>> <br>

> The first blob that you need to be able to output is the resolver

<br>

> block. The purpose of the resolver block is to save program state

<br>

> and call back in to the JIT to trigger lazy compilation of a <br>

> function. When the JIT is done compiling the function it returns the<br>

> address of the compiled function to the resolver block, and the <br>

> resolver block returns to the compiled function (rather than its <br>

> original return address).</font></tt>

<br><tt><font size=2>> <br>

> Because all functions share the same resolver block, the JIT needs

<br>

> some way to distinguish them, which is where the trampolines come

<br>

> in. The JIT emits one trampoline per function and each trampoline

<br>

> just calls the resolver block. The return address of the call in <br>

> each trampoline provides the unique address that the JIT associates

<br>

> with the to-be-compiled functions. The CompileCallbackManager <br>

> manages this association between trampolines and functions for you,

<br>

> you just need to provide the resolver/trampoline primitives.</font></tt>

<br><tt><font size=2>> <br>

> In case it helps, here's what the output of all this looks like on

<br>

> X86. Trampolines are trivial - they're emitted in blocks and <br>

> preceded by a pointer to the resolver block:</font></tt>

<br><tt><font size=2>> <br>

> module asm "Lorc_resolve_block_addr:"</font></tt>

<br><tt><font size=2>> module asm "  .quad 140439143575560"</font></tt>

<br><tt><font size=2>> module asm "orc_jcc_0:"</font></tt>

<br><tt><font size=2>> module asm "  callq *Lorc_resolve_block_addr(%rip)"</font></tt>

<br><tt><font size=2>> module asm "orc_jcc_1:"</font></tt>

<br><tt><font size=2>> module asm "  callq *Lorc_resolve_block_addr(%rip)"</font></tt>

<br><tt><font size=2>> module asm "orc_jcc_2:"</font></tt>

<br><tt><font size=2>> module asm "  callq *Lorc_resolve_block_addr(%rip)"</font></tt>

<br><tt><font size=2>> ...</font></tt>

<br><tt><font size=2>> <br>

> The resolver block is more complicated and I won't provide the full

<br>

> code for it here. You can find it by running:</font></tt>

<br><tt><font size=2>> <br>

> lli -jit-kind=orc-lazy -orc-lazy-debug=mods-to-stderr <hello_world.ll></font></tt>

<br><tt><font size=2>> <br>

> and looking at the initial output. In pseudo-asm though, it looks

like this:</font></tt>

<br><tt><font size=2>> <br>

> module asm "jit_callback_manager_addr:"</font></tt>

<br><tt><font size=2>> module asm "  .quad 0x46fc190"

// <- address of callback manager object</font></tt>

<br><tt><font size=2>> module asm "orc_resolver_block:"</font></tt>

<br><tt><font size=2>> module asm "  // save register state."</font></tt>

<br><tt><font size=2>> module asm "  // load jit_callback_manager_addr into %rdi"</font></tt>

<br><tt><font size=2>> module asm "  // load the return address (from the trampoline call) into %rsi"</font></tt>

<br><tt><font size=2>> module asm "  // %rax = call jit(%rdi, %rsi)"</font></tt>

<br><tt><font size=2>> module asm "  // save %rax over the return address"</font></tt>

<br><tt><font size=2>> module asm "  //  restore register state"</font></tt>

<br><tt><font size=2>> module asm "  //  retq"</font></tt>

<br><tt><font size=2>> <br>

> So, that's a whirlwind intro to implementing lazy JITing support for<br>

> a new architecture in Orc. I'll try to answer any questions you have<br>

> on the topic, though I'm not familiar with PowerPC at all. If you're<br>

> comfortable with PowerPC assembly I think it should be possible to

<br>

> implement once you grok the concepts.</font></tt>

<br><tt><font size=2>> <br>

> Hope this helps!</font></tt>

<br><tt><font size=2>> <br>

> Cheers,</font></tt>

<br><tt><font size=2>> Lang.</font></tt>

<br><tt><font size=2>> <br>

> On Jul 26, 2015, at 11:17 PM, Revital1 Eres <ERES@il.ibm.com>

wrote:<br>

</font></tt>