Hi Viktor,<br><br><div class="gmail_quote">On Tue, Apr 5, 2011 at 9:41 PM, Óscar Fuentes <span dir="ltr"><<a href="mailto:ofv@wanadoo.es">ofv@wanadoo.es</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<div class="im">Jim Grosbach <<a href="mailto:grosbach@apple.com">grosbach@apple.com</a>> writes:<br>
<br>
>> To me, increasing coverage of the FastISel seemed more involved than<br>
>> directly emitting opcodes to memory, with a lesser outlook on<br>
>> reducing overhead.<br>
><br>
> That seems extremely unlikely. You'd be effectively re-implementing<br>
> both fast-isel and the MC binary emitter layers, and it sounds like a<br>
> new register allocator as well.<br>
><br>
> What Eric is suggesting is instead locating which IR constructs are<br>
> not being handled by fast-isel and are causing problems (i.e., are<br>
> being frequently encountered in your code-base) and implementing<br>
> fast-isel handling for them. That will remove the selectiondag<br>
> overhead that you've identified as the primary compile-time problem.<br>
<br>
</div>At some point in the past, someone was kind enough to add fast-isel support for<br>
some instructions frequently emitted by my compiler, hoping that it<br>
would speed up JITting. The results were disappointing (negligible,<br>
IIRC). Either fast-isel does not make much of a difference, or the main<br>
inefficiency is elsewhere.<br>
<div><div></div><br></div></blockquote><div><br> fast-isel discussion aside, I think the real speed killer for a dynamic binary translator (or any other JIT client that invokes it many times on small pieces of code) is the constant per-invocation overhead of the JIT, which is incurred for every source-ISA BB (each BB gets mapped to an LLVM Function).<br>
<br>[1] cites a constant overhead of 10 ms per BB. I just did some simple measurements with callgrind, running lli on a simple .ll file that contains only a main function which immediately returns. With -regalloc=fast and -fast-isel, and an -O2-compiled lli, we spend about 725,000 instructions in getPointerToFunction(). Clearly, that's quite some constant overhead, and I doubt that we can get it down by two orders of magnitude, so what about this:<br>
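For reference, the trivial input I mean would look roughly like this (a minimal sketch, not the exact file I measured):

```llvm
define i32 @main() {
entry:
  ret i32 0
}
```

Even for this, the JIT pays the full per-function setup cost before any generated code runs.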
<br>The old qemu JIT used an extremely simple and fast approach which performed surprisingly well: having chunks of precompiled machine code (generated from C sources) for the individual IR instructions, which at runtime get glued together and patched as necessary.<br>
<br>The idea would be to use the same approach to generate machine code from LLVM IR, e.g. having chunks of LLVM MC instructions for the individual LLVM IR instructions (ideally describing the mapping with TableGen), and gluing them together with no dynamic register allocation and no scheduling.<br>
<br>I'd be willing to mentor such a project, let me know if you're interested.<br><br>Regards,<br><br>Tilmann<br><br><br>[1] <a href="http://www.iaeng.org/publication/IMECS2011/IMECS2011_pp212-216.pdf">http://www.iaeng.org/publication/IMECS2011/IMECS2011_pp212-216.pdf</a><br>
</div></div>