[llvm-dev] ORC JIT Weekly #34 -- ORC Runtime and JITLink improvements, and a performance question

Sun May 16 22:53:44 PDT 2021

Hi All,

Just a few minor updates this week:

- Initial ORC runtime unit testing infrastructure has landed. Now that all
the basic infrastructure is in place I plan to start rolling out
implementation patches next week.

- ObjectLinkingLayer acquired support for JITLink LinkGraphs as first-class
input (on a par with object files).

- LinkGraph debug dumping has been improved.

An interesting question came up in discussion with @Xexizy in #jit on the
LLVM Discord server: Given that we're re-using the static compiler and
trying to match the native code model, how and where would we expect JIT'd
code performance to differ from AOT code performance? Leaving aside
compilation and linking overhead and focusing on performance of JIT'd code
once it's in memory, here are a few quick observations / thoughts:

* Feedback on JIT'd code performance has been sparse. Performance has been
good enough for my use cases so far, so I haven't gotten around
to measuring it systematically. It would be cool to build some ORC JIT
benchmarks, but exactly what those benchmarks should look like is not clear
yet.

* Memory layout: We make no attempt to lay memory out in a way that is
friendly to the memory system, though the user has some control over that
through their choice of memory manager implementation. The cost of
poor memory layout will vary from system to system. In theory JITLink
should give us enough information and flexibility to re-layout function
bodies in memory (using some sort of measurement/analysis to determine a
favorable layout). Nobody has actually tried this yet to my knowledge.

* Indirect access: RuntimeDyld usually uses indirect access through
registers for functions and globals. On some platforms this may be quite
inefficient. JITLink uses direct calls, synthesizes jump stubs for external
call targets only, and uses global offset tables to access data. Built-in
JITLink optimizations opportunistically bypass the jump stubs and GOT loads
whenever the target ends up being in range. The resulting linked code
should be nearly identical to the ahead-of-time compiled versions.

* Laziness: Lazy compilation in ORCv1 always used pointer stubs, and this
is the default behavior in ORCv2 too. JITLink allows us to identify call
sites, which we could use to rewrite calls (security model permitting) to
bypass the stubs after function bodies are lazily compiled.

* Thread local variables: The JIT only supports emulated thread local
variables at the moment (where it supports them at all). The ORC runtime
enables support for native thread locals on MachO, but the current
implementation isn't optimized -- performance still won't be as good as
pre-compiled TLVs. Future implementations should be able to reduce the cost
to something much closer to pre-compiled TLVs.

I'll keep thinking about this and add to this list if I come up with more.
If any of you have thoughts or insights on JIT'd code performance that
you'd like to share please jump in.

-- Lang.