[LLVMdev] Some MCJIT benchmark numbers

Kevin Modzelewski kmod at dropbox.com
Mon Nov 18 18:53:31 PST 2013


So I finally took the plunge and switched to MCJIT (wasn't too bad, as long
as you remember to call InitializeNativeTargetDisassembler if you want
disassembly...), and I got the functionality to a point I was happy with so
I wanted to test the performance of the system.  I created a simple
benchmark and thought I'd share the results, both because I personally had
no idea what the results would be, and because it seems like there's some
low-hanging fruit for improving performance.

My JIT is currently structured as creating a new module per function it
wants to jit; I had experimented with using an approach where I had an
"incubator module" where all IR starts, and then on-demand extract it to
"compilation modules" when I want to send it to MCJIT, but my experience
was that this wasn't very helpful.  (My goal was to enable cross-function
optimizations such as inlining, but there's no easy way [and it might not
even make sense] to run module-level optimizations on a single function.)

The benchmark I set up is a simple REPL loop, where the input is a
pre-parsed no-op statement.  I put this in a loop and measured the amount
of time it took, and tested it at 1k iterations and 10k iterations.  This
includes my IR-generation, but my expectation is that that should be
negligible compared to the MCJIT time (confirmed through profiling).  The
absolute numbers are from a Release build with asserts turned off (this
made a big difference), and the percentages are from a Release+Profiling
build.

For 1k iterations, the test took about 640ms on my desktop machine, i.e.
0.64ms per module.  Looking at the profiling results, about 47% of the
time is spent in PassManagerImpl::run, and another 47% is spent in
addPassesToEmitMC, which feels like it could be avoided by doing that
setup just once.  Of the time spent in PassManagerImpl::run, about 35% is
PassManager overhead such as initializeAnalysisImpl() /
removeNotPreservedAnalysis() / removeDeadPasses().

For 10k iterations, the test took about 12.6s, or 1.26ms per module, so
there's definitely some slowdown happening.  Looking at the profiling
output, it looks like the main difference is the appearance of
MCJIT::finalizeLoadedModules(), which ultimately calls
RuntimeDyldImpl::resolveRelocations() and
SectionMemoryManager::applyMemoryGroupPermissions(), both of which iterate
over all memory sections, leading to quadratic overhead.  I'm not sure how
easy it would be, but it seems like there could be single-module variants
of these APIs that would cut down on the overhead, since it looks like
MCJIT knows which modules need to be finalized but doesn't pass this
information to the dyld / memory manager.


My overall takeaway from these numbers is pretty good: they're good enough
for where my JIT is right now, and it seems like there's some
relatively-straightforward work that can be done to make them better.  I'm
curious what other people think.

Kevin