[LLVMdev] Odd code layout requirements for MCJIT

Tue Nov 18 20:21:28 PST 2014

I'm part of a team working on adding an llvm codegen backend to HHVM 
(PHP JIT, http://hhvm.com) using MCJIT. We have a code layout problem 
and I'm looking for opinions on good ways to solve it.

The short version is that the memory we emit code into is split into a 
few different areas, and we'd like a way to control which area each 
BasicBlock ends up in during codegen. I know this probably sounds pretty 
odd, so here's a much more detailed explanation:

Our translation unit is a "tracelet" which is typically a single basic 
block of PHP code. We emit code one tracelet at a time as it's needed 
for execution, so the physical layout of the generated code looks like 
the execution path taken the first time the code is run (we're actually 
a little smarter than that now, but this description is good enough for 
the problem at hand).

We're under pretty severe icache/iTLB pressure, so we do whatever we can 
to keep the hot path as compact as possible. One of the ways we do this 
is by dividing our code cache into three fixed-size areas: main, cold, 
and frozen. Our current, non-llvm codegen backend has one area tag per 
basic block, and most tracelets we compile will span all three areas. 
This means that if we emit code for tracelets A and then B, A's main 
code will be followed immediately by B's main code (and the same for 
cold/frozen). It ends up looking something like this:

good layout: | A_main B_main <unused> | A_cold B_cold <unused> |
                A_frozen... |

Main is the primary execution path. Cold contains code that we expect to 
happen rarely. Conditional branches from main to cold are generally 
taken <5% of the time. Frozen contains code that we expect to almost 
never run, but must have for correctness (mostly for handling 
exceptions). Conditional branches from main to frozen are generally 
taken <0.01% of the time. The fact that we have three areas isn't really 
relevant to the problem itself; I think any solution should allow an 
arbitrary number of areas.

We've done some experiments where we collapse all three areas into one, 
so A's main code is followed by A's cold code, which is followed by A's 
frozen code, and so on for B and other tracelets, like this:

bad layout: | A_main A_cold A_frozen B_main B_cold B_frozen... |

This causes an unacceptable performance regression, so for our 
experimental llvm backend we need to come up with a solution.

We're using a custom calling convention to model each tracelet as a 
single llvm function. I know there are intrinsics in llvm that allow us 
to give blocks relative probabilities, and if I'm understanding things 
correctly that should allow us to achieve the bad layout above, since 
the code for a given llvm function will be emitted in one contiguous 
chunk, regardless of how the blocks are organized within it. I'd like to 
figure out a way to achieve the good layout.

The best idea I have so far is to annotate each BasicBlock with an area 
tag like we do in our existing codegen backend, and then teach the llvm 
codegen backend how to place each of those blocks in the appropriate 
part of our code cache. We have a custom subclass of RTDyldMemoryManager 
to get it emitting all code in the main area right now; I assume any 
solution would impact the API of this class to get at the different areas.

If we have to split the main/cold/frozen parts of each tracelet into 
different llvm functions we might be able to make that work but I'm 
skeptical. My main concern is that cold/frozen code nearly always uses a 
bunch of values from the main code path that branched to it, and if the 
overhead of getting from main to cold/frozen is more than a single 
jmp/jcc instruction (on x86) that's going to be a dealbreaker.

Does this sound doable, and is it something you'd be ok having in llvm? 
If we can come up with a good design my team is happy to do the actual 
work, though if anyone else is interested in doing it we certainly won't 
complain :).

Thanks!
Brett