[LLVMdev] [unladen-swallow] Re: Why does the x86-64 JIT emit stubs for external calls?

Thu Jun 11 16:24:50 PDT 2009

On Thu, Jun 11, 2009 at 12:54 PM, Evan Cheng<evan.cheng at apple.com> wrote:
>
> On Jun 10, 2009, at 12:17 PM, Jeffrey Yasskin wrote:
>
>> In X86CodeEmitter.cpp, the following code appears in the handler used for
>> CALL64pcrel32 instructions:
>>
>>       // Assume undefined functions may be outside the Small codespace.
>>       bool NeedStub =
>>         (Is64BitMode &&
>>             (TM.getCodeModel() == CodeModel::Large ||
>>              TM.getSubtarget<X86Subtarget>().isTargetDarwin())) ||
>>         Opcode == X86::TAILJMPd;
>>       emitGlobalAddress(MO.getGlobal(), X86::reloc_pcrel_word,
>>                         MO.getOffset(), 0, NeedStub);
>>
>> This causes every external call to be emitted as a call to a stub
>> which then jumps to the real function.
>> I understand, thanks to the helpful folks on #llvm, that calls across
>> more than 31 bits of address space need to be emitted as a "mov
>> $ADDRESS, r10; call *r10" pair instead of the simple "call
>> rip+ADDRESS" used for calls within 31 bits. But why isn't the mov+call
>> pair emitted inline? And why are Darwin and TAILJMPs special?
>
> This is needed because of lazy compilation: before the callee is resolved,
> it is just a JIT stub.

Even with lazy compilation, the contents of the stub get emitted (by
JITEmitter::getPointerToGlobal) as a direct call to the function, not
the compilation callback, because the function is an external
declaration. You can watch this happen with the following program:

declare i32 @rand()

define i32 @main() nounwind {
entry:
	%call = tail call i32 @rand()		; <i32> [#uses=1]
	%add = add i32 %call, 2		; <i32> [#uses=1]
	ret i32 %add
}

and the command line `lli -debug-only=jit -march=x86-64 test.bc`.

With lazy compilation and a call to an internal function, the
JITEmitter can emit a stub even if MachineRelocation::doesntNeedStub()
(which is where the NeedStub value ends up) returns true. Only a return
value of false constrains the emitter.
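
The rule in the previous paragraph can be restated as a small truth
table in code. This is only a sketch of the behaviour as described here,
not LLVM's implementation: the enum, the function name, and the
calleeNotYetCompiled parameter are hypothetical, and only the
doesntNeedStub flag comes from the discussion above.

```cpp
// Hypothetical model of the stub decision described above; not LLVM code.
enum class CallTarget { Direct, ViaStub };

// doesntNeedStub:       the value MachineRelocation::doesntNeedStub() reports.
// calleeNotYetCompiled: assumed to mean "lazy compilation is on and the
//                       callee is an internal function with no code yet".
constexpr CallTarget chooseCallTarget(bool doesntNeedStub,
                                      bool calleeNotYetCompiled) {
  // false is a hard constraint (a stub is required); true is only a
  // permission, so the emitter may still pick a stub for a lazy callee.
  return (!doesntNeedStub || calleeNotYetCompiled) ? CallTarget::ViaStub
                                                   : CallTarget::Direct;
}
```

In this model, chooseCallTarget(true, false) is the only combination
that yields a direct call.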

> It's heap allocated so it may not be in the lower 4G
> even if the code size model is small. I know this is the case on Darwin
> x86_64, I am not sure about other targets.

Oh, other targets can certainly allocate code above 4G too.
sys::Memory::AllocateRWX just uses mmap with no constraints on the returned
address, and I've got a Linux desktop where that always produces an
address over 4G.
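
For anyone who wants to check this on their own machine, here is a
rough sketch. It is not taken from LLVM: the helper names are made up
for illustration, the page is mapped read/write rather than RWX to stay
clear of platform W^X policies, and the 4096-byte size is an assumption.
fitsInRel32 is the reach test for the pc-relative call form, and
mapOnePage returns wherever an unconstrained mmap happens to land.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <sys/mman.h>

// True if a call whose following instruction sits at `nextInsn` can reach
// `target` with a signed 32-bit pc-relative displacement.
bool fitsInRel32(std::uint64_t nextInsn, std::uint64_t target) {
  // Unsigned subtraction wraps; read as signed, it is the distance.
  const std::int64_t disp = static_cast<std::int64_t>(target - nextInsn);
  return disp >= std::numeric_limits<std::int32_t>::min() &&
         disp <= std::numeric_limits<std::int32_t>::max();
}

// Displacement to encode in the call; the caller must check reach first.
std::int32_t rel32Displacement(std::uint64_t nextInsn, std::uint64_t target) {
  assert(fitsInRel32(nextInsn, target) && "target out of rel32 range");
  return static_cast<std::int32_t>(
      static_cast<std::int64_t>(target - nextInsn));
}

// Maps one anonymous page with no address hint and returns its address,
// or 0 if the mapping fails. The page is intentionally left mapped.
std::uint64_t mapOnePage() {
  void *p = mmap(nullptr, 4096, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED)
    return 0;
  return static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(p));
}
```

Passing the result of mapOnePage() and the address of an external
function such as rand to fitsInRel32 shows whether a plain pc-relative
call from that page could reach it. The answer depends on the OS and on
address-space randomization, so it can differ between runs.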

> I forgot why this is needed for
> tail calls, sorry.
>
> In theory we can make the code generator inline mov+call, but in reality it
> doesn't know whether it's jitting or not. Also, we really want to keep the
> code generation the same (as much as possible) whether it's jitting or
> compiling. One possible solution for this is to add a code size model
> specifically for the JIT so the code generator can generate more efficient
> code in that configuration.

For non-JIT, the code generator doesn't ever need a stub, right? The
linker does it using the relocation information? It must be ignoring
the NeedStub parameter. ... But wait, is this code generator used for
anything besides the JIT? Compiling uses the AsmPrinter until direct
object code generation lands, and presumably they're redesigning this
whole subsystem.

It sounds like I'd have to fully understand the whole structure of the
code generator to fix this, and for <=2% performance, that's not
really worth it. I'll probably wait for the direct object code people
to get around to it. Thanks though.

>>
>> Having this out of line seems to lose up to 2% performance on the
>> Unladen Swallow benchmarks, so, while it's not urgent, it'd be nice to
>> figure out how to avoid the stubs.
>>
>> What kind of patch would be welcome to fix this?
>>
>> Thanks,
>> Jeffrey