[LLVMdev] [unladen-swallow] Re: Why does the x86-64 JIT emit stubs for external calls?

Wed Jun 17 23:47:02 PDT 2009

On Jun 11, 2009, at 4:24 PM, Jeffrey Yasskin wrote:

> On Thu, Jun 11, 2009 at 12:54 PM, Evan Cheng<evan.cheng at apple.com>  
> wrote:
>>
>>
>>
>> On Jun 10, 2009, at 12:17 PM, Jeffrey Yasskin wrote:
>>
>>> In X86CodeEmitter.cpp, the following code appears in the handler
>>> used for CALL64pcrel32 instructions:
>>>
>>>       // Assume undefined functions may be outside the Small codespace.
>>>       bool NeedStub =
>>>         (Is64BitMode &&
>>>             (TM.getCodeModel() == CodeModel::Large ||
>>>              TM.getSubtarget<X86Subtarget>().isTargetDarwin())) ||
>>>         Opcode == X86::TAILJMPd;
>>>       emitGlobalAddress(MO.getGlobal(), X86::reloc_pcrel_word,
>>>                         MO.getOffset(), 0, NeedStub);
>>>
>>> This causes every external call to be emitted as a call to a stub
>>> which then jumps to the real function.
>>> I understand, thanks to the helpful folks on #llvm, that calls
>>> across more than 31 bits of address space need to be emitted as a
>>> "mov $ADDRESS, r10; call *r10" pair instead of the simple "call
>>> rip+ADDRESS" used for calls within 31 bits. But why isn't the mov+call
>>> pair emitted inline? And why are Darwin and TAILJMPs special?
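
[Editor's sketch] The two call forms in the question above can be compared with a small stand-alone encoder. This is only an illustration of the x86-64 encodings, not code from LLVM; the helper names fitsInRel32, emitLE and encodeCall are hypothetical. It assumes the short form is a 5-byte "call rel32" whose displacement is measured from the end of the instruction, and the long form is "movabs $target, %r10; call *%r10".

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// True if a 5-byte `call rel32` placed at `site` can reach `target`.
// The displacement is relative to the address of the next instruction.
bool fitsInRel32(std::uint64_t site, std::uint64_t target) {
  std::int64_t disp = static_cast<std::int64_t>(target - (site + 5));
  return disp >= std::numeric_limits<std::int32_t>::min() &&
         disp <= std::numeric_limits<std::int32_t>::max();
}

// Append the low `n` bytes of `v` in little-endian order.
static void emitLE(std::vector<std::uint8_t> &out, std::uint64_t v, int n) {
  assert(n >= 1 && n <= 8);
  for (int i = 0; i < n; ++i)
    out.push_back(static_cast<std::uint8_t>(v >> (8 * i)));
}

// Encode a call from `site` to `target`, using the short form when it fits.
std::vector<std::uint8_t> encodeCall(std::uint64_t site, std::uint64_t target) {
  std::vector<std::uint8_t> out;
  if (fitsInRel32(site, target)) {
    out.push_back(0xE8);                  // call rel32
    emitLE(out, target - (site + 5), 4);  // signed 32-bit displacement
  } else {
    out.push_back(0x49);                  // REX.W + REX.B
    out.push_back(0xBA);                  // movabs $imm64, %r10
    emitLE(out, target, 8);
    out.push_back(0x41);                  // REX.B
    out.push_back(0xFF);                  // call r/m64
    out.push_back(0xD2);                  // ModRM selecting *%r10
  }
  return out;
}
```

Under these assumptions the inline long form costs 13 bytes per call site against 5 for the short form, while a stub keeps the call site at 5 bytes and pays an extra jump instead.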
>>
>> This is needed because of lazy compilation: before the callee is
>> resolved, the call target is just a JIT stub.
>
> Even with lazy compilation, the contents of the stub get emitted (by
> JITEmitter::getPointerToGlobal) as a direct call to the function, not
> the compilation callback, because the function is an external
> declaration. You can watch this happen with the following program:

There are probably some opportunities to improve the codegen here.
Please file a Bugzilla report so I'm reminded to take a look at some
point.

>
>
> declare i32 @rand()
>
> define i32 @main() nounwind {
> entry:
> 	%call = tail call i32 @rand()		; <i32> [#uses=1]
> 	%add = add i32 %call, 2		; <i32> [#uses=1]
> 	ret i32 %add
> }
>
> and the command line `lli -debug-only=jit -march=x86-64 test.bc`.
>
> With lazy compilation and a call to an internal function, the
> JITEmitter can emit a stub even if MachineRelocation::doesntNeedStub()
> (the field NeedStub gets passed into) returns true. Only returning
> false constrains the emitter.
>
>> It's heap allocated so it may not be in the lower 4G
>> even if the code size model is small. I know this is the case on Darwin
>> x86_64; I am not sure about other targets.
>
> Oh, other targets can certainly allocate code above 4G too.
> sys::AllocateRWX just uses mmap with no constraints on the returned
> address, and I've got a Linux desktop where that always produces an
> address over 4G.
>
>> I forgot why this is needed for
>> tail calls, sorry.
>>
>> In theory we can make the code generator inline mov+call, but in
>> reality it doesn't know whether it's jitting or not. Also, we really
>> want to keep the code generation the same (as much as possible)
>> whether it's jitting or compiling. One possible solution for this is
>> to add a code size model specifically for the JIT so the code
>> generator can generate more efficient code in that configuration.
>
> For non-JIT, the code generator doesn't ever need a stub, right? The

Right.

>
> linker does it using the relocation information? It must be ignoring
> the NeedStub parameter. ... But wait, is this code generator used for
> anything besides the JIT? Compiling uses the AsmPrinter until direct

We are talking about the system linker; it doesn't use this code. The
code generator proper doesn't know whether it's generating code for
static compilation or for the JIT. The code that creates stubs etc. is
JIT specific. The JIT has to do a bit more work since it can't rely on
anything else to relocate symbols.
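
[Editor's sketch] As a rough illustration of the JIT resolving call relocations on its own, the toy model below picks the address a call's rel32 field should point at. It is not LLVM's JITEmitter; CallReloc, StubResolver and the 16-byte stub size are assumptions made for the example. The needStub field mirrors the NeedStub flag discussed in this thread: when it is set, the call always goes through a stub, and when it is clear, the resolver may still fall back to a stub if the callee is out of rel32 range.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <map>

// Hypothetical description of one call site the JIT has to patch.
struct CallReloc {
  std::uint64_t site;    // address of the 5-byte call instruction
  std::uint64_t target;  // final address of the callee
  bool needStub;         // mirrors the NeedStub flag from the thread
};

class StubResolver {
public:
  // `stubBase` is where stubs are placed; assumed to be near the JITed code.
  explicit StubResolver(std::uint64_t stubBase) : nextStub_(stubBase) {}

  // Returns the address the call's rel32 field should point at.
  std::uint64_t resolve(const CallReloc &r) {
    if (!r.needStub && reachable(r.site, r.target))
      return r.target;  // direct call, no indirection
    // Otherwise reuse the stub for this callee, or allocate a new one.
    auto it = stubs_.find(r.target);
    if (it == stubs_.end()) {
      it = stubs_.emplace(r.target, nextStub_).first;
      nextStub_ += kStubSize;
    }
    assert(reachable(r.site, it->second) && "stub must sit near the code");
    return it->second;
  }

  std::size_t stubCount() const { return stubs_.size(); }

private:
  static constexpr std::uint64_t kStubSize = 16;  // assumed stub size

  // True if a 5-byte call at `site` can reach `dest` with a rel32.
  static bool reachable(std::uint64_t site, std::uint64_t dest) {
    std::int64_t disp = static_cast<std::int64_t>(dest - (site + 5));
    return disp >= std::numeric_limits<std::int32_t>::min() &&
           disp <= std::numeric_limits<std::int32_t>::max();
  }

  std::map<std::uint64_t, std::uint64_t> stubs_;  // callee -> stub address
  std::uint64_t nextStub_;
};
```

In this model, forcing needStub to true for every external call sends even nearby callees through a stub, which is the extra indirection the thread is asking about.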

>
> object code generation lands, and presumably they're redesigning this
> whole subsystem.
>
>
> It sounds like I'd have to fully understand the whole structure of the
> code generator to fix this, and for <=2% performance, that's not
> really worth it. I'll probably wait for the direct object code people
> to get around to it. Thanks though.

This is not a part of the direct object code path. I'll look at it at  
some point.

Evan

>
>
>>>
>>>
>>> Having this out of line seems to lose up to 2% performance on the
>>> Unladen Swallow benchmarks, so, while it's not urgent, it'd be nice to
>>> figure out how to avoid the stubs.
>>>
>>> What kind of patch would be welcome to fix this?
>>>
>>> Thanks,
>>> Jeffrey