[LLVMdev] Is PIC code defeating the branch predictor?

Tue Jan 4 09:47:50 PST 2011

On Jan 4, 2011, at 12:37 AM, Owen Anderson wrote:

> 
> On Jan 3, 2011, at 11:30 PM, Jakob Stoklund Olesen wrote:
> 
>> I noticed that we generate code like this for i386 PIC:
>> 
>> 	calll	L0$pb
>> L0$pb:
>> 	popl	%eax
>> 	movl	%eax, -24(%ebp)         ## 4-byte Spill
>> 
>> I worry that this defeats return address prediction for the function's own returns, because its calls and returns are no longer matched.
> 
> Yes, this will defeat the processor's return address stack predictor.  That said, I suspect it's not much of an issue on "desktop" processors: the reissue of the pop is an Atom-specific issue, so you only need to worry about the branch misprediction caused on the next return.  Assuming these sequences aren't too frequent, the more elaborate tournament predictors in more powerful processors may be able to compensate for it.
> 
> Still, the alternative sequence you propose seems like it would be an improvement on any processor with a multiple-issue pipeline (unless ret does a lot more work than I think it does), though it doesn't fix the reissued pop problem on Atom.
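A toy model makes the mismatch concrete. The sketch below assumes a simplified, unbounded return stack buffer (RSB) that is pushed only by `call` and popped only by `ret`. Real RSBs are small fixed-size buffers, and a given microarchitecture may special-case some patterns. The function name `count_ret_mispredicts` and the labels `caller+5` and `L0$pb` are hypothetical stand-ins chosen for illustration.

```python
def count_ret_mispredicts(trace):
    """Replay call/pop/ret events and count returns whose target differs
    from what a simple, unbounded return stack buffer predicted."""
    stack = []  # architectural stack: return addresses actually in memory
    rsb = []    # predictor's private return stack buffer
    mispredicts = 0
    for op, *args in trace:
        if op == "call":
            # A call pushes its return address on both stacks.
            stack.append(args[0])
            rsb.append(args[0])
        elif op == "pop":
            # popl only touches the architectural stack; in this model the
            # RSB never sees it, so its top entry goes stale.
            stack.pop()
        elif op == "ret":
            actual = stack.pop()
            predicted = rsb.pop() if rsb else None
            if predicted != actual:
                mispredicts += 1
        else:
            raise ValueError(f"unknown op {op!r}")
    return mispredicts


# A PIC function that materializes its base with call+pop.
call_pop = [
    ("call", "caller+5"),  # caller: call func
    ("call", "L0$pb"),     # func:   calll L0$pb  (pushes L0$pb)
    ("pop",),              # func:   popl %eax    (RSB still holds L0$pb)
    ("ret",),              # func:   ret          (predicts L0$pb, goes to caller+5)
]

# The same function if the base came from a matched call/ret pair,
# e.g. a small helper that copies (%esp) into %eax and returns.
matched = [
    ("call", "caller+5"),
    ("call", "L0$pb"),     # call the helper
    ("ret",),              # helper returns to L0$pb, as predicted
    ("ret",),              # func returns to caller+5, as predicted
]

print(count_ret_mispredicts(call_pop))  # 1
print(count_ret_mispredicts(matched))   # 0
```

In this model the unmatched call leaves a stale entry on the RSB, and the next `ret` consumes it; the matched version keeps both stacks in step.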

Since PIC was already in use when the current Intel microarchitecture was designed, one could speculate: does it recognize a zero-length call (a call to the very next instruction) and ignore it for return prediction? The call+pop sequence is quite common, after all.

Strangely, Intel's optimization reference manual lists both code sequences in the Atom section but doesn't recommend one over the other.

I think the matched call+ret would be best if we could move some useful instructions into the called block, so the call isn't pure overhead. Transform this:

BB1:
	foo
	bar
	%eax = pic_base
	baz

Into this:

BB1:
	call BBx
	baz
...
BBx:
	foo
	bar
	movl (%esp), %eax
	ret

I don't know if it is worth it. The code appears in 32-bit PIC functions that access globals.
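The transform above can be sketched on a toy instruction list. This is not LLVM's API: `outline_pic_base`, the string form `"<reg> = pic_base"`, and the default callee name `BBx` are hypothetical. The sketch only rearranges the instructions. It assumes the moved instructions do not address memory relative to `%esp`, because inside the callee `%esp` is 4 bytes lower than before. Any such operands would need a +4 adjustment, which this sketch does not perform.

```python
def outline_pic_base(block, callee="BBx"):
    """Split a block at its "<reg> = pic_base" pseudo-instruction.

    Instructions before the marker move into a new callee block ending in
    a matched `movl (%esp), <reg>; ret`, so <reg> receives the return
    address (the instruction after the call), which serves as the PIC base.
    Returns (new_caller_block, new_callee_block).
    """
    hits = [k for k, insn in enumerate(block) if insn.endswith("= pic_base")]
    if len(hits) != 1:
        raise ValueError("expected exactly one pic_base pseudo-instruction")
    i = hits[0]
    reg = block[i].split("=")[0].strip()  # e.g. "%eax"
    callee_body = block[:i] + [f"movl (%esp), {reg}", "ret"]
    caller_body = [f"call {callee}"] + block[i + 1:]
    return caller_body, callee_body


caller, callee = outline_pic_base(["foo", "bar", "%eax = pic_base", "baz"])
print(caller)  # ['call BBx', 'baz']
print(callee)  # ['foo', 'bar', 'movl (%esp), %eax', 'ret']
```

Keeping the call and ret paired is the point of the rewrite: the RSB entry pushed by `call BBx` is consumed by the matching `ret`, and `foo` and `bar` give the callee useful work to do.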

/jakob