[LLVMdev] llvm-gcc + abi stuff

Chris Lattner clattner at apple.com
Wed Jan 23 21:41:41 PST 2008

<moving this to llvmdev instead of commits>

On Jan 22, 2008, at 11:23 PM, Duncan Sands wrote:

>> Okay, well we already get many other x86-64 issues wrong already, but
>> Evan is chipping away at it.  How do you pass an array by value in C?
>> Example please,
> I find the x86-64 ABI hard to interpret, but it seems to say that
> aggregates are classified recursively, so it looks like a struct
> containing a small integer array should be passed in integer  
> registers.

Right.  For x86-64 in particular, this happens, but only if the struct  
is <= 128 bits.
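For concreteness, here is a hypothetical C sketch of that case: a struct wrapping a small integer array fits in 16 bytes, so under the SysV x86-64 ABI it is classified recursively and passed in two GPRs; either way, C by-value semantics mean the callee only touches its own copy (names here are illustrative, not from the thread):

```c
#include <assert.h>

/* Hypothetical example: a 16-byte struct containing a small array.
   It is <= 128 bits, so the x86-64 SysV ABI classifies it recursively
   and passes it in two general-purpose registers. */
struct small { int a[4]; };

/* Passed by value: the callee gets its own copy. */
static int sum(struct small s) {
    s.a[0] += 100;                 /* modifies only the local copy */
    return s.a[0] + s.a[1] + s.a[2] + s.a[3];
}
```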

> Also, it is easy in Ada: the compiler passes small arrays by value,

Ok, we should make sure this works when we think x86-64 is "done" :)

> Can you please clarify the roles of llvm-gcc and the code generators
> in getting ABI compatibility.

Sure!  This also mixes into the discussion in PR1937.  The basic  
problem we have is that ABI decisions can be completely arbitrary, and  
are defined in terms of the source level type system.  In our desire  
to preserve the source-language-independence of llvm, we can't just  
give the code generators an AST and tell it to figure out what to do.

While it would theoretically be useful to hand the target an llvm type  
and tell it to figure it out, this also doesn't work.  The most  
trivial example of this is that some ABIs require different handling  
for _Complex double, and "struct {double,double}", both of which lower  
to the same LLVM type.  This means that the LLVM type system isn't  
rich enough by itself to fully express how the target is supposed to  
handle something.
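To make that trivial example concrete (a minimal sketch, with illustrative names): the two C types below have identical size and layout, and both lower to the LLVM type { double, double }, yet an ABI may still mandate different passing for each, which is exactly the information the LLVM type cannot carry:

```c
#include <assert.h>
#include <complex.h>

/* Both of these lower to the LLVM type { double, double }, but some
   ABIs require them to be passed differently, so the LLVM type alone
   cannot tell the backend which rule applies. */
struct pair { double re, im; };

static double pair_re(struct pair p)        { return p.re; }
static double cplx_re(_Complex double z)    { return creal(z); }
```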

Right now, the LLVM IR has two ways to express argument passing +  
return value:

1) pass/return with a first class value type, like i32, float, <4 x  
i32>, etc.
2) pass/return with a pointer to the place to do things and use the  
byval/sret attributes to indicate this.

It's useful to notice that the formulation of something in the IR  
doesn't force the code generator to do anything (e.g. x86-32 passes  
almost everything on the stack regardless of whether you use scalars  
or byval), but it does have an impact on the optimizers and on  
compile time.

For example, consider a target where everything is passed on the  
stack.  In this case, from a functionality perspective, it doesn't  
matter whether you use byval to pass the argument or pass it as scalar  
values.  However, picking the right one *can* have code quality and  
QOI impact.  For example, if passing a 100K struct by value, it is  
much better (in terms of compile time and generated code) to use byval  
than to scalarize it and pass all the elements.

OTOH, passing an argument 'byval' on this target prevents it from  
being SROA'd on the callee and caller side.  If the argument is small  
(say a 32-bit struct), this can cause significant performance  
degradation.  As a QOI issue, it is better to pass a small aggregate  
like this as a scalar in this case.

In practice, most targets have more complex ABIs than the theoretical  
one above.  For example, x86-32 passes scalar vectors in registers up  
to a point.  On that target, the code generator contract  
is that 'byval' arguments are always passed in memory, SSE-compatible  
vectors are passed in XMM registers (up to a point), and everything  
else is passed in memory.

This has somewhat interesting implications: it means that it is okay  
to pass a {i32} struct as i32, and it means passing a _Complex float  
as two floats is also fine (yay for SROA).  However, it means that  
lowering a struct with two vectors in it into two vectors would  
actually break the ABI because the codegen would pass them in XMM regs  
instead of memory.  This is a funny dance which means that the front- 
end needs to be fully parameterized by the backend to do the lowering.

> When generating IR for x86-64, llvm-gcc
> sometimes chops by-value structs into pieces, and sometimes passes the
> struct as a byval parameter.  Since it chops up all-integer structs,
> and this corresponds more or less to what the ABI says, I assumed this
> was an attempt to get ABI correctness.  Especially as the code  
> generators
> don't seem to bother themselves with following the details of the  
> ABI (yet),
> and just push byval parameters onto the stack.

X86-64 is a much more complex ABI than x86-32.  The basic form of  
correctness is that the code generator:

1. Lowers byval arguments to memory.
2. Passes integer, FP, and vector arguments in GPRs and XMM regs  
where available.

This has an interesting impact on the C front-end.  In particular, #1  
is great for by value aggregates > 128 bits.  However, aggregates <=  
128 bits have a variety of possible cases, including:

1. Some aggregates are passed in memory.
2. Others treat the aggregate as 2 64-bit hunks, where either 64-bit  
hunk can be:
   2a. Passed in a GPR.
   2b. Passed in an XMM register.
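The eightbyte classification above can be sketched in C.  This is a simplified, hypothetical model (the names `classify_eightbyte`, `CLS_INTEGER`, etc. are mine, and the real ABI algorithm handles many more cases, such as MEMORY class and x87 types): each 64-bit hunk is SSE-class if every field overlapping it is floating point, otherwise INTEGER-class:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified sketch of SysV x86-64 classification for a small
   aggregate (<= 128 bits): split into 64-bit "eightbytes"; an
   eightbyte is SSE if all overlapping fields are FP, else INTEGER. */
enum cls { CLS_INTEGER, CLS_SSE };

struct field { size_t offset, size; int is_fp; };

static enum cls classify_eightbyte(const struct field *f, int n, size_t lo) {
    enum cls c = CLS_SSE;          /* SSE unless an integer field overlaps */
    for (int i = 0; i < n; i++)
        if (f[i].offset < lo + 8 && f[i].offset + f[i].size > lo && !f[i].is_fp)
            c = CLS_INTEGER;
    return c;
}

/* struct { float w, x, y, z; }  -> both eightbytes SSE (two XMM regs) */
static const struct field four_floats[] = {
    { 0, 4, 1 }, { 4, 4, 1 }, { 8, 4, 1 }, { 12, 4, 1 }
};

/* struct { long a; double b; }  -> INTEGER, then SSE (one GPR, one XMM) */
static const struct field mixed[] = { { 0, 8, 0 }, { 8, 8, 1 } };
```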

If you consider a struct like {float,float,float,float}, the  
interesting thing about this ABI is that it says this struct is passed  
in 2 XMM regs, with each pair of floats packed into the low two  
elements of one XMM register.  To lower this struct optimally, llvm-gcc  
should codegen this as two vector inserts yielding two XMM-sized  
values.  Codegen'ing it as a byval struct would be incorrect, because  
that would pass it on the stack.

I moved a big digression to the end of the mail.

I'll be the first to admit that this solution is suboptimal, but it is  
much better than what we had before.  Unresolved issues include: what  
alignment do we pass things on the stack with.  Evan recently fought  
with some crazy cases on x86-64 which currently require looking at the  
LLVM Type.  I'm not thrilled with this, but it seems like an ok thing  
to do for now.  If we find out it isn't, we'll have to extend the  
model somehow.

>>> This is an optimization, not a correctness issue
> I guess this means that the plan is to teach the codegenerators how to
> pass any aggregate byval in an ABI conformant way (not the case  
> right now),
> but still do some chopping up in the front-end to help the optimizers.

Right.  Currently, x86-32 attempts to pass 32-bit and 64-bit structs  
"better" than just using byval as an optimization for some common  
small cases.  However, the problem is that it doesn't generate "nice"  
accesses into the struct: it just bitcasts the pointer and does a  
32/64-bit load, which can often prevent SROA itself.  This needs to be  
fixed to get really good code, but this is an optimization, not a  
correctness issue.  Disabling this and passing these structs byval on  
x86-32 would generate equally correct code.
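In C terms, the "bitcast + wide load" trick amounts to reinterpreting the small struct as a single integer; a hypothetical sketch (the function name is mine) of why the single wide access obscures the individual fields:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration of the x86-32 trick: a 32-bit struct is
   forwarded as one i32 by loading it through a reinterpreted pointer.
   The single wide load hides the individual fields from later
   field-by-field analysis, which is what blocks SROA downstream. */
struct two_shorts { short a, b; };

static uint32_t pass_as_word(const struct two_shorts *p) {
    uint32_t w;
    memcpy(&w, p, sizeof w);       /* the "bitcast + 32-bit load" */
    return w;
}
```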

> Of course this chopping up needs to be done carefully so the final  
> result
> squirted out by the codegenerators (once they are ABI conformant) is  
> the
> same as if the chopping had not been done...

Right, and all this is target-specific, yuck. :)

> Is this chopping really a
> big win?  Is it not possible to get an equivalent level of  
> optimization
> by enhancing alias analysis?

Nope, AA isn't involved here, because you can't know who called you in  
general.  For example, consider this contrived example:

struct s { int x; };
int foo(struct s S) { return S.x; }

With byval, this turns into a load + return at the IR level.  Without  
byval this is just a return.  There is no amount of alias analysis you  
can do on this, because we don't know who calls it.  Without changing  
the prototype of the IR function to not be byval, you can't eliminate  
the explicit load.
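Spelled out as a complete C program (the caller is added here purely for illustration; at the IR level the point is that no analysis of foo alone can see all callers):

```c
#include <assert.h>

struct s { int x; };

/* When S is passed byval at the IR level, the callee must load S.x
   from the stack copy; passed as a scalar i32, this is just a return. */
static int foo(struct s S) { return S.x; }
```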

The Digression:

Incidentally, on x86-64, we're currently lowering this code to  
suboptimal (but correct) code that passes this as two doubles and goes  
through memory to get it into floats instead of using vector extracts:

struct a { float w, x, y, z; };
float foo(struct a b) { return b.w+b.x+b.y+b.z; }

	%struct.a = type { float, float, float, float }

define float @foo(double %b.0, double %b.1) nounwind  {
	%b_addr = alloca { double, double }		; <{ double, double }*> [#uses=4]
	%tmpcast = bitcast { double, double }* %b_addr to %struct.a*		; <%struct.a*> [#uses=3]
	%tmp1 = getelementptr { double, double }* %b_addr, i32 0, i32 0		; <double*> [#uses=1]
	store double %b.0, double* %tmp1, align 8
	%tmp3 = getelementptr { double, double }* %b_addr, i32 0, i32 1		; <double*> [#uses=1]
	store double %b.1, double* %tmp3, align 8
	%tmp5 = bitcast { double, double }* %b_addr to float*		; <float*> [#uses=1]
	%tmp6 = load float* %tmp5, align 8		; <float> [#uses=1]
	%tmp7 = getelementptr %struct.a* %tmpcast, i32 0, i32 1		; <float*> [#uses=1]
	%tmp8 = load float* %tmp7, align 4		; <float> [#uses=1]
	%tmp9 = add float %tmp6, %tmp8		; <float> [#uses=1]
	%tmp10 = getelementptr %struct.a* %tmpcast, i32 0, i32 2		; <float*> [#uses=1]
	%tmp11 = load float* %tmp10, align 4		; <float> [#uses=1]
	%tmp12 = add float %tmp9, %tmp11		; <float> [#uses=1]
	%tmp13 = getelementptr %struct.a* %tmpcast, i32 0, i32 3		; <float*> [#uses=1]
	%tmp14 = load float* %tmp13, align 4		; <float> [#uses=1]
	%tmp15 = add float %tmp12, %tmp14		; <float> [#uses=1]
	ret float %tmp15
}

This yields correct but suboptimal code:
	subq	$16, %rsp
	movsd	%xmm0, (%rsp)
	movsd	%xmm1, 8(%rsp)
	movss	(%rsp), %xmm0
	addss	4(%rsp), %xmm0
	addss	8(%rsp), %xmm0
	addss	12(%rsp), %xmm0
	addq	$16, %rsp

We really want:

	movaps	%xmm0, %xmm2
	shufps	$1, %xmm2, %xmm2
	addss	%xmm2, %xmm0
	addss	%xmm1, %xmm0
	shufps	$1, %xmm1, %xmm1
	addss	%xmm1, %xmm0

