[cfe-dev] struct copy

Sun Sep 28 09:14:11 PDT 2008

On Sep 28, 2008, at 4:59 AM, Argiris Kirtzidis wrote:

> Chris Lattner wrote:
>>
>> In theory, it should be safe and fast for the compiler to always  
>> produce a memcpy and then let the backend lower it however it wants.
>
> What is wrong with having the compiler always produce a load/store ?

For small structs, almost nothing!  It has the same semantics as an  
element-by-element copy.  For large structs, you really don't want to  
do this.

The problem is that you need the same heuristic: you need to know that
the struct is "small" and that the LLVM type has no holes in which
some other element of the C type (e.g. through a union) stores data.
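
To make the "holes" condition concrete, here is a minimal sketch in C. The
names (Padded, Overlay, copy_fields, copy_bytes, demo_hole) are hypothetical
and chosen only for illustration, and it assumes a typical ABI where
padding follows the char member. One union member sees a hole where the
other member stores data, so a field-by-field copy of the struct view is
not guaranteed to carry those bytes across, while a bytewise copy is.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical types for illustration; not taken from the thread. */
struct Padded {
    char    tag;    /* on typical ABIs, padding bytes follow this member */
    int32_t value;
};

union Overlay {
    struct Padded p;                          /* view that has a hole    */
    unsigned char raw[sizeof(struct Padded)]; /* view that fills the hole */
};

/* Element-by-element copy: only the named fields are transferred.
 * The contents of any padding in dst afterwards are not guaranteed. */
static void copy_fields(union Overlay *dst, const union Overlay *src) {
    dst->p.tag   = src->p.tag;
    dst->p.value = src->p.value;
}

/* Bytewise copy: every byte of the object, holes included. */
static void copy_bytes(union Overlay *dst, const union Overlay *src) {
    memcpy(dst, src, sizeof *dst);
}

static void demo_hole(void) {
    union Overlay src, by_field, by_byte;
    memset(&src, 0xAB, sizeof src);       /* fill every byte via raw view */
    memset(&by_field, 0, sizeof by_field);
    memset(&by_byte, 0, sizeof by_byte);

    copy_fields(&by_field, &src);
    copy_bytes(&by_byte, &src);

    /* Both copies agree on the named fields. */
    assert(by_field.p.tag == src.p.tag);
    assert(by_field.p.value == src.p.value);
    /* Only the bytewise copy is guaranteed to preserve raw[1]. */
    assert(by_byte.raw[1] == 0xAB);
}
```

Calling demo_hole() exercises both paths. The sketch deliberately makes no
claim about by_field.raw[1], since that byte is exactly the data a
member-wise copy may drop.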

Mattias wrote:
> Chris Lattner <clattner at apple.com> writes:
>
>> This is true on the micro-level, but is false in the macro level.
>> For example, if the caller of a function does a one byte store into a
>> struct field, and the callee does a memcpy (ending up with a 32-bit
>> read), you get a store forwarding speculation failure on most out of
>> order processors.
>
> Thank you, I didn't think of that. I wonder to what extent that effect
> depends on the existence of holes in the struct layout. Should structs
> always be copied member-by-member, even if they consist of many small
> bytes?
>
> struct S {
>    char c[7];
>    double d;
> };
>
> Would seven byte copies really be faster than one 64-bit word copy?
> If there were eight bytes in the array, there would be no hole but
> the problem remains - at least if the entire array was not written to
> before the struct copy.

There is no easy answer; it depends a lot on other environmental
effects, like whether there is an access to c[3] or c[i] shortly after
the store to the struct.

In the absence of such an access that is close enough to matter,  
you're right that it would be better to copy the 7 bytes with an 8  
byte load and store instead of 7 one-byte load/stores.  The idea of the  
current heuristic is that the code generator should theoretically be  
able to merge neighboring load/stores into wider load/stores.  The  
two caveats here are 1) the codegen doesn't do this yet, and 2) even  
when it does, it won't know when it is safe for it to load/store  
*more* data than is requested.  In this case, it wouldn't know that it  
is safe to load/store 8 bytes, because only 7 are being accessed.
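
As a rough sketch of the two strategies for the struct S quoted above,
written in C: the function names (copy_c_bytewise, copy_c_wide, demo_copy)
are hypothetical, and the wide version is only valid under the stated
assumption that the eighth byte is padding inside the object. That
assumption is exactly the piece of knowledge a load/store merger would
lack.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct S {
    char   c[7];
    double d;
};

/* Assumption: byte 7 is padding that belongs to S, so touching it is safe. */
_Static_assert(offsetof(struct S, d) >= 8, "expected padding after c[7]");

/* Seven one-byte load/stores: touches exactly the bytes requested. */
static void copy_c_bytewise(struct S *dst, const struct S *src) {
    for (size_t i = 0; i < sizeof src->c; ++i)
        dst->c[i] = src->c[i];
}

/* One 8-byte load and store: also copies the padding byte after c[6]. */
static void copy_c_wide(struct S *dst, const struct S *src) {
    uint64_t word;
    memcpy(&word, src, sizeof word);  /* load c[0..6] plus the pad byte */
    memcpy(dst, &word, sizeof word);  /* store all 8 bytes              */
}

static void demo_copy(void) {
    struct S src, a, b;
    memset(&src, 0, sizeof src);
    memset(&a, 0, sizeof a);
    memset(&b, 0, sizeof b);
    memcpy(src.c, "abcdefg", sizeof src.c);
    a.d = 2.5;
    b.d = 2.5;

    copy_c_bytewise(&a, &src);
    copy_c_wide(&b, &src);

    /* Both strategies produce the same array contents... */
    assert(memcmp(a.c, src.c, sizeof src.c) == 0);
    assert(memcmp(b.c, src.c, sizeof src.c) == 0);
    /* ...and neither disturbs the neighboring member. */
    assert(a.d == 2.5 && b.d == 2.5);
}
```

Which version is faster is not something this sketch settles; as noted
above, that depends on nearby accesses such as a narrow store to c[3]
right before a wide load.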

Most optimization work in LLVM is demand driven.  If you find a real  
world testcase that this impacts, we can devise some solutions.

-Chris