[LLVMdev] RFC: PerfGuide for frontend authors

Sat Feb 28 14:50:57 PST 2015



> On Feb 28, 2015, at 2:30 PM, Björn Steinbrink <bsteinbr at gmail.com> wrote:
> 
>> On 2015.02.28 14:23:02 -0800, Philip Reames wrote:
>>> On 02/28/2015 10:04 AM, Björn Steinbrink wrote:
>>> Hi,
>>> 
>>>> On 2015.02.28 10:53:35 -0600, Hal Finkel wrote:
>>>> ----- Original Message -----
>>>>> From: "Philip Reames" <listmail at philipreames.com>
>>>>>> 6. Use the lifetime.start/lifetime.end and
>>>>>> invariant.start/invariant.end intrinsics where possible
>>>>> Do you find these help in practice?  The few experiments I ran were
>>>>> neutral at best and harmful in one or two cases.  Do you have
>>>>> suggestions on how and when to use them?
>>>> Good point, we should be more specific here. My, admittedly limited,
>>>> experience with these is that they're most useful when their
>>>> properties are not dynamic -- which perhaps means that they
>>>> post-dominate the entry, and are applied to allocas in the entry block
>>>> -- and the larger the objects in question, the more the potential
>>>> stack-space savings, etc.
>>> my experience adding support for the lifetime intrinsics to the Rust
>>> compiler is largely positive (because our code is very stack heavy at
>>> the moment), but we still suffer from missed memcpy optimizations.
>>> That happens because I made the lifetime regions as small as possible,
>>> and sometimes an alloca starts its lifetime too late for the optimization
>>> to happen.  My new (but not yet implemented) approach is to "align" the
>>> calls to lifetime.start for allocas with overlapping lifetimes unless
>>> there's actually a possibility for stack slot sharing.
>>> 
>>> For example we currently translate:
>>> 
>>>    let a = [0; 1000000]; // Array of 1000000 zeros
>>>    {
>>>      let b = a;
>>>    }
>>>    let c = something;
>>> 
>>> to roughly this:
>>> 
>>>    lifetime.start(a)
>>>    memset(a, 0, 1000000)
>>>    lifetime.start(b)
>>>    memcpy(b, a)
>>>    lifetime.end(b)
>>>    lifetime.start(c)
>>>    lifetime.end(c)
>>>    lifetime.end(a)
>>> 
>>> The lifetime.start call for "b" stops the call-slot (I think)
>>> optimization from being applied. So instead this should be translated to
>>> something like:
>>> 
>>>    lifetime.start(a)
>>>    lifetime.start(b)
>>>    memset(a, 0, 1000000)
>>>    memcpy(b, a)
>>>    lifetime.end(b)
>>>    lifetime.start(c)
>>>    lifetime.end(c)
>>>    lifetime.end(a)
>>> 
>>> extending the lifetime of "b" because it overlaps with that of "a"
>>> anyway. The lifetime of "c" still starts after the end of "b"'s lifetime
>>> because there's actually a possibility for stack slot sharing.
>>> 
>>> Björn
>> I'd be interested in seeing the IR for this that you're currently
>> generating.  Unless I'm misreading your example, everything in this is
>> completely dead.  We should be able to reduce this to nothing and if we
>> can't, it's clearly a missed optimization.  I'm particularly interested in
>> how the difference in placement of the lifetime start for 'b' affects
>> optimization.  I really wouldn't expect that.
> 
> I should have clarified that that was a reduced, incomplete example; the
> actual code looks like this (after optimizations):
> 
>  define void @_ZN9test_func20hdd8a534ccbedd903paaE(i1 zeroext) unnamed_addr #0 {
>  entry-block:
>    %x = alloca [100000 x i32], align 4
>    %1 = bitcast [100000 x i32]* %x to i8*
>    %arg = alloca [100000 x i32], align 4
>    call void @llvm.lifetime.start(i64 400000, i8* %1)
>    call void @llvm.memset.p0i8.i64(i8* %1, i8 0, i64 400000, i32 4, i1 false)
>    %2 = bitcast [100000 x i32]* %arg to i8*
>    call void @llvm.lifetime.start(i64 400000, i8* %2) ; this happens too late
>    call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %1, i64 400000, i32 4, i1 false)
>    call void asm "", "r,~{dirflag},~{fpsr},~{flags}"([100000 x i32]* %arg) #2, !noalias !0, !srcloc !3
>    call void @llvm.lifetime.end(i64 400000, i8* %2) #2, !alias.scope !4, !noalias !0
>    call void @llvm.lifetime.end(i64 400000, i8* %2)
>    call void @llvm.lifetime.end(i64 400000, i8* %1)
>    ret void
>  }
> 
> If the lifetime start for %arg is moved up, before the memset, the
> callslot optimization can take place and the %x alloca is eliminated,
> but with the lifetime starting after the memset, that isn't possible.
This bit of IR actually seems pretty reasonable given the inline asm.
The only thing I really see is that the memcpy could be a memset.  Are
you expecting something else?

Philip
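
For reference, the "align the lifetime.start calls" idea Björn describes
above can be modelled in a few lines. This is only a sketch under stated
assumptions, not what rustc or LLVM actually implements: it handles
straight-line code only, and Slot, aligned_starts, describe and
example_slots are hypothetical names made up for illustration. A slot's
start is hoisted to the start of a slot that is already live at its
first use, but never above the end of an earlier, disjoint slot, so the
possibility of stack slot sharing is kept.

```rust
/// A stack slot (alloca) together with the statement range in which its
/// value is needed. Positions index a straight-line block (an assumption
/// of this sketch; real code has control flow).
#[derive(Debug)]
struct Slot {
    name: &'static str,
    first_use: usize,
    last_use: usize,
}

/// For each slot, returns the statement index before which lifetime.start
/// would be emitted under the "aligned" placement.
fn aligned_starts(slots: &[Slot]) -> Vec<usize> {
    // Visit slots in order of first use so earlier slots are placed first.
    let mut order: Vec<usize> = (0..slots.len()).collect();
    order.sort_by_key(|&i| slots[i].first_use);

    // Start from the minimal placement: right before the first use.
    let mut starts: Vec<usize> = slots.iter().map(|s| s.first_use).collect();

    for &i in &order {
        let me = &slots[i];

        // Earliest (already placed) start among slots that are live when
        // `me` is first used. Hoisting to it costs no extra stack space,
        // because the two slots could not share storage anyway.
        let hoist_to = slots
            .iter()
            .enumerate()
            .filter(|&(j, s)| {
                j != i && s.first_use <= me.first_use && me.first_use <= s.last_use
            })
            .map(|(j, _)| starts[j])
            .min();

        // Never hoist above the end of a slot that is dead before `me` is
        // first used: that slot's storage could be reused for `me`.
        let floor = slots
            .iter()
            .filter(|s| s.last_use < me.first_use)
            .map(|s| s.last_use + 1)
            .max()
            .unwrap_or(0);

        if let Some(target) = hoist_to {
            starts[i] = target.max(floor).min(me.first_use);
        }
    }
    starts
}

/// Human-readable placement, one line per slot, in input order.
fn describe(slots: &[Slot]) -> Vec<String> {
    aligned_starts(slots)
        .iter()
        .zip(slots)
        .map(|(start, s)| format!("lifetime.start({}) before statement {}", s.name, start))
        .collect()
}

/// The a/b/c example from the thread: statement 0 is the memset of a,
/// statement 1 the memcpy into b, statement 2 the initialisation of c.
fn example_slots() -> Vec<Slot> {
    vec![
        Slot { name: "a", first_use: 0, last_use: 2 },
        Slot { name: "b", first_use: 1, last_use: 1 },
        Slot { name: "c", first_use: 2, last_use: 2 },
    ]
}
```

Calling describe(&example_slots()) places the start of "b" before
statement 0, next to the start of "a", while "c" stays at statement 2
because "b" is already dead there and the two could share a slot.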