[LLVMdev] RFC: PerfGuide for frontend authors

Sat Feb 28 14:30:59 PST 2015

On 2015.02.28 14:23:02 -0800, Philip Reames wrote:
> On 02/28/2015 10:04 AM, Björn Steinbrink wrote:
> >Hi,
> >
> >On 2015.02.28 10:53:35 -0600, Hal Finkel wrote:
> >>----- Original Message -----
> >>>From: "Philip Reames" <listmail at philipreames.com>
> >>>>6. Use the lifetime.start/lifetime.end and
> >>>>invariant.start/invariant.end intrinsics where possible
> >>>Do you find these help in practice?  The few experiments I ran were
> >>>neutral at best and harmful in one or two cases.  Do you have
> >>>suggestions on how and when to use them?
> >>Good point, we should be more specific here. My, admittedly limited,
> >>experience with these is that they're most useful when their
> >>properties are not dynamic -- which perhaps means that they
> >>post-dominate the entry, and are applied to allocas in the entry block
> >>-- and the larger the objects in question, the more the potential
> >>stack-space savings, etc.
> >my experience adding support for the lifetime intrinsics to the rust
> >compiler is largely positive (because our code is very stack heavy at
> >the moment), but we still suffer from missed memcpy optimizations.
> >That happens because I made the lifetime regions as small as possible,
> >and sometimes an alloca starts its lifetime too late for the optimization
> >to happen.  My new (but not yet implemented) approach is to "align" the
> >calls to lifetime.start for allocas with overlapping lifetimes unless
> >there's actually a possibility for stack slot sharing.
> >
> >For example we currently translate:
> >
> >     let a = [0; 1000000]; // Array of 1000000 zeros
> >     {
> >       let b = a;
> >     }
> >     let c = something;
> >
> >to roughly this:
> >
> >     lifetime.start(a)
> >     memset(a, 0, 1000000)
> >     lifetime.start(b)
> >     memcpy(b, a)
> >     lifetime.end(b)
> >     lifetime.start(c)
> >     lifetime.end(c)
> >     lifetime.end(a)
> >
> >The lifetime.start call for "b" stops the call-slot (I think)
> >optimization from being applied. So instead this should be translated to
> >something like:
> >
> >     lifetime.start(a)
> >     lifetime.start(b)
> >     memset(a, 0, 1000000)
> >     memcpy(b, a)
> >     lifetime.end(b)
> >     lifetime.start(c)
> >     lifetime.end(c)
> >     lifetime.end(a)
> >
> >extending the lifetime of "b" because it overlaps with that of "a"
> >anyway. The lifetime of "c" still starts after the end of "b"'s lifetime
> >because there's actually a possibility for stack slot sharing.
> >
> >Björn
> I'd be interested in seeing the IR for this that you're currently
> generating.  Unless I'm misreading your example, everything in this is
> completely dead.  We should be able to reduce this to nothing and if we
> can't, it's clearly a missed optimization.  I'm particularly interested in
> how the difference in placement of the lifetime start for 'b' affects
> optimization.  I really wouldn't expect that.

I should have clarified that that was a reduced, incomplete example. The
actual code looks like this (after optimizations):

  define void @_ZN9test_func20hdd8a534ccbedd903paaE(i1 zeroext) unnamed_addr #0 {
  entry-block:
    %x = alloca [100000 x i32], align 4
    %1 = bitcast [100000 x i32]* %x to i8*
    %arg = alloca [100000 x i32], align 4
    call void @llvm.lifetime.start(i64 400000, i8* %1)
    call void @llvm.memset.p0i8.i64(i8* %1, i8 0, i64 400000, i32 4, i1 false)
    %2 = bitcast [100000 x i32]* %arg to i8*
    call void @llvm.lifetime.start(i64 400000, i8* %2) ; this happens too late
    call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %1, i64 400000, i32 4, i1 false)
    call void asm "", "r,~{dirflag},~{fpsr},~{flags}"([100000 x i32]* %arg) #2, !noalias !0, !srcloc !3
    call void @llvm.lifetime.end(i64 400000, i8* %2) #2, !alias.scope !4, !noalias !0
    call void @llvm.lifetime.end(i64 400000, i8* %2)
    call void @llvm.lifetime.end(i64 400000, i8* %1)
    ret void
  }

If the lifetime start for %arg is moved up, before the memset, the
callslot optimization can take place and the %c alloca is eliminated,
but with the lifetime starting after the memset, that isn't possible.
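
To make the "align" idea from the quoted mail a bit more concrete, here
is a rough sketch of the placement rule, written as a standalone Rust
function over a toy instruction list. Nothing like this exists in rustc
today; the Op enum and align_lifetime_starts are made-up names for
illustration only. The assumption is that a lifetime.start may be moved
up past ordinary instructions until it sits next to an earlier
lifetime.start, but never across a lifetime.end, since that would
remove a chance for stack slot sharing.

```rust
// Hypothetical model of a frontend's instruction stream. These names are
// illustrative only and do not correspond to any rustc or LLVM API.
#[derive(Debug, Clone, PartialEq)]
enum Op {
    LifetimeStart(&'static str),
    LifetimeEnd(&'static str),
    Other(&'static str), // any non-lifetime instruction (memset, memcpy, ...)
}

/// Move each lifetime.start upwards past ordinary instructions so that it
/// directly follows an earlier lifetime.start whose alloca is still live.
/// A start is never moved across a lifetime.end: the ended alloca could
/// share a stack slot with the new one, and hoisting would prevent that.
fn align_lifetime_starts(ops: &[Op]) -> Vec<Op> {
    let mut out: Vec<Op> = Vec::with_capacity(ops.len());
    for op in ops {
        if let Op::LifetimeStart(_) = op {
            // Walk back over the trailing run of non-lifetime instructions.
            let mut i = out.len();
            while i > 0 && matches!(out[i - 1], Op::Other(_)) {
                i -= 1;
            }
            // Hoist only if that run is preceded by another lifetime.start,
            // i.e. the two lifetimes overlap anyway.
            if i > 0 && matches!(out[i - 1], Op::LifetimeStart(_)) {
                out.insert(i, op.clone());
                continue;
            }
        }
        // Everything else (and any start we could not hoist) stays in place.
        out.push(op.clone());
    }
    out
}
```

Fed the start/memset/start/memcpy sequence from the reduced example,
this places the start for "b" directly after the start for "a" and
leaves the start for "c" where it is, because it follows the end of
"b"'s lifetime.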

Björn