[LLVMdev] RFC: PerfGuide for frontend authors

Sun Mar 1 09:54:42 PST 2015

On 03/01/2015 09:35 AM, Björn Steinbrink wrote:
> On 2015.03.01 09:27:02 -0800, Philip Reames wrote:
>>
>>
>>
>>> On Mar 1, 2015, at 7:53 AM, Björn Steinbrink <bsteinbr at gmail.com> wrote:
>>>
>>> On 2015.02.28 18:17:27 -0800, Philip Reames wrote:
>>>>> On Feb 28, 2015, at 3:01 PM, Björn Steinbrink <bsteinbr at gmail.com> wrote:
>>>>> 2015-02-28 23:50 GMT+01:00 Philip Reames <listmail at philipreames.com>:
>>>>>>>> On Feb 28, 2015, at 2:30 PM, Björn Steinbrink <bsteinbr at gmail.com> wrote:
>>>>>>> I should have clarified that that was a reduced, incomplete example, the
>>>>>>> actual code looks like this (after optimizations):
>>>>>>>
>>>>>>> define void @_ZN9test_func20hdd8a534ccbedd903paaE(i1 zeroext) unnamed_addr #0 {
>>>>>>> entry-block:
>>>>>>>   %x = alloca [100000 x i32], align 4
>>>>>>>   %1 = bitcast [100000 x i32]* %x to i8*
>>>>>>>   %arg = alloca [100000 x i32], align 4
>>>>>>>   call void @llvm.lifetime.start(i64 400000, i8* %1)
>>>>>>>   call void @llvm.memset.p0i8.i64(i8* %1, i8 0, i64 400000, i32 4, i1 false)
>>>>>>>   %2 = bitcast [100000 x i32]* %arg to i8*
>>>>>>>   call void @llvm.lifetime.start(i64 400000, i8* %2) ; this happens too late
>>>>>>>   call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %1, i64 400000, i32 4, i1 false)
>>>>>>>   call void asm "", "r,~{dirflag},~{fpsr},~{flags}"([100000 x i32]* %arg) #2, !noalias !0, !srcloc !3
>>>>>>>   call void @llvm.lifetime.end(i64 400000, i8* %2) #2, !alias.scope !4, !noalias !0
>>>>>>>   call void @llvm.lifetime.end(i64 400000, i8* %2)
>>>>>>>   call void @llvm.lifetime.end(i64 400000, i8* %1)
>>>>>>>   ret void
>>>>>>> }
>>>>>>>
>>>>>>> If the lifetime start for %arg is moved up, before the memset, the
>>>>>>> callslot optimization can take place and the %c alloca is eliminated,
>>>>>>> but with the lifetime starting after the memset, that isn't possible.
>>>>>> This bit of ir actually seems pretty reasonable given the inline asm.  The only thing I really see is that the memcpy could be a memset.  Are you expecting something else?
>>>>> The only thing that is to be improved is that the memset should
>>>>> directly write to %arg and %x should be removed because it is dead
>>>>> then. This happens when there are no lifetime intrinsics or when the
>>>>> call to lifetime.start is moved before the call to memset. The latter
>>>>> is what my first mail was about, that it is usually better to have
>>>>> overlapping lifetimes all start at the same point, instead of starting
>>>>> them as late as possible.
>>>> Honestly, this sounds like a clear optimizer bug, not something a frontend should work around.
>>>>
>>>> Can you file a bug with the four sets of ir?  (Both schedules, no intrinsics before and after). This should hopefully be easy to fix.
>>> I went ahead and made a fix: http://reviews.llvm.org/D7984
>> Fyi, I'm pretty sure this is the wrong approach. I need to confirm by
>> running with some test cases tomorrow, but if my initial analysis
>> holds up, the lifetime start may be just a red herring.  The
>> underlying issue appears to be a canonicalization issue with multiple
>> users of an alloca with conflicting types and the placement of the
>> bitcast used by the lifetime instruction.
> At least for the callslot optimization, the lifetime.start seems to be
> key. The call to memset isn't moved, so performing the optimization
> without moving the call to lifetime.start would result in the memset
> happening before the lifetime.start which means that it qualifies as a
> dead store and may be removed.
I actually ran my tests.  I was wrong.  My analysis of code apparently 
lead me astray.

I'll take another closer look at your patch tomorrow.

Philip