[PATCH] Devirtualize llvm::Value and all subclasses

David Blaikie dblaikie at gmail.com
Fri Jun 26 17:04:53 PDT 2015


On Fri, Jun 26, 2015 at 3:47 PM, Pete Cooper <peter_cooper at apple.com> wrote:

>
> On Jun 26, 2015, at 3:26 PM, David Blaikie <dblaikie at gmail.com> wrote:
>
>
>
> On Fri, Jun 26, 2015 at 3:05 PM, Pete Cooper <peter_cooper at apple.com>
> wrote:
>
>>
>> On Jun 26, 2015, at 2:14 PM, David Blaikie <dblaikie at gmail.com> wrote:
>>
>>
>>
>> On Fri, Jun 26, 2015 at 1:50 PM, Pete Cooper <peter_cooper at apple.com>
>> wrote:
>>
>>>
>>> On Jun 26, 2015, at 1:42 PM, David Blaikie <dblaikie at gmail.com> wrote:
>>>
>>> Completely shooting from the hip, we have been seeing some slow-downs
>>> over the last week or so & there's at least a vague guess that it could be
>>> related to these pervasive data structure changes that have been going in
>>> within that window.
>>>
>>> That’s a reasonable guess.  I’d think the same.
>>>
>>>
>>> Have you been running/seen any compile-time perf changes in that same
>>> window?
>>>
>>> I have, yeah.  We have the bitcode saved from the linker for
>>> verify_uselist_order, with and without debug info and before/after
>>> optimizations.  I’ve been putting those through opt or llc and
>>> measuring compile time.  It never strayed from the noise, TBH.
>>>
>>
>> What about broader perf numbers? Not seeing any significant fluctuations
>> in the last 7-10 days?
>>
>> I just took a look at our internal compile-time graphs for SPEC2000.
>> They are showing a general upward trend recently, which is worrying.
>>
>> Looking at 2 ~1% spikes, I have commits in the range r239993-r240004 and
>> r240426-r240456.
>>
>
> Went back and looked at what we first started seeing, and it looks to be in
> the r239765 (good) - r239809 (bad) range, and more than ~1% (in the 7% (-O0)
> to 17% (-O3 msan) range)…
>
> Wow, I didn’t observe anything remotely close to those numbers.
>

On further inspection Richard Trieu reckons it might've gone as far
back as r238924,
for whatever that's worth.


>
>
> It's still perfectly possible this is some internal infrastructure issue.
> If we find something more actionable, I'll be certain to share.
>
> Cool, thanks.
>
>
> Nothing in that range (in LLVM at least, haven't checked Clang) looks
> particularly suspicious to me... but I'm not as skilled at this.
>
> I don’t see anything obvious either.  Lots of code cleanup and additions
> of new features, but no radical changes to existing code.
>
> Pete
>
>
>
>>
>> There are none of mine in those particular ranges (I’m totally not trying
>> to pass the buck here BTW, just that the largest recent spikes I happened
>> to look at don’t have my patches).  It’s very possible my patches have a
>> negative compile-time impact which is part of the general trend and doesn’t
>> show as a spike.
>>
>> Prior to those ranges, there’s another compile-time spike we never
>> recovered from in the range r237378-r237422.  I believe the likely
>> candidate there is r237395 "Add another InstCombine pass after
>> LoopUnroll".  However, I’m not saying we should necessarily change that
>> commit, as it likely gives runtime perf gains which make up for the compile
>> time (speculating here, didn’t actually check).  Anyway, it’s an old commit
>> range at this point; I just wanted to point it out while I was looking at
>> the graphs, since it caught my attention.
>>
>>
>>> One of the things I was thinking, if this is an issue, is that we could
>>> do something vtable-esque.  Since we already have an enum to differentiate
>>> the various derived classes, could we have an array of manual vtables,
>>> indexed by that enum, and explicitly invoke through that?  That assumes the
>>> switch stuff is what's slowing things down - it's possible it isn't, and
>>> the switch is cheaper than the indirect call, etc.
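A rough sketch of the enum-indexed "manual vtable" idea being floated here. All names (`Val`, `ValueKind`, the handlers) are hypothetical stand-ins, not LLVM's actual `Value` hierarchy:

```cpp
#include <cassert>

// Hypothetical stand-in for llvm::Value's kind enum; illustrative only.
enum ValueKind { VK_Argument, VK_Constant, VK_Instruction, VK_NumKinds };

struct Val {
  ValueKind Kind;
  explicit Val(ValueKind K) : Kind(K) {}
};

// Record which handler ran so the dispatch can be checked.
static ValueKind LastDestroyed = VK_NumKinds;

static void destroyArgument(Val *V)    { LastDestroyed = VK_Argument;    delete V; }
static void destroyConstant(Val *V)    { LastDestroyed = VK_Constant;    delete V; }
static void destroyInstruction(Val *V) { LastDestroyed = VK_Instruction; delete V; }

// The "manual vtable": one function pointer per kind, indexed by the enum.
using DestroyFn = void (*)(Val *);
static const DestroyFn DestroyTable[VK_NumKinds] = {
    destroyArgument, destroyConstant, destroyInstruction};

// Explicit dispatch: one indexed load plus one indirect call, rather than a
// switch that first branches to a per-kind block which then makes the call.
static void destroy(Val *V) { DestroyTable[V->Kind](V); }
```

Whether this beats the switch is exactly the open question in the thread: the table trades a (possibly predictable) branch for an indirect call.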
>>>
>>> I was about to reply ‘surely the optimizer will just make a jump table
>>> from the switch which is just as fast’.
>>>
>>
>> Could be, I really don't know.
>>
>>
>>> However, then I thought about what that would look like, and it’s not
>>> quite what you’re suggesting.  The jump table optimization would only jump
>>> to a block which then does the call, so the call itself is then another
>>> jump.
>>>
>>> We couldn’t rely on it 100%, but I wonder if we can teach the optimizer
>>> to optimize jump tables to calls where every call has identical argument
>>> lists.
>>>
>>> Anyway, it probably wouldn’t help with the performance tracking for me
>>> to commit this (assuming someone gives it a LGTM), so I’ll hold off for now.
>>>
>>
>> Yeah, until you/I/we/someone has actually reliable perf stuff that points
>> to any particular change or changes, I probably wouldn't start shooting in
>> the dark - just figured I'd bring it up in case it's something you're
>> already aware of or, upon looking, would become so.
>>
>> Thanks for checking, and let me know if I can be of any more help.  I’ll
>> certainly be more stringent about checking compile time on these patches.
>> I don’t want to even slightly help compile time get any worse.
>>
>> Cheers,
>> Pete
>>
>>
>> - Dave
>>
>>
>>>
>>> Pete
>>>
>>>
>>> Just some idle thoughts.
>>>
>>> On Tue, Jun 16, 2015 at 4:46 PM, Pete Cooper <peter_cooper at apple.com>
>>> wrote:
>>>
>>>> Hi Nick
>>>>
>>>> As you were the last person to attempt this on PR889, I thought I’d see
>>>> how you like this series of patches.
>>>>
>>>> I’ve tried to devirtualize a small number of virtual methods at a time,
>>>> hence the bunch of patches.  Unfortunately the last one is pretty huge, as
>>>> it handles all the places we used to call delete on Values and now need to
>>>> call a destroy method to dispatch to the correct subclass.  I’m not sure
>>>> there’s a great way to split the last one up.
>>>>
>>>> David, I’ve CCed you specifically to ask how safe some of the
>>>> changes here are.  In particular, the way I’ve changed things is:
>>>> - If a subclass is final (and I’ve added final where I can), then I
>>>> give it its own delete method which just does ‘using User::operator
>>>> delete’.  Is this reasonable?  Do I need to handle any other kinds of
>>>> delete like User does?
>>>> - If a subclass is not able to be final, then I’ve added a destroy()
>>>> method which switches on the value kind and calls ‘delete
>>>> cast<SubClass>(this)’
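A minimal sketch of the non-final case described above: a non-virtual `destroy()` that switches on the kind enum and deletes as the concrete type. The class names and the two-kind enum are hypothetical; the real patch uses `cast<>` and LLVM's value kinds:

```cpp
#include <cassert>

// Illustrative kinds and hierarchy; not LLVM's actual classes.
enum ValueKind { VK_BinOp, VK_Ret };

static int Destroyed = 0;

struct Value {
  ValueKind Kind;
  explicit Value(ValueKind K) : Kind(K) {}
  void destroy();      // non-virtual; dispatches on Kind below
protected:
  ~Value() = default;  // protected: deletion must go through destroy()
};

struct BinOp final : Value {
  BinOp() : Value(VK_BinOp) {}
  ~BinOp() { ++Destroyed; }
  // A final subclass could instead re-expose a base operator delete, e.g.
  //   using Value::operator delete;
  // when the base defines a class-specific one, as the email describes for User.
};

struct Ret final : Value {
  Ret() : Value(VK_Ret) {}
  ~Ret() { ++Destroyed; }
};

// Switch on the kind and delete as the correct concrete type, so the right
// destructor runs without any virtual dispatch.
void Value::destroy() {
  switch (Kind) {
  case VK_BinOp: delete static_cast<BinOp *>(this); break;
  case VK_Ret:   delete static_cast<Ret *>(this);   break;
  }
}
```

The protected, non-virtual base destructor makes a stray `delete basePtr;` a compile error, which is what forces callers through `destroy()`.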
>>>>
>>>> Thanks to Duncan for the tip on std::default_delete; I’ve made use of
>>>> it in various places which had a std::unique_ptr of, say, Instruction or
>>>> Value.  The default deleter for those calls destroy() instead of delete.
>>>> Deleters aren’t needed for final subclasses because I was able to give
>>>> them a delete method instead.
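The std::default_delete trick can be sketched like this. `Inst` is a hypothetical stand-in for a class such as Instruction, and its `destroy()` here just deletes directly, where the real one would dispatch on the value kind:

```cpp
#include <cassert>
#include <memory>

static int Destroyed = 0;

// Hypothetical non-final class whose deletion must go through destroy().
struct Inst {
  void destroy() { ++Destroyed; delete this; }
};

// Specializing std::default_delete for a program-defined type is permitted.
// Every std::unique_ptr<Inst> with the default deleter now calls destroy()
// instead of plain delete.
namespace std {
template <> struct default_delete<Inst> {
  void operator()(Inst *I) const { I->destroy(); }
};
}
```

The appeal is that existing `std::unique_ptr<Inst>` members and locals pick this up without changing their declared type, since `default_delete<Inst>` is the default template argument.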
>>>>
>>>> I also extended helpers like DeleteContainerPointers to take a
>>>> std::default_delete so that we can forward to the destroy methods where
>>>> needed.
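A hypothetical shape for that helper extension: the container-cleanup routine takes a deleter object instead of hard-coding `delete`, so it can forward to `destroy()`. Names here are illustrative, not the actual signature in the patch:

```cpp
#include <cassert>
#include <vector>

static int Freed = 0;

// Illustrative element type whose deletion must go through destroy().
struct Node {
  void destroy() { ++Freed; delete this; }
};

// Deleter forwarding to destroy(); plays the role std::default_delete
// plays in the patch.
struct NodeDeleter {
  void operator()(Node *N) const { N->destroy(); }
};

// DeleteContainerPointers-style helper, extended to apply a caller-supplied
// deleter to each pointer before clearing the container.
template <typename Container, typename Deleter>
void DeleteContainerPointers(Container &C, Deleter D) {
  for (auto *Ptr : C)
    D(Ptr);
  C.clear();
}
```
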
>>>>
>>>> I measured memory usage on the verify_uselist_order test case Duncan
>>>> has been using.  This contains debug info and is measured with 'llc
>>>> ~/verify-uselistorder.lto.opt.bc -o /dev/null --filetype=obj
>>>> -disable-verify’.  Prior to this change we consumed 814MB, and 800MB
>>>> after, so almost 2% savings.  This is in line with what I was expecting:
>>>> the saving is 8 bytes per Value, but due to malloc alignment on Mac OS we
>>>> only realize it on about half of all Values on average.
>>>>
>>>> Finally, this is the order in which I did all the work locally.  If any
>>>> of it makes sense to squash or split, please let me know.  In the first
>>>> patch, for example, I create Constant.def, but I later change that to
>>>> Value.def.  That one should obviously be fixed to use Value.def from the
>>>> start.
>>>>
>>>> Cheers,
>>>> Pete
>>>>
>>>>
>>>
>>>
>>
>>
>
>


More information about the llvm-commits mailing list