[PATCH] Improvements to SSA construction

Daniel Berlin dberlin at dberlin.org
Mon Apr 6 21:42:28 PDT 2015


LGTM algorithmically.

On Mon, Apr 6, 2015 at 6:02 PM, Cameron Zwarich <zwarich at apple.com> wrote:
> Thanks. If someone reviews the content of my 'a' patch then I'll land it.
>
> Cameron
>
>> On Apr 6, 2015, at 4:59 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>
>> So, to drive worst-case behavior (besides all the other testing), I
>> wrote the attached script to generate ladder graphs of size N, which
>> are the worst case (N^2) for DF/IDF algorithms.
>>
>> If you want perfect ladder-looking graphs, you have to use
>> -simplifycfg on the result.
>> Note: ladder graphs seem to generate at least n^2 behavior in
>> GraphWriter, so using a 50000-rung ladder graph doesn't seem sane with
>> -view-cfg.  It doesn't use n^2 memory at least, but I stopped it
>> after 15 minutes.
>>
>> Anyway, as one might expect, even something like using -domfrontier on
>> a 50000-rung ladder graph is not feasible.
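The attached laddergraph.py is not reproduced in this archive; as a purely illustrative stand-in (the real script emits LLVM IR, and every name here is invented), the shape of an N-rung ladder graph can be sketched in Python:

```python
def ladder_graph_edges(rungs):
    """Edge list of a ladder graph: two parallel rails of `rungs` nodes,
    with rung i connecting node (0, i) on one rail to (1, i) on the
    other. Each rung creates a merge point, which is what drives
    dominance-frontier computations toward N^2 behavior."""
    edges = []
    for i in range(rungs):
        edges.append(((0, i), (1, i)))          # the rung
        if i + 1 < rungs:
            edges.append(((0, i), (0, i + 1)))  # left rail
            edges.append(((1, i), (1, i + 1)))  # right rail
    return edges

# An N-rung ladder has N rung edges plus 2*(N-1) rail edges.
print(len(ladder_graph_edges(3)))  # → 7
```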
>>
>> The same is true for the algorithm before your patches: I stopped it
>> after 5 minutes.
>> After your patches, it takes <50ms.
>>
>> This is tested with MemorySSA, which only builds SSA for one variable
>> and does no pruning, so it is very easy to test against.
>> It's safe to say it's always an improvement there.
>>
>> For PromoteMemoryToRegister, besides generating code that defeats the
>> pruning, I tested across a ton of different types of code for both of
>> these, and can't find a single case where it's slower.
>>
>>
>>> On Mon, Apr 6, 2015 at 4:39 PM, Cameron Zwarich <zwarich at apple.com> wrote:
>>> How does perf look relative to the current code in the tree?
>>>
>>> Cameron
>>>
>>>> On Apr 6, 2015, at 3:57 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>>
>>>> So, I can't find a testcase up to 100k BBs where *anything* matters
>>>> in terms of a vs. b.
>>>> I also tried it on 50k functions of 1000 BBs each.
>>>>
>>>> In trying to use SetVector, I found another set of much simpler
>>>> reasons it won't work:
>>>>
>>>> 1. We clear the worklist every time we pop from the PQ. We can't do
>>>> this with SetVector.
>>>> 2. We pop values off the worklist, which would erase them from the
>>>> SetVector, but as implemented, we don't want them revisited anyway.
>>>>
>>>> I.e., right now, a value inserted into the visited worklist stays
>>>> there for the entire function, but the worklist itself is cleared
>>>> every time we pop something off the PQ.
>>>>
>>>>
>>>> #1 is pointless (in the sense that the worklist is always empty at
>>>> the point it's called; I assume it's there to do shrink-to-fit or
>>>> something).
>>>>
>>>> Unless #2 is a bug, we can't use setvector.
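In other words (a minimal Python sketch, not LLVM's actual code, with invented names): the visited set must outlive every drain of the worklist, which is exactly the coupling a SetVector would impose:

```python
def walk_once(successors, roots):
    """Visit every node reachable from `roots` exactly once.
    `visited` persists for the whole run; `worklist` is recreated
    (i.e. cleared) for each root, mirroring one priority-queue pop.
    Merging the two into a SetVector would mean that clearing the
    worklist, or popping a node off it, also forgot that the node was
    visited -- which is points 1 and 2 above."""
    visited = set()
    for root in roots:
        worklist = []              # fresh (cleared) per "PQ pop"
        if root not in visited:
            visited.add(root)
            worklist.append(root)
        while worklist:
            node = worklist.pop()  # must NOT un-visit `node`
            for succ in successors.get(node, ()):
                if succ not in visited:
                    visited.add(succ)
                    worklist.append(succ)
    return visited

g = {"a": ["b", "c"], "b": ["c"], "c": []}
print(sorted(walk_once(g, ["a", "b"])))  # → ['a', 'b', 'c']
```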
>>>>
>>>>
>>>>
>>>>
>>>>> On Mon, Apr 6, 2015 at 3:17 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>>> I'll time it.
>>>>>
>>>>> I'll use a GCC testcase that generates fully connected graphs.
>>>>>
>>>>> (I've already tried 10000+ BB cases with ifs and merges, and
>>>>> can't get a vs. b to have any real time difference.)
>>>>>
>>>>>
>>>>>> On Mon, Apr 6, 2015 at 2:04 PM, Chandler Carruth <chandlerc at google.com> wrote:
>>>>>> Sorry, I clearly meant SmallSetVector (which uses SmallVector and
>>>>>> SmallPtrSet).
>>>>>>
>>>>>> Also, I was not supposing you would touch the blocks thing at all, nor
>>>>>> thinking this would remove any non-determinism. It was just that if
>>>>>> you're pushing onto a vector only when insertion into a set vector
>>>>>> succeeds, it might make sense. There are still plenty of cases where
>>>>>> it doesn't, either due to type mismatches, or because you want to keep
>>>>>> growing the set while shrinking and re-using space in the vector.
>>>>>>
>>>>>>> On Mon, Apr 6, 2015 at 2:02 PM Cameron Zwarich <zwarich at apple.com> wrote:
>>>>>>>
>>>>>>> Yeah, that’s a bigger problem. My original complaints 2) and 3) are
>>>>>>> invalid because those are template parameters that could presumably
>>>>>>> be swapped out for SmallVector / SmallPtrSet anyway.
>>>>>>>
>>>>>>> Cameron
>>>>>>>
>>>>>>>> On Apr 6, 2015, at 2:00 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>>>>>>
>>>>>>>> Unless I'm missing something, you can't use SetVector anyway,
>>>>>>>> because it won't remove the non-determinism the sort is there to fix.
>>>>>>>>
>>>>>>>> That non-determinism comes from the fact that DefBlocks is a
>>>>>>>> SmallPtrSet, and even if the priority queue's order is stable, the
>>>>>>>> only guarantee you get is that things at the same dom level end up
>>>>>>>> around the same place in the priority queue (but not in a specific
>>>>>>>> order).
>>>>>>>>
>>>>>>>>
>>>>>>>> If DomTreeNodePair were a DomTreeNodeTriple (<BasicBlock, DomLevel,
>>>>>>>> BBNumbers[BB]>), sorted by the second and then the third element of
>>>>>>>> the tuple, DFBlocks could be a std::vector, you could use SetVector,
>>>>>>>> and you wouldn't need to sort anything, because the order would
>>>>>>>> already be deterministic.
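As a rough illustration of that tie-breaking idea (a Python sketch with invented names, using heapq rather than std::priority_queue):

```python
import heapq

def pop_order(blocks):
    """blocks: (name, dom_level, bb_number) triples. heapq is a min-heap
    and the IDF queue pops deepest levels first, so negate the level;
    bb_number breaks ties among equal-level blocks deterministically,
    instead of leaving them in whatever order a pointer set yields."""
    heap = [(-level, number, name) for name, level, number in blocks]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in blocks]

blocks = [("d", 2, 7), ("b", 3, 1), ("c", 3, 4), ("a", 1, 0)]
print(pop_order(blocks))  # → ['b', 'c', 'd', 'a']
```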
>>>>>>>>
>>>>>>>> Whether any of this is worth it, no idea.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Apr 6, 2015 at 1:56 PM, Cameron Zwarich <zwarich at apple.com>
>>>>>>>> wrote:
>>>>>>>>> I just thought it would make aesthetic sense to have less code in
>>>>>>>>> the inner loop, but that’s arbitrary. If people think the ‘a’ patch
>>>>>>>>> makes more sense and it’s equivalent performance-wise, then I’m fine
>>>>>>>>> with that. Code is made for reading, not writing. :P
>>>>>>>>>
>>>>>>>>> I’m not sure SetVector makes sense here for a few reasons:
>>>>>>>>>
>>>>>>>>> 1) The set and the vector store different types; the vector also
>>>>>>>>> contains the BB number.
>>>>>>>>> 2) The vector in SetVector is a std::vector, not a SmallVector.
>>>>>>>>> 3) The set in SetVector is a SmallSet with a fixed size of 16
>>>>>>>>> (smaller than the 32 used here), and the ‘large’ case of SmallSet
>>>>>>>>> is a std::set, which is probably slower than SmallPtrSet’s ‘large’
>>>>>>>>> case.
>>>>>>>>>
>>>>>>>>> Cameron
>>>>>>>>>
>>>>>>>>> On Apr 6, 2015, at 1:40 PM, Chandler Carruth <chandlerc at google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I'd like to understand why Cameron would prefer the 'b' patch to
>>>>>>>>> the 'a' patch. AFAICT, the 'b' patch doesn't save any actual memory
>>>>>>>>> (sadly).
>>>>>>>>>
>>>>>>>>> Also, if we go with 'a', would it make sense to use a SetVector
>>>>>>>>> rather than a separate vector and set?
>>>>>>>>>
>>>>>>>>> On Mon, Apr 6, 2015 at 1:36 PM Daniel Berlin <dberlin at dberlin.org>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> I'm running numbers on both approaches now, to see if there is any
>>>>>>>>>> real difference in speed.
>>>>>>>>>>
>>>>>>>>>> (Otherwise, I think the one with two visited worklists is easier
>>>>>>>>>> to understand, unless someone else wants to disagree. :P)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Apr 6, 2015 at 11:54 AM, Quentin Colombet
>>>>>>>>>> <qcolombet at apple.com>
>>>>>>>>>> wrote:
>>>>>>>>>>> Hi Cameron,
>>>>>>>>>>>
>>>>>>>>>>> Sounds good.
>>>>>>>>>>>
>>>>>>>>>>> Let us wait for the complete testing from Daniel before reviewing
>>>>>>>>>>> the patch.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> -Quentin
>>>>>>>>>>>> On Apr 6, 2015, at 11:27 AM, Cameron Zwarich <zwarich at apple.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> It was pointed out to me (without any specifics) that the
>>>>>>>>>>>> iterated dominance frontier algorithm in
>>>>>>>>>>>> PromoteMemoryToRegister.cpp has O(n^2) worst-case behavior.
>>>>>>>>>>>>
>>>>>>>>>>>> I inspected the code and think I found the cause. The code uses a
>>>>>>>>>>>> priority queue and a worklist, which share the same visited set, but
>>>>>>>>>>>> the
>>>>>>>>>>>> visited set is only updated when inserting into the priority queue.
>>>>>>>>>>>> The
>>>>>>>>>>>> original Sreedhar-Gao paper effectively has a second visited set
>>>>>>>>>>>> (the InPhi
>>>>>>>>>>>> flag) which is used for the priority queue, and the set called
>>>>>>>>>>>> Visited is
>>>>>>>>>>>> used for the recursive traversal that is done here with a worklist.
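The two-set scheme described above might be sketched in Python roughly as follows. This is an illustrative reconstruction, not the code in either patch: all names are invented, and the dominator-level comparison stands in for the paper's dominance-frontier membership test.

```python
import heapq

def compute_idf(dom_children, cfg_succs, dom_level, def_blocks):
    """Sreedhar-Gao-style IDF walk with TWO sets: `enqueued` plays the
    InPhi role, guarding priority-queue insertion, while `visited`
    guards the dominator-subtree worklist walk and persists across
    pops. Sharing one set, updated only on queue insertion, lets the
    subtree walk revisit nodes and blow up to O(n^2)."""
    pq = [(-dom_level[b], b) for b in def_blocks]
    heapq.heapify(pq)
    enqueued = set(def_blocks)  # blocks ever pushed onto the PQ
    visited = set()             # blocks whose subtree walk is done
    phi_blocks = set()
    while pq:
        neg_level, root = heapq.heappop(pq)
        root_level = -neg_level
        worklist = [root]
        visited.add(root)
        while worklist:
            node = worklist.pop()
            # A CFG successor no deeper than the root lies on the
            # root's dominance frontier (the paper's level test).
            for succ in cfg_succs.get(node, ()):
                if dom_level[succ] <= root_level and succ not in phi_blocks:
                    phi_blocks.add(succ)
                    if succ not in enqueued:
                        enqueued.add(succ)
                        heapq.heappush(pq, (-dom_level[succ], succ))
            # Descend the dominator tree; `visited` persists globally,
            # so no subtree is ever walked twice.
            for child in dom_children.get(node, ()):
                if child not in visited:
                    visited.add(child)
                    worklist.append(child)
    return phi_blocks

# A diamond CFG: entry -> {a, b} -> merge; a def in `a` places a phi at merge.
dom_children = {"entry": ["a", "b", "merge"]}
cfg_succs = {"entry": ["a", "b"], "a": ["merge"], "b": ["merge"]}
dom_level = {"entry": 0, "a": 1, "b": 1, "merge": 1}
print(compute_idf(dom_children, cfg_succs, dom_level, {"a"}))  # → {'merge'}
```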
>>>>>>>>>>>>
>>>>>>>>>>>> I’ve attached two patches: one which just adds a second visited
>>>>>>>>>>>> set, and another which leverages the fact that one of the visited
>>>>>>>>>>>> sets is actually the IDF. I would prefer the latter if it has
>>>>>>>>>>>> equal performance with the first.
>>>>>>>>>>>>
>>>>>>>>>>>> They both pass `make check`, but I’m not sure I’ll have time to give
>>>>>>>>>>>> these patches the testing they’ll deserve in the next few days.
>>>>>>>>>>>> Daniel
>>>>>>>>>>>> Berlin has offered to test them more thoroughly for me.
>>>>>>>>>>>>
>>>>>>>>>>>> Note that there is still one difference from the paper. The
>>>>>>>>>>>> paper uses a custom linked data structure instead of a priority
>>>>>>>>>>>> queue, which takes advantage of the property that the level of
>>>>>>>>>>>> all nodes being inserted is at most the current level. The code
>>>>>>>>>>>> in LLVM uses a priority queue based on a binary heap. This means
>>>>>>>>>>>> that the worst case is O(n log n), but I’d be surprised if the
>>>>>>>>>>>> difference matters in practice.
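That invariant is what the paper's structure exploits; a hypothetical sketch (invented names, not the paper's exact data structure) of how it avoids the heap's log factor:

```python
class LevelBuckets:
    """An array of per-level stacks plus a cursor that only moves
    toward level 0. Because every inserted node's level is at most the
    current level, push and pop are O(1) amortized, versus O(log n)
    per operation for a binary-heap priority queue."""

    def __init__(self, max_level):
        self.buckets = [[] for _ in range(max_level + 1)]
        self.current = max_level

    def push(self, node, level):
        # Invariant from the paper: level <= self.current.
        self.buckets[level].append(node)

    def pop(self):
        # Scan downward from the cursor to the deepest non-empty level.
        while self.current >= 0 and not self.buckets[self.current]:
            self.current -= 1
        return self.buckets[self.current].pop() if self.current >= 0 else None

q = LevelBuckets(3)
q.push("x", 3); q.push("y", 1); q.push("z", 3)
print(q.pop(), q.pop(), q.pop())  # → z x y
```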
>>>>>>>>>>>>
>>>>>>>>>>>> Cameron
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> <idf-faster-a.patch><idf-faster-b.patch>_______________________________________________
>>>>>>>>>>>> llvm-commits mailing list
>>>>>>>>>>>> llvm-commits at cs.uiuc.edu
>>>>>>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>> <laddergraph.py>



