[PATCH] Improvements to SSA construction

Fri Apr 10 09:21:18 PDT 2015

Just wanted to ask:
Do you guys want anything else in the way of testing, or is it just
waiting on one of (Quentin, Chandler) to look at?

On Mon, Apr 6, 2015 at 9:42 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
> LGTM algorithmically.
>
> On Mon, Apr 6, 2015 at 6:02 PM, Cameron Zwarich <zwarich at apple.com> wrote:
>> Thanks. If someone reviews the content of my 'a' patch then I'll land it.
>>
>> Cameron
>>
>>> On Apr 6, 2015, at 4:59 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>
>>> So, to drive worst case behavior (besides all the other testing), i
>>> wrote the attached to generate ladder graphs of size N, which are the
>>> worst case for DF/IDF algorithms (N^2)
>>>
>>> If you want perfect ladder looking graphs, you have to use
>>> -simplifycfg on the result.
>>> Note: Ladder graphs seems to generate at least n^2 behavior in
>>> GraphWriter, so using a 50000 rung ladder graph doesn't seem sane with
>>> -view-cfg.  It doesn't use n^2 memory at least, but i stopped it
>>> after 15 minutes.
>>>
>>> Anyway, as one might expect, even something like using -domfrontier on
>>> a 50000 rung ladder graph is not possible.
>>>
>>> The same is true for the algorithm before your patches.
>>> Before your patches, i stopped it at 5 minutes.
>>> After your patches, it takes <50ms.
>>>
>>> This is tested with memoryssa, which only builds SSA for one variable,
>>> and does no pruning, which is very easy to test against.
>>> It's safe to say "it's always an improvement" there
>>>
>>> For promotememorytoregisters, besides generating code that defeats the
>>> pruning,  I tested across a ton of different types of code for both of
>>> these, and can't find a single case where it's slower.
>>>
>>>
>>>> On Mon, Apr 6, 2015 at 4:39 PM, Cameron Zwarich <zwarich at apple.com> wrote:
>>>> How does perf look relative to the current code in the tree?
>>>>
>>>> Cameron
>>>>
>>>>> On Apr 6, 2015, at 3:57 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>>>
>>>>> So, i can't find a testcase up to 100k bb's where *anything* matters
>>>>> in terms of a vs b.
>>>>> I also tried it on 50k functions of 1000 bb's as well.
>>>>>
>>>>> In trying to use setvector, i found another set of much simpler
>>>>> reasons it won't work:
>>>>>
>>>>> 1. We clear worklist every time we pop from the PQ. We can't do this
>>>>> with setvector
>>>>> 2. We pop values off the worklist, which would erase them from
>>>>> setvector, but as implemented, we don't want them revisited anyway.
>>>>>
>>>>> IE Right now,  a value inserted into visitedworklist stays there the
>>>>> entire function, but a value inserted into worklist is cleared every
>>>>> time we pop something off the PQ.
>>>>>
>>>>>
>>>>> #1 is pointless (in the sense that the worklist is always empty at the
>>>>> point it's called, i assume it's there to do shrink to fit or
>>>>> something)
>>>>>
>>>>> Unless #2 is a bug, we can't use setvector.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> On Mon, Apr 6, 2015 at 3:17 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>>>> I'll time it.
>>>>>>
>>>>>> I'll use a GCC testcase that generates fully connected graphs.
>>>>>>
>>>>>> (I've already tried 10000+ bb cases with if's and and merges, and
>>>>>> can't get a vs b to have any real time difference)
>>>>>>
>>>>>>
>>>>>>> On Mon, Apr 6, 2015 at 2:04 PM, Chandler Carruth <chandlerc at google.com> wrote:
>>>>>>> Sorry, I clearly meant SmallSetVector (which uses SmallVector and
>>>>>>> SmallPtrSet).
>>>>>>>
>>>>>>> Also, I was not supposing you would touch the blocks thing at all, nor
>>>>>>> thinking this would remove any non-determinism. It was just that if you're
>>>>>>> pushing onto a vector only when insertion into a set vector succeeds, it
>>>>>>> might make sense. There are still plenty of cases where it doesn't, either
>>>>>>> due to type mis-matches, or because you want to keep growing the set while
>>>>>>> shrinking and re-using space in the vector.
>>>>>>>
>>>>>>>> On Mon, Apr 6, 2015 at 2:02 PM Cameron Zwarich <zwarich at apple.com> wrote:
>>>>>>>>
>>>>>>>> Yeah, that’s a bigger problem. My original complaints 2) and 3) are
>>>>>>>> invalid because those are template parameters that could presumedly be
>>>>>>>> swapped out for SmallVector / SmallPtrSet anyways.
>>>>>>>>
>>>>>>>> Cameron
>>>>>>>>
>>>>>>>>> On Apr 6, 2015, at 2:00 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>>>>>>>
>>>>>>>>> Unless i'm missing something, you can't use setvector anyway, because
>>>>>>>>> it won't remove the non-determinism the sort is there to fix.
>>>>>>>>>
>>>>>>>>> That non-determinism comes from the fact that defblocks is a
>>>>>>>>> smallptrset, and even if the priority queue's order is stable, the
>>>>>>>>> only guarantee you get is that things at the same dom level end up
>>>>>>>>> around the same place in the priority queue (but not in a specific
>>>>>>>>> order).
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> If domtreenodepair was domtreenodetriple (<BasicBlock, DomLevel,
>>>>>>>>> BBNumbers[BB]>), and sorted by second then third part of tuple,
>>>>>>>>> dfblocks could be a std::vector and you could use setvector, and not
>>>>>>>>> sort anything, because it would be deterministic.
>>>>>>>>>
>>>>>>>>> Whether any of this is worth it, no idea.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Apr 6, 2015 at 1:56 PM, Cameron Zwarich <zwarich at apple.com>
>>>>>>>>> wrote:
>>>>>>>>>> I just thought it would make aesthetic sense to have less code in the
>>>>>>>>>> inner
>>>>>>>>>> loop, but that’s arbitrary. If people think the ‘a’ patch makes more
>>>>>>>>>> sense
>>>>>>>>>> and it’s equivalent perfwise, then I’m fine with that. Code is made for
>>>>>>>>>> reading, not writing. :P
>>>>>>>>>>
>>>>>>>>>> I’m not sure SetVector makes sense here for a few reasons:
>>>>>>>>>>
>>>>>>>>>> 1) The set and the vector store different types; the vector also
>>>>>>>>>> contains
>>>>>>>>>> the BB number.
>>>>>>>>>> 2) The vector in SetVector is a std::vector, not a SmallVector.
>>>>>>>>>> 3) The set in SetVector is a SmallSet of a fixed size of 16 (smaller
>>>>>>>>>> than
>>>>>>>>>> the 32 used here), and the ‘large’ case of SmallSet is a std::set,
>>>>>>>>>> which is
>>>>>>>>>> probably slower than SmallPtrSet’s ‘large’ case.
>>>>>>>>>>
>>>>>>>>>> Cameron
>>>>>>>>>>
>>>>>>>>>> On Apr 6, 2015, at 1:40 PM, Chandler Carruth <chandlerc at google.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> I'd like to understand why Cameron would prefer the 'b' patch to the
>>>>>>>>>> 'a'
>>>>>>>>>> patch. AFAICT, the 'b' patch doesn't save any actual memory (sadly).
>>>>>>>>>>
>>>>>>>>>> Also, if we go with 'a', would it make sense to use a SetVector rather
>>>>>>>>>> than
>>>>>>>>>> a separate vector and set?
>>>>>>>>>>
>>>>>>>>>> On Mon, Apr 6, 2015 at 1:36 PM Daniel Berlin <dberlin at dberlin.org>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I'm running numbers on both approaches now, to see if there is any
>>>>>>>>>>> real difference in speed.
>>>>>>>>>>>
>>>>>>>>>>> (otherwise, i think the one with two visited worklists is easier to
>>>>>>>>>>> understand, unless someone else wants to disagree :P)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Apr 6, 2015 at 11:54 AM, Quentin Colombet
>>>>>>>>>>> <qcolombet at apple.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>> Hi Cameron,
>>>>>>>>>>>>
>>>>>>>>>>>> Sounds good.
>>>>>>>>>>>>
>>>>>>>>>>>> Let us wait the complete testing from Daniel before reviewing the
>>>>>>>>>>>> patch.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> -Quentin
>>>>>>>>>>>>> On Apr 6, 2015, at 11:27 AM, Cameron Zwarich <zwarich at apple.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> It was pointed out to me (without any specifics) that the iterated
>>>>>>>>>>>>> dominance frontier algorithm in PromoteMemoryToRegister.cpp has
>>>>>>>>>>>>> O(n^2) worst
>>>>>>>>>>>>> case behavior.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I inspected the code and think I found the cause. The code uses a
>>>>>>>>>>>>> priority queue and a worklist, which share the same visited set, but
>>>>>>>>>>>>> the
>>>>>>>>>>>>> visited set is only updated when inserting into the priority queue.
>>>>>>>>>>>>> The
>>>>>>>>>>>>> original Sreedhar-Gao paper effectively has a second visited set
>>>>>>>>>>>>> (the InPhi
>>>>>>>>>>>>> flag) which is used for the priority queue, and the set called
>>>>>>>>>>>>> Visited is
>>>>>>>>>>>>> used for the recursive traversal that is done here with a worklist.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I’ve attached two patches, one which just adds a second visited sit,
>>>>>>>>>>>>> and another which leverages the fact that one of the visited sets is
>>>>>>>>>>>>> actually the IDF. I would prefer the latter if it has equal
>>>>>>>>>>>>> performance with
>>>>>>>>>>>>> the first.
>>>>>>>>>>>>>
>>>>>>>>>>>>> They both pass `make check`, but I’m not sure I’ll have time to give
>>>>>>>>>>>>> these patches the testing they’ll deserve in the next few days.
>>>>>>>>>>>>> Daniel
>>>>>>>>>>>>> Berlin has offered to test them more thoroughly for me.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Note that there is still one difference with the paper. The paper
>>>>>>>>>>>>> uses
>>>>>>>>>>>>> a custom linked data structure instead of a priority queue, which
>>>>>>>>>>>>> takes
>>>>>>>>>>>>> advantage of the property that the level of all nodes being inserted
>>>>>>>>>>>>> is at
>>>>>>>>>>>>> most the current level. The code in LLVM uses a priority queue based
>>>>>>>>>>>>> on a
>>>>>>>>>>>>> binary heap. This means that the worst case is O(n log n), but I’d
>>>>>>>>>>>>> be
>>>>>>>>>>>>> surprised if the difference matters in practice.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cameron
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> <idf-faster-a.patch><idf-faster-b.patch>_______________________________________________
>>>>>>>>>>>>> llvm-commits mailing list
>>>>>>>>>>>>> llvm-commits at cs.uiuc.edu
>>>>>>>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> llvm-commits mailing list
>>>>>>>>>>> llvm-commits at cs.uiuc.edu
>>>>>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> llvm-commits mailing list
>>>>>>>>>> llvm-commits at cs.uiuc.edu
>>>>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>> <laddergraph.py>