[PATCH] Improvements to SSA construction

Daniel Berlin dberlin at dberlin.org
Mon Apr 6 17:26:35 PDT 2015


(Small update to laddergraph.py in case anyone else cares to use it:
it now errors out if you give it an odd number, generates correctly
named bb labels, etc. It has no effect on the numbers, of course.)

On Mon, Apr 6, 2015 at 4:59 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
> So, to drive worst-case behavior (besides all the other testing), I
> wrote the attached script to generate ladder graphs of size N, which
> are the worst case for DF/IDF algorithms (N^2).
>
> If you want perfect ladder-looking graphs, you have to use
> -simplifycfg on the result.
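>
> For anyone who doesn't want to grab the attachment, here is a rough
> sketch of the shape I mean (in C++ rather than the attached Python, and
> only an illustration of the structure, not the script itself):
>
> #include <cstdio>
> #include <cstdlib>
>
> // Emit LLVM IR text for an N-rung ladder CFG: block l<i> branches to
> // l<i+1> and r<i>, and r<i> falls through to r<i+1>.  For i >= 1, every
> // r<j> with j >= i lands in DF(l<i>), so total DF size grows as N^2.
> int main(int argc, char **argv) {
>   int N = argc > 1 ? atoi(argv[1]) : 10;
>   printf("define void @ladder(i1 %%c) {\nentry:\n  br label %%l0\n");
>   for (int i = 0; i < N; ++i) {
>     printf("l%d:\n  br i1 %%c, label %%l%d, label %%r%d\n", i, i + 1, i);
>     printf("r%d:\n  br label %%r%d\n", i, i + 1);
>   }
>   printf("l%d:\n  br label %%exit\nr%d:\n  br label %%exit\n", N, N);
>   printf("exit:\n  ret void\n}\n");
>   return 0;
> }
>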
> Note: Ladder graphs seem to generate at least n^2 behavior in
> GraphWriter, so using a 50000-rung ladder graph doesn't seem sane with
> -view-cfg.  It doesn't use n^2 memory at least, but I stopped it
> after 15 minutes.
>
> Anyway, as one might expect, even something like running -domfrontier
> on a 50000-rung ladder graph is not feasible.
>
> The same is true for the IDF algorithm before your patches: I stopped
> it after 5 minutes.  After your patches, it takes <50ms.
>
> This is tested with MemorySSA, which only builds SSA for one variable
> and does no pruning, so it is very easy to test against.
> It's safe to say it's always an improvement there.
>
> For PromoteMemoryToRegister, besides generating code that defeats the
> pruning, I tested across a ton of different types of code for both of
> these, and can't find a single case where it's slower.
>
>
> On Mon, Apr 6, 2015 at 4:39 PM, Cameron Zwarich <zwarich at apple.com> wrote:
>> How does perf look relative to the current code in the tree?
>>
>> Cameron
>>
>>> On Apr 6, 2015, at 3:57 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>
>>> So, I can't find a testcase up to 100k BBs where *anything* matters
>>> in terms of a vs. b.
>>> I also tried it on 50k functions of 1000 BBs each.
>>>
>>> In trying to use SetVector, I found another set of much simpler
>>> reasons it won't work:
>>>
>>> 1. We clear the worklist every time we pop from the PQ. We can't do
>>> this with a SetVector.
>>> 2. We pop values off the worklist, which would erase them from the
>>> SetVector, but as implemented, we don't want them revisited anyway.
>>>
>>> I.e., right now, a value inserted into VisitedWorklist stays there
>>> for the entire function, but a value inserted into Worklist is
>>> cleared every time we pop something off the PQ.
>>>
>>>
>>> #1 is pointless (in the sense that the worklist is always empty at
>>> the point where it's cleared; I assume the clear is there to do a
>>> shrink-to-fit or something).
>>>
>>> Unless #2 is a bug, we can't use SetVector.
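>>>
>>> To make the lifetime difference concrete, the shape is roughly this
>>> (a toy sketch with plain containers and int node ids, not the actual
>>> PromoteMemoryToRegister.cpp code):
>>>
>>> #include <queue>
>>> #include <set>
>>> #include <utility>
>>> #include <vector>
>>>
>>> // VisitedWorklist persists across every priority-queue entry;
>>> // Worklist is just scratch space for the walk rooted at the current
>>> // entry, so it is cleared on each pop.
>>> void walk(const std::vector<std::vector<int>> &Children,
>>>           std::priority_queue<std::pair<int, int>> PQ /* (level, node) */) {
>>>   std::set<int> VisitedWorklist;  // never cleared during the query
>>>   std::vector<int> Worklist;      // reset for every PQ entry
>>>   while (!PQ.empty()) {
>>>     int Root = PQ.top().second;
>>>     PQ.pop();
>>>     Worklist.clear();
>>>     Worklist.push_back(Root);
>>>     while (!Worklist.empty()) {
>>>       int Node = Worklist.back();
>>>       Worklist.pop_back();        // popping must not allow a revisit
>>>       for (int Child : Children[Node])
>>>         if (VisitedWorklist.insert(Child).second)
>>>           Worklist.push_back(Child);
>>>     }
>>>   }
>>> }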
>>>
>>>
>>>
>>>
>>> On Mon, Apr 6, 2015 at 3:17 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>> I'll time it.
>>>>
>>>> I'll use a GCC testcase that generates fully connected graphs.
>>>>
>>>> (I've already tried 10000+ BB cases with ifs and merges, and can't
>>>> get a vs. b to show any real time difference.)
>>>>
>>>>
>>>> On Mon, Apr 6, 2015 at 2:04 PM, Chandler Carruth <chandlerc at google.com> wrote:
>>>>> Sorry, I clearly meant SmallSetVector (which uses SmallVector and
>>>>> SmallPtrSet).
>>>>>
>>>>> Also, I was not supposing you would touch the blocks thing at all, nor
>>>>> thinking this would remove any non-determinism. It was just that if
>>>>> you're pushing onto a vector only when insertion into a set succeeds,
>>>>> a SetVector might make sense. There are still plenty of cases where it
>>>>> doesn't, either due to type mismatches, or because you want to keep
>>>>> growing the set while shrinking and re-using space in the vector.
>>>>>
>>>>> On Mon, Apr 6, 2015 at 2:02 PM Cameron Zwarich <zwarich at apple.com> wrote:
>>>>>>
>>>>>> Yeah, that’s a bigger problem. My original complaints 2) and 3) are
>>>>>> invalid because those are template parameters that could presumably
>>>>>> be swapped out for SmallVector / SmallPtrSet anyway.
>>>>>>
>>>>>> Cameron
>>>>>>
>>>>>>> On Apr 6, 2015, at 2:00 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>>>>>
>>>>>>> Unless I'm missing something, you can't use SetVector anyway,
>>>>>>> because it won't remove the non-determinism the sort is there to fix.
>>>>>>>
>>>>>>> That non-determinism comes from the fact that DefBlocks is a
>>>>>>> SmallPtrSet, and even if the priority queue's order is stable, the
>>>>>>> only guarantee you get is that things at the same dom level end up
>>>>>>> around the same place in the priority queue (but not in a specific
>>>>>>> order).
>>>>>>>
>>>>>>>
>>>>>>> If DomTreeNodePair were a DomTreeNodeTriple (<BasicBlock, DomLevel,
>>>>>>> BBNumbers[BB]>), sorted by the second and then the third element of
>>>>>>> the tuple, DFBlocks could be a std::vector, you could use a
>>>>>>> SetVector, and you wouldn't need to sort anything, because it would
>>>>>>> already be deterministic.
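>>>>>>>
>>>>>>> Roughly (hypothetical names, untested):
>>>>>>>
>>>>>>> #include "llvm/ADT/SmallVector.h"
>>>>>>> #include "llvm/IR/BasicBlock.h"
>>>>>>> #include <queue>
>>>>>>> #include <tuple>
>>>>>>>
>>>>>>> // <block, dominator-tree level, function-wide BB number>.  Ordering
>>>>>>> // by level and then BB number makes the pop order independent of
>>>>>>> // the SmallPtrSet iteration order of DefBlocks.
>>>>>>> typedef std::tuple<llvm::BasicBlock *, unsigned, unsigned>
>>>>>>>     DomTreeNodeTriple;
>>>>>>>
>>>>>>> struct TripleLess {
>>>>>>>   bool operator()(const DomTreeNodeTriple &LHS,
>>>>>>>                   const DomTreeNodeTriple &RHS) const {
>>>>>>>     return std::tie(std::get<1>(LHS), std::get<2>(LHS)) <
>>>>>>>            std::tie(std::get<1>(RHS), std::get<2>(RHS));
>>>>>>>   }
>>>>>>> };
>>>>>>>
>>>>>>> typedef std::priority_queue<DomTreeNodeTriple,
>>>>>>>                             llvm::SmallVector<DomTreeNodeTriple, 32>,
>>>>>>>                             TripleLess> IDFPriorityQueue;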
>>>>>>>
>>>>>>> Whether any of this is worth it, no idea.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Apr 6, 2015 at 1:56 PM, Cameron Zwarich <zwarich at apple.com>
>>>>>>> wrote:
>>>>>>>> I just thought it would make aesthetic sense to have less code in
>>>>>>>> the inner loop, but that’s arbitrary. If people think the ‘a’ patch
>>>>>>>> makes more sense and it’s equivalent perfwise, then I’m fine with
>>>>>>>> that. Code is made for reading, not writing. :P
>>>>>>>>
>>>>>>>> I’m not sure SetVector makes sense here for a few reasons (its
>>>>>>>> defaults are sketched below):
>>>>>>>>
>>>>>>>> 1) The set and the vector store different types; the vector also
>>>>>>>> contains the BB number.
>>>>>>>> 2) The vector in SetVector is a std::vector, not a SmallVector.
>>>>>>>> 3) The set in SetVector is a SmallSet with a fixed small size of 16
>>>>>>>> (smaller than the 32 used here), and the ‘large’ case of SmallSet
>>>>>>>> is a std::set, which is probably slower than SmallPtrSet’s ‘large’
>>>>>>>> case.
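>>>>>>>>
>>>>>>>> Concretely, the defaults amount to roughly this (paraphrased, not
>>>>>>>> verbatim from llvm/ADT/SetVector.h):
>>>>>>>>
>>>>>>>> #include "llvm/ADT/SmallPtrSet.h"
>>>>>>>> #include "llvm/ADT/SmallSet.h"
>>>>>>>> #include "llvm/ADT/SmallVector.h"
>>>>>>>> #include <vector>
>>>>>>>>
>>>>>>>> // The backing vector defaults to std::vector and the backing set
>>>>>>>> // to SmallSet<T, 16>, whose "large" mode is a std::set.
>>>>>>>> template <typename T, typename Vector = std::vector<T>,
>>>>>>>>           typename Set = llvm::SmallSet<T, 16>>
>>>>>>>> class SetVectorSketch { /* push onto Vector only if Set insert succeeds */ };
>>>>>>>>
>>>>>>>> // SmallSetVector swaps those defaults for SmallVector / SmallPtrSet.
>>>>>>>> template <typename T, unsigned N>
>>>>>>>> class SmallSetVectorSketch
>>>>>>>>     : public SetVectorSketch<T, llvm::SmallVector<T, N>,
>>>>>>>>                              llvm::SmallPtrSet<T, N>> {};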
>>>>>>>>
>>>>>>>> Cameron
>>>>>>>>
>>>>>>>> On Apr 6, 2015, at 1:40 PM, Chandler Carruth <chandlerc at google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I'd like to understand why Cameron would prefer the 'b' patch to
>>>>>>>> the 'a' patch. AFAICT, the 'b' patch doesn't save any actual memory
>>>>>>>> (sadly).
>>>>>>>>
>>>>>>>> Also, if we go with 'a', would it make sense to use a SetVector
>>>>>>>> rather than a separate vector and set?
>>>>>>>>
>>>>>>>> On Mon, Apr 6, 2015 at 1:36 PM Daniel Berlin <dberlin at dberlin.org>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I'm running numbers on both approaches now, to see if there is
>>>>>>>>> any real difference in speed.
>>>>>>>>>
>>>>>>>>> (Otherwise, I think the one with two visited worklists is easier
>>>>>>>>> to understand, unless someone else wants to disagree. :P)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Apr 6, 2015 at 11:54 AM, Quentin Colombet
>>>>>>>>> <qcolombet at apple.com>
>>>>>>>>> wrote:
>>>>>>>>>> Hi Cameron,
>>>>>>>>>>
>>>>>>>>>> Sounds good.
>>>>>>>>>>
>>>>>>>>>> Let us wait for the complete testing from Daniel before
>>>>>>>>>> reviewing the patch.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> -Quentin
>>>>>>>>>>> On Apr 6, 2015, at 11:27 AM, Cameron Zwarich <zwarich at apple.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> It was pointed out to me (without any specifics) that the
>>>>>>>>>>> iterated dominance frontier algorithm in
>>>>>>>>>>> PromoteMemoryToRegister.cpp has O(n^2) worst-case behavior.
>>>>>>>>>>>
>>>>>>>>>>> I inspected the code and think I found the cause. The code uses
>>>>>>>>>>> a priority queue and a worklist, which share the same visited
>>>>>>>>>>> set, but the visited set is only updated when inserting into the
>>>>>>>>>>> priority queue. The original Sreedhar-Gao paper effectively has
>>>>>>>>>>> a second visited set (the InPhi flag) which is used for the
>>>>>>>>>>> priority queue, and the set called Visited is used for the
>>>>>>>>>>> recursive traversal that is done here with a worklist.
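>>>>>>>>>>>
>>>>>>>>>>> In (simplified) code, the structure I have in mind is roughly
>>>>>>>>>>> the following -- toy types and plain std containers, not either
>>>>>>>>>>> patch verbatim:
>>>>>>>>>>>
>>>>>>>>>>> #include <queue>
>>>>>>>>>>> #include <set>
>>>>>>>>>>> #include <utility>
>>>>>>>>>>> #include <vector>
>>>>>>>>>>>
>>>>>>>>>>> struct Node {
>>>>>>>>>>>   int Level;                    // dominator-tree level
>>>>>>>>>>>   std::vector<int> Succs;       // CFG successors
>>>>>>>>>>>   std::vector<int> DomChildren; // dominator-tree children
>>>>>>>>>>> };
>>>>>>>>>>>
>>>>>>>>>>> // VisitedPQ guards phi placement / priority-queue insertion (the
>>>>>>>>>>> // paper's InPhi flag); VisitedWorklist guards the walk over the
>>>>>>>>>>> // dominator subtree (the paper's Visited flag).
>>>>>>>>>>> std::vector<int> idf(const std::vector<Node> &G,
>>>>>>>>>>>                      const std::vector<int> &Defs) {
>>>>>>>>>>>   std::priority_queue<std::pair<int, int>> PQ; // (level, node)
>>>>>>>>>>>   for (int D : Defs)
>>>>>>>>>>>     PQ.push({G[D].Level, D});
>>>>>>>>>>>   std::set<int> VisitedPQ, VisitedWorklist;
>>>>>>>>>>>   std::vector<int> Worklist, PhiBlocks;
>>>>>>>>>>>   while (!PQ.empty()) {
>>>>>>>>>>>     int RootLevel = PQ.top().first;
>>>>>>>>>>>     int Root = PQ.top().second;
>>>>>>>>>>>     PQ.pop();
>>>>>>>>>>>     Worklist.assign(1, Root);
>>>>>>>>>>>     VisitedWorklist.insert(Root);
>>>>>>>>>>>     while (!Worklist.empty()) {
>>>>>>>>>>>       int N = Worklist.back();
>>>>>>>>>>>       Worklist.pop_back();
>>>>>>>>>>>       // A successor no deeper than Root is not strictly
>>>>>>>>>>>       // dominated by Root, so it is in Root's dominance frontier.
>>>>>>>>>>>       for (int S : G[N].Succs)
>>>>>>>>>>>         if (G[S].Level <= RootLevel && VisitedPQ.insert(S).second) {
>>>>>>>>>>>           PhiBlocks.push_back(S);
>>>>>>>>>>>           PQ.push({G[S].Level, S});
>>>>>>>>>>>         }
>>>>>>>>>>>       for (int C : G[N].DomChildren) // keep walking Root's subtree
>>>>>>>>>>>         if (VisitedWorklist.insert(C).second)
>>>>>>>>>>>           Worklist.push_back(C);
>>>>>>>>>>>     }
>>>>>>>>>>>   }
>>>>>>>>>>>   return PhiBlocks;
>>>>>>>>>>> }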
>>>>>>>>>>>
>>>>>>>>>>> I’ve attached two patches: one which just adds a second visited
>>>>>>>>>>> set, and another which leverages the fact that one of the
>>>>>>>>>>> visited sets is actually the IDF. I would prefer the latter if
>>>>>>>>>>> it has performance equal to the first.
>>>>>>>>>>>
>>>>>>>>>>> They both pass `make check`, but I’m not sure I’ll have time to
>>>>>>>>>>> give these patches the testing they deserve in the next few
>>>>>>>>>>> days. Daniel Berlin has offered to test them more thoroughly for
>>>>>>>>>>> me.
>>>>>>>>>>>
>>>>>>>>>>> Note that there is still one difference from the paper. The
>>>>>>>>>>> paper uses a custom linked data structure instead of a priority
>>>>>>>>>>> queue, which takes advantage of the property that the level of
>>>>>>>>>>> all nodes being inserted is at most the current level. The code
>>>>>>>>>>> in LLVM uses a priority queue based on a binary heap. This means
>>>>>>>>>>> that the worst case is O(n log n), but I’d be surprised if the
>>>>>>>>>>> difference matters in practice.
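>>>>>>>>>>>
>>>>>>>>>>> (For reference, one way to get rid of the log n -- a sketch of
>>>>>>>>>>> the usual bucket-per-level trick, not the paper's exact
>>>>>>>>>>> structure: since insertions never land in a bucket deeper than
>>>>>>>>>>> the one currently being drained, a single deepest-first scan
>>>>>>>>>>> over per-level buckets never has to back up.)
>>>>>>>>>>>
>>>>>>>>>>> #include <vector>
>>>>>>>>>>>
>>>>>>>>>>> struct LevelBuckets {
>>>>>>>>>>>   std::vector<std::vector<int>> Buckets; // Buckets[L] = nodes at level L
>>>>>>>>>>>   int CurLevel;
>>>>>>>>>>>
>>>>>>>>>>>   explicit LevelBuckets(int MaxLevel)
>>>>>>>>>>>       : Buckets(MaxLevel + 1), CurLevel(MaxLevel) {}
>>>>>>>>>>>
>>>>>>>>>>>   void push(int Node, int Level) { Buckets[Level].push_back(Node); }
>>>>>>>>>>>
>>>>>>>>>>>   int pop() { // -1 when empty; a deepest remaining node otherwise
>>>>>>>>>>>     while (CurLevel >= 0 && Buckets[CurLevel].empty())
>>>>>>>>>>>       --CurLevel;
>>>>>>>>>>>     if (CurLevel < 0)
>>>>>>>>>>>       return -1;
>>>>>>>>>>>     int N = Buckets[CurLevel].back();
>>>>>>>>>>>     Buckets[CurLevel].pop_back();
>>>>>>>>>>>     return N;
>>>>>>>>>>>   }
>>>>>>>>>>> };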
>>>>>>>>>>>
>>>>>>>>>>> Cameron
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> <idf-faster-a.patch><idf-faster-b.patch>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: laddergraph.py
Type: text/x-python-script
Size: 594 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150406/4029ad53/attachment.bin>

