[PATCH] Improvements to SSA construction
Cameron Zwarich
zwarich at apple.com
Mon Apr 6 18:02:06 PDT 2015
Thanks. If someone reviews the content of my 'a' patch then I'll land it.
Cameron
> On Apr 6, 2015, at 4:59 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>
> So, to drive worst case behavior (besides all the other testing), i
> wrote the attached to generate ladder graphs of size N, which are the
> worst case for DF/IDF algorithms (N^2)
>
> If you want perfect ladder looking graphs, you have to use
> -simplifycfg on the result.
> Note: Ladder graphs seems to generate at least n^2 behavior in
> GraphWriter, so using a 50000 rung ladder graph doesn't seem sane with
> -view-cfg. It doesn't use n^2 memory at least, but i stopped it
> after 15 minutes.
>
> Anyway, as one might expect, even something like using -domfrontier on
> a 50000 rung ladder graph is not possible.
>
> The same is true for the algorithm before your patches.
> Before your patches, i stopped it at 5 minutes.
> After your patches, it takes <50ms.
>
> This is tested with memoryssa, which only builds SSA for one variable,
> and does no pruning, which is very easy to test against.
> It's safe to say "it's always an improvement" there
>
> For promotememorytoregisters, besides generating code that defeats the
> pruning, I tested across a ton of different types of code for both of
> these, and can't find a single case where it's slower.
>
>
>> On Mon, Apr 6, 2015 at 4:39 PM, Cameron Zwarich <zwarich at apple.com> wrote:
>> How does perf look relative to the current code in the tree?
>>
>> Cameron
>>
>>> On Apr 6, 2015, at 3:57 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>
>>> So, i can't find a testcase up to 100k bb's where *anything* matters
>>> in terms of a vs b.
>>> I also tried it on 50k functions of 1000 bb's as well.
>>>
>>> In trying to use setvector, i found another set of much simpler
>>> reasons it won't work:
>>>
>>> 1. We clear worklist every time we pop from the PQ. We can't do this
>>> with setvector
>>> 2. We pop values off the worklist, which would erase them from
>>> setvector, but as implemented, we don't want them revisited anyway.
>>>
>>> IE Right now, a value inserted into visitedworklist stays there the
>>> entire function, but a value inserted into worklist is cleared every
>>> time we pop something off the PQ.
>>>
>>>
>>> #1 is pointless (in the sense that the worklist is always empty at the
>>> point it's called, i assume it's there to do shrink to fit or
>>> something)
>>>
>>> Unless #2 is a bug, we can't use setvector.
>>>
>>>
>>>
>>>
>>>> On Mon, Apr 6, 2015 at 3:17 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>> I'll time it.
>>>>
>>>> I'll use a GCC testcase that generates fully connected graphs.
>>>>
>>>> (I've already tried 10000+ bb cases with if's and and merges, and
>>>> can't get a vs b to have any real time difference)
>>>>
>>>>
>>>>> On Mon, Apr 6, 2015 at 2:04 PM, Chandler Carruth <chandlerc at google.com> wrote:
>>>>> Sorry, I clearly meant SmallSetVector (which uses SmallVector and
>>>>> SmallPtrSet).
>>>>>
>>>>> Also, I was not supposing you would touch the blocks thing at all, nor
>>>>> thinking this would remove any non-determinism. It was just that if you're
>>>>> pushing onto a vector only when insertion into a set vector succeeds, it
>>>>> might make sense. There are still plenty of cases where it doesn't, either
>>>>> due to type mis-matches, or because you want to keep growing the set while
>>>>> shrinking and re-using space in the vector.
>>>>>
>>>>>> On Mon, Apr 6, 2015 at 2:02 PM Cameron Zwarich <zwarich at apple.com> wrote:
>>>>>>
>>>>>> Yeah, that’s a bigger problem. My original complaints 2) and 3) are
>>>>>> invalid because those are template parameters that could presumedly be
>>>>>> swapped out for SmallVector / SmallPtrSet anyways.
>>>>>>
>>>>>> Cameron
>>>>>>
>>>>>>> On Apr 6, 2015, at 2:00 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>>>>>
>>>>>>> Unless i'm missing something, you can't use setvector anyway, because
>>>>>>> it won't remove the non-determinism the sort is there to fix.
>>>>>>>
>>>>>>> That non-determinism comes from the fact that defblocks is a
>>>>>>> smallptrset, and even if the priority queue's order is stable, the
>>>>>>> only guarantee you get is that things at the same dom level end up
>>>>>>> around the same place in the priority queue (but not in a specific
>>>>>>> order).
>>>>>>>
>>>>>>>
>>>>>>> If domtreenodepair was domtreenodetriple (<BasicBlock, DomLevel,
>>>>>>> BBNumbers[BB]>), and sorted by second then third part of tuple,
>>>>>>> dfblocks could be a std::vector and you could use setvector, and not
>>>>>>> sort anything, because it would be deterministic.
>>>>>>>
>>>>>>> Whether any of this is worth it, no idea.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Apr 6, 2015 at 1:56 PM, Cameron Zwarich <zwarich at apple.com>
>>>>>>> wrote:
>>>>>>>> I just thought it would make aesthetic sense to have less code in the
>>>>>>>> inner
>>>>>>>> loop, but that’s arbitrary. If people think the ‘a’ patch makes more
>>>>>>>> sense
>>>>>>>> and it’s equivalent perfwise, then I’m fine with that. Code is made for
>>>>>>>> reading, not writing. :P
>>>>>>>>
>>>>>>>> I’m not sure SetVector makes sense here for a few reasons:
>>>>>>>>
>>>>>>>> 1) The set and the vector store different types; the vector also
>>>>>>>> contains
>>>>>>>> the BB number.
>>>>>>>> 2) The vector in SetVector is a std::vector, not a SmallVector.
>>>>>>>> 3) The set in SetVector is a SmallSet of a fixed size of 16 (smaller
>>>>>>>> than
>>>>>>>> the 32 used here), and the ‘large’ case of SmallSet is a std::set,
>>>>>>>> which is
>>>>>>>> probably slower than SmallPtrSet’s ‘large’ case.
>>>>>>>>
>>>>>>>> Cameron
>>>>>>>>
>>>>>>>> On Apr 6, 2015, at 1:40 PM, Chandler Carruth <chandlerc at google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I'd like to understand why Cameron would prefer the 'b' patch to the
>>>>>>>> 'a'
>>>>>>>> patch. AFAICT, the 'b' patch doesn't save any actual memory (sadly).
>>>>>>>>
>>>>>>>> Also, if we go with 'a', would it make sense to use a SetVector rather
>>>>>>>> than
>>>>>>>> a separate vector and set?
>>>>>>>>
>>>>>>>> On Mon, Apr 6, 2015 at 1:36 PM Daniel Berlin <dberlin at dberlin.org>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> I'm running numbers on both approaches now, to see if there is any
>>>>>>>>> real difference in speed.
>>>>>>>>>
>>>>>>>>> (otherwise, i think the one with two visited worklists is easier to
>>>>>>>>> understand, unless someone else wants to disagree :P)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Apr 6, 2015 at 11:54 AM, Quentin Colombet
>>>>>>>>> <qcolombet at apple.com>
>>>>>>>>> wrote:
>>>>>>>>>> Hi Cameron,
>>>>>>>>>>
>>>>>>>>>> Sounds good.
>>>>>>>>>>
>>>>>>>>>> Let us wait the complete testing from Daniel before reviewing the
>>>>>>>>>> patch.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> -Quentin
>>>>>>>>>>> On Apr 6, 2015, at 11:27 AM, Cameron Zwarich <zwarich at apple.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> It was pointed out to me (without any specifics) that the iterated
>>>>>>>>>>> dominance frontier algorithm in PromoteMemoryToRegister.cpp has
>>>>>>>>>>> O(n^2) worst
>>>>>>>>>>> case behavior.
>>>>>>>>>>>
>>>>>>>>>>> I inspected the code and think I found the cause. The code uses a
>>>>>>>>>>> priority queue and a worklist, which share the same visited set, but
>>>>>>>>>>> the
>>>>>>>>>>> visited set is only updated when inserting into the priority queue.
>>>>>>>>>>> The
>>>>>>>>>>> original Sreedhar-Gao paper effectively has a second visited set
>>>>>>>>>>> (the InPhi
>>>>>>>>>>> flag) which is used for the priority queue, and the set called
>>>>>>>>>>> Visited is
>>>>>>>>>>> used for the recursive traversal that is done here with a worklist.
>>>>>>>>>>>
>>>>>>>>>>> I’ve attached two patches, one which just adds a second visited sit,
>>>>>>>>>>> and another which leverages the fact that one of the visited sets is
>>>>>>>>>>> actually the IDF. I would prefer the latter if it has equal
>>>>>>>>>>> performance with
>>>>>>>>>>> the first.
>>>>>>>>>>>
>>>>>>>>>>> They both pass `make check`, but I’m not sure I’ll have time to give
>>>>>>>>>>> these patches the testing they’ll deserve in the next few days.
>>>>>>>>>>> Daniel
>>>>>>>>>>> Berlin has offered to test them more thoroughly for me.
>>>>>>>>>>>
>>>>>>>>>>> Note that there is still one difference with the paper. The paper
>>>>>>>>>>> uses
>>>>>>>>>>> a custom linked data structure instead of a priority queue, which
>>>>>>>>>>> takes
>>>>>>>>>>> advantage of the property that the level of all nodes being inserted
>>>>>>>>>>> is at
>>>>>>>>>>> most the current level. The code in LLVM uses a priority queue based
>>>>>>>>>>> on a
>>>>>>>>>>> binary heap. This means that the worst case is O(n log n), but I’d
>>>>>>>>>>> be
>>>>>>>>>>> surprised if the difference matters in practice.
>>>>>>>>>>>
>>>>>>>>>>> Cameron
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> <idf-faster-a.patch><idf-faster-b.patch>_______________________________________________
>>>>>>>>>>> llvm-commits mailing list
>>>>>>>>>>> llvm-commits at cs.uiuc.edu
>>>>>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> llvm-commits mailing list
>>>>>>>>> llvm-commits at cs.uiuc.edu
>>>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> llvm-commits mailing list
>>>>>>>> llvm-commits at cs.uiuc.edu
>>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> <laddergraph.py>
More information about the llvm-commits
mailing list