[PATCH] Improvements to SSA construction

Cameron Zwarich zwarich at apple.com
Mon Apr 6 16:39:34 PDT 2015


How does perf look relative to the current code in the tree?

Cameron

> On Apr 6, 2015, at 3:57 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
> 
> So, I can't find a testcase of up to 100k BBs where *anything* matters
> in terms of a vs. b.
> I also tried it on 50k functions of 1000 BBs.
> 
> In trying to use SetVector, I found another set of much simpler
> reasons it won't work:
> 
> 1. We clear the worklist every time we pop from the PQ. We can't do
> that with a SetVector.
> 2. We pop values off the worklist; with a SetVector that would also
> erase them from the set, but as implemented, we don't want them
> revisited anyway.
> 
> I.e., right now a value inserted into VisitedWorklist stays there for
> the entire function, but a value inserted into Worklist is dropped
> every time we pop something off the PQ, because the worklist is
> cleared.
> 
> 
> #1 is pointless as far as I can tell (the worklist is always empty at
> the point clear() is called; I assume it's there to shrink-to-fit or
> something).
> 
> Unless #2 is a bug, we can't use SetVector.
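> 
> To make that concrete, the shape of the inner walk is roughly this (a
> sketch with a made-up helper and simplified to BasicBlocks; the real
> code walks dominator-tree nodes rather than raw blocks):
> 
>   #include "llvm/ADT/SmallPtrSet.h"
>   #include "llvm/ADT/SmallVector.h"
>   #include "llvm/IR/BasicBlock.h"
>   #include "llvm/IR/CFG.h"
>   using namespace llvm;
> 
>   // VisitedWorklist persists for the whole function; Worklist is wiped
>   // every time something is popped off the PQ.
>   static void walkFrom(BasicBlock *Root,
>                        SmallVectorImpl<BasicBlock *> &Worklist,
>                        SmallPtrSetImpl<BasicBlock *> &VisitedWorklist) {
>     Worklist.clear();             // SetVector::clear() would also throw
>                                   // away the "already seen" state we keep
>     VisitedWorklist.insert(Root);
>     Worklist.push_back(Root);
>     while (!Worklist.empty()) {
>       BasicBlock *BB = Worklist.pop_back_val(); // SetVector::pop_back()
>                                                 // erases BB from the set,
>                                                 // but BB must never be
>                                                 // revisited
>       for (succ_iterator SI = succ_begin(BB), SE = succ_end(BB); SI != SE;
>            ++SI)
>         if (VisitedWorklist.insert(*SI).second)
>           Worklist.push_back(*SI);
>     }
>   }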
> 
> 
> 
> 
> On Mon, Apr 6, 2015 at 3:17 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>> I'll time it.
>> 
>> I'll use a GCC testcase that generates fully connected graphs.
>> 
>> (I've already tried 10000+ BB cases with ifs and merges, and
>> can't get a vs. b to show any real time difference.)
>> 
>> 
>> On Mon, Apr 6, 2015 at 2:04 PM, Chandler Carruth <chandlerc at google.com> wrote:
>>> Sorry, I clearly meant SmallSetVector (which uses SmallVector and
>>> SmallPtrSet).
>>> 
>>> Also, I was not supposing you would touch the blocks thing at all, nor
>>> thinking this would remove any non-determinism. It was just that if you're
>>> pushing onto a vector only when insertion into a set succeeds, it
>>> might make sense. There are still plenty of cases where it doesn't, either
>>> due to type mismatches, or because you want to keep growing the set while
>>> shrinking and re-using space in the vector.
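>>> 
>>> For reference, the pattern I have in mind (hand-rolled here, with
>>> illustrative names) is just:
>>> 
>>>   #include "llvm/ADT/SmallPtrSet.h"
>>>   #include "llvm/ADT/SmallVector.h"
>>>   #include "llvm/IR/BasicBlock.h"
>>> 
>>>   // What a SetVector/SmallSetVector insert would do in one call:
>>>   static void addOnce(llvm::BasicBlock *BB,
>>>                       llvm::SmallVectorImpl<llvm::BasicBlock *> &Vec,
>>>                       llvm::SmallPtrSetImpl<llvm::BasicBlock *> &Seen) {
>>>     if (Seen.insert(BB).second) // push only when the set insert succeeds
>>>       Vec.push_back(BB);
>>>   }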
>>> 
>>> On Mon, Apr 6, 2015 at 2:02 PM Cameron Zwarich <zwarich at apple.com> wrote:
>>>> 
>>>> Yeah, that’s a bigger problem. My original complaints 2) and 3) are
>>>> invalid because those are template parameters that could presumably be
>>>> swapped out for SmallVector / SmallPtrSet anyway.
>>>> 
>>>> Cameron
>>>> 
>>>>> On Apr 6, 2015, at 2:00 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>>> 
>>>>> Unless I'm missing something, you can't use SetVector anyway, because
>>>>> it won't remove the non-determinism the sort is there to fix.
>>>>> 
>>>>> That non-determinism comes from the fact that DefBlocks is a
>>>>> SmallPtrSet, and even if the priority queue's order is stable, the
>>>>> only guarantee you get is that things at the same dom level end up
>>>>> around the same place in the priority queue (but not in a specific
>>>>> order).
>>>>> 
>>>>> 
>>>>> If DomTreeNodePair were a DomTreeNodeTriple (<BasicBlock, DomLevel,
>>>>> BBNumbers[BB]>), sorted by the second and then the third element of
>>>>> the tuple, DFBlocks could be a std::vector, you could use a SetVector,
>>>>> and you wouldn't need to sort anything, because the order would
>>>>> already be deterministic.
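>>>>> 
>>>>> Roughly like this (illustrative sketch, not tested; the comparator
>>>>> gives a max-heap, so the deepest level pops first, with the stable BB
>>>>> number as a tie-break instead of pointer order):
>>>>> 
>>>>>   #include "llvm/ADT/SmallVector.h"
>>>>>   #include "llvm/IR/BasicBlock.h"
>>>>>   #include <queue>
>>>>>   #include <tuple>
>>>>>   using namespace llvm;
>>>>> 
>>>>>   // <block, dominator-tree level, stable per-function BB number>
>>>>>   typedef std::tuple<BasicBlock *, unsigned, unsigned> DomTreeNodeTriple;
>>>>> 
>>>>>   struct TripleLess {
>>>>>     bool operator()(const DomTreeNodeTriple &LHS,
>>>>>                     const DomTreeNodeTriple &RHS) const {
>>>>>       // Order by level, then BB number; never by pointer value.
>>>>>       return std::tie(std::get<1>(LHS), std::get<2>(LHS)) <
>>>>>              std::tie(std::get<1>(RHS), std::get<2>(RHS));
>>>>>     }
>>>>>   };
>>>>> 
>>>>>   typedef std::priority_queue<DomTreeNodeTriple,
>>>>>                               SmallVector<DomTreeNodeTriple, 32>,
>>>>>                               TripleLess> IDFPriorityQueue;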
>>>>> 
>>>>> Whether any of this is worth it, no idea.
>>>>> 
>>>>> 
>>>>> 
>>>>> On Mon, Apr 6, 2015 at 1:56 PM, Cameron Zwarich <zwarich at apple.com>
>>>>> wrote:
>>>>>> I just thought it would make aesthetic sense to have less code in the
>>>>>> inner
>>>>>> loop, but that’s arbitrary. If people think the ‘a’ patch makes more
>>>>>> sense
>>>>>> and it’s equivalent perfwise, then I’m fine with that. Code is made for
>>>>>> reading, not writing. :P
>>>>>> 
>>>>>> I’m not sure SetVector makes sense here for a few reasons:
>>>>>> 
>>>>>> 1) The set and the vector store different types; the vector also
>>>>>> contains
>>>>>> the BB number.
>>>>>> 2) The vector in SetVector is a std::vector, not a SmallVector.
>>>>>> 3) The set in SetVector is a SmallSet of a fixed size of 16 (smaller
>>>>>> than
>>>>>> the 32 used here), and the ‘large’ case of SmallSet is a std::set,
>>>>>> which is
>>>>>> probably slower than SmallPtrSet’s ‘large’ case.
>>>>>> 
>>>>>> Cameron
>>>>>> 
>>>>>> On Apr 6, 2015, at 1:40 PM, Chandler Carruth <chandlerc at google.com>
>>>>>> wrote:
>>>>>> 
>>>>>> I'd like to understand why Cameron would prefer the 'b' patch to the
>>>>>> 'a'
>>>>>> patch. AFAICT, the 'b' patch doesn't save any actual memory (sadly).
>>>>>> 
>>>>>> Also, if we go with 'a', would it make sense to use a SetVector rather
>>>>>> than
>>>>>> a separate vector and set?
>>>>>> 
>>>>>> On Mon, Apr 6, 2015 at 1:36 PM Daniel Berlin <dberlin at dberlin.org>
>>>>>> wrote:
>>>>>>> 
>>>>>>> I'm running numbers on both approaches now, to see if there is any
>>>>>>> real difference in speed.
>>>>>>> 
>>>>>>> (Otherwise, I think the one with two visited worklists is easier to
>>>>>>> understand, unless someone else wants to disagree. :P)
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, Apr 6, 2015 at 11:54 AM, Quentin Colombet
>>>>>>> <qcolombet at apple.com>
>>>>>>> wrote:
>>>>>>>> Hi Cameron,
>>>>>>>> 
>>>>>>>> Sounds good.
>>>>>>>> 
>>>>>>>> Let us wait for the complete testing from Daniel before reviewing
>>>>>>>> the patch.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> -Quentin
>>>>>>>>> On Apr 6, 2015, at 11:27 AM, Cameron Zwarich <zwarich at apple.com>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> It was pointed out to me (without any specifics) that the iterated
>>>>>>>>> dominance frontier algorithm in PromoteMemoryToRegister.cpp has
>>>>>>>>> O(n^2) worst-case behavior.
>>>>>>>>> 
>>>>>>>>> I inspected the code and think I found the cause. The code uses a
>>>>>>>>> priority queue and a worklist, which share the same visited set, but
>>>>>>>>> the
>>>>>>>>> visited set is only updated when inserting into the priority queue.
>>>>>>>>> The
>>>>>>>>> original Sreedhar-Gao paper effectively has a second visited set
>>>>>>>>> (the InPhi
>>>>>>>>> flag) which is used for the priority queue, and the set called
>>>>>>>>> Visited is
>>>>>>>>> used for the recursive traversal that is done here with a worklist.
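>>>>>>>>> 
>>>>>>>>> In other words, there are two distinct “have we seen this block?”
>>>>>>>>> questions, so there should be two sets. The two insertion points in
>>>>>>>>> the loop end up looking roughly like this (illustrative names, and
>>>>>>>>> PQ / Worklist / SuccNode / SuccLevel come from the surrounding
>>>>>>>>> loop):
>>>>>>>>> 
>>>>>>>>>   SmallPtrSet<BasicBlock *, 32> InQueue; // ~ the paper’s InPhi:
>>>>>>>>>                                          // already pushed on the PQ
>>>>>>>>>   SmallPtrSet<BasicBlock *, 32> Visited; // already reached by the
>>>>>>>>>                                          // worklist traversal
>>>>>>>>> 
>>>>>>>>>   // When the traversal finds a dominance-frontier candidate SuccBB:
>>>>>>>>>   if (InQueue.insert(SuccBB).second)
>>>>>>>>>     PQ.push(std::make_pair(SuccNode, SuccLevel));
>>>>>>>>> 
>>>>>>>>>   // When the traversal considers pushing a block BB to walk later:
>>>>>>>>>   if (Visited.insert(BB).second)
>>>>>>>>>     Worklist.push_back(BB);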
>>>>>>>>> 
>>>>>>>>> I’ve attached two patches: one that just adds a second visited set,
>>>>>>>>> and another that leverages the fact that one of the visited sets is
>>>>>>>>> actually the IDF. I would prefer the latter if it performs as well
>>>>>>>>> as the first.
>>>>>>>>> 
>>>>>>>>> They both pass `make check`, but I’m not sure I’ll have time to give
>>>>>>>>> these patches the testing they deserve in the next few days. Daniel
>>>>>>>>> Berlin has offered to test them more thoroughly for me.
>>>>>>>>> 
>>>>>>>>> Note that there is still one difference from the paper. The paper
>>>>>>>>> uses a custom linked data structure instead of a priority queue,
>>>>>>>>> which takes advantage of the property that the level of all nodes
>>>>>>>>> being inserted is at most the current level. The code in LLVM uses
>>>>>>>>> a priority queue based on a binary heap. This means that the worst
>>>>>>>>> case is O(n log n), but I’d be surprised if the difference matters
>>>>>>>>> in practice.
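>>>>>>>>> 
>>>>>>>>> For the curious: the paper’s structure is essentially a bucket queue
>>>>>>>>> indexed by dominator-tree level, and because nothing is inserted
>>>>>>>>> above the current level, popping can just scan downward. A sketch
>>>>>>>>> (illustrative only, not part of either patch):
>>>>>>>>> 
>>>>>>>>>   #include "llvm/ADT/SmallVector.h"
>>>>>>>>>   #include "llvm/IR/Dominators.h"
>>>>>>>>>   #include <algorithm>
>>>>>>>>>   #include <vector>
>>>>>>>>>   using namespace llvm;
>>>>>>>>> 
>>>>>>>>>   // Buckets[L] holds the pending nodes at dominator-tree level L.
>>>>>>>>>   struct LevelBucketQueue {
>>>>>>>>>     std::vector<SmallVector<DomTreeNode *, 4>> Buckets;
>>>>>>>>>     int CurLevel = -1;
>>>>>>>>> 
>>>>>>>>>     void push(DomTreeNode *N, unsigned Level) {
>>>>>>>>>       if (Level >= Buckets.size())
>>>>>>>>>         Buckets.resize(Level + 1);
>>>>>>>>>       Buckets[Level].push_back(N);
>>>>>>>>>       CurLevel = std::max(CurLevel, (int)Level);
>>>>>>>>>     }
>>>>>>>>> 
>>>>>>>>>     DomTreeNode *pop() { // deepest level first; amortized O(1)
>>>>>>>>>       while (CurLevel >= 0 && Buckets[CurLevel].empty())
>>>>>>>>>         --CurLevel;
>>>>>>>>>       return CurLevel < 0 ? nullptr
>>>>>>>>>                           : Buckets[CurLevel].pop_back_val();
>>>>>>>>>     }
>>>>>>>>>   };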
>>>>>>>>> 
>>>>>>>>> Cameron
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> <idf-faster-a.patch><idf-faster-b.patch>




