[llvm] r314435 - [JumpThreading] Preserve DT and LVI across the pass

Daniel Berlin via llvm-commits llvm-commits at lists.llvm.org
Fri Oct 13 20:10:40 PDT 2017


(And just for future info, in case anyone stares at this thread: because
of which dominator tree properties each can invalidate, edge insertions
are generally an order of magnitude faster to handle than edge deletions.
I don't think you'd have this issue at all if the updates were insertions.)


On Fri, Oct 13, 2017 at 1:51 PM, Daniel Berlin <dberlin at dberlin.org> wrote:

> Sounds good. I want to make sure you feel like you can make forward
> progress as quickly as possible (I know starting to fix all this has been a
> drag), so if you want to do it earlier, let me know and we'll figure
> something out :)
>
>
> On Fri, Oct 13, 2017 at 1:44 PM, Brian M. Rzycki <brzycki at gmail.com>
> wrote:
>
>> Ah, gotcha, that makes sense. As for the basic updater, I would
>> appreciate seeing your prototype. We can discuss this in person at the
>> LLVM conference too.
>>
>> Thanks Daniel.
>> -Brian
>>
>> On Fri, Oct 13, 2017 at 3:37 PM, Daniel Berlin <dberlin at dberlin.org>
>> wrote:
>>
>>> The problem with adding it directly to DT is that you then have to
>>> update PDT and DT separately, with different calls, for no reason.
>>>
>>> I think you would be better off creating an updater class that
>>> contains DT and the vector, and passing that around.
>>> Then we can just add PDT to the updater object, and everything will
>>> update PDT as well with no extra work.
>>> I'll send you a basic updater object patch if you like.
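>>>
>>> Very roughly, the shape I have in mind (a sketch only; the names here
>>> are placeholders, not an existing API):
>>>
>>> // Hypothetical updater object: it owns the pending updates and knows
>>> // which trees to flush them into. Needs llvm/IR/Dominators.h and
>>> // llvm/Analysis/PostDominators.h.
>>> struct DomTreeBatchUpdater {
>>>   DominatorTree *DT = nullptr;      // null if no domtree to maintain
>>>   PostDominatorTree *PDT = nullptr; // added later, no caller churn
>>>   SmallVector<DominatorTree::UpdateType, 8> Pending;
>>>
>>>   void insertEdge(BasicBlock *From, BasicBlock *To) {
>>>     Pending.push_back({DominatorTree::Insert, From, To});
>>>   }
>>>   void deleteEdge(BasicBlock *From, BasicBlock *To) {
>>>     Pending.push_back({DominatorTree::Delete, From, To});
>>>   }
>>>   void flush() {
>>>     if (DT)
>>>       DT->applyUpdates(Pending);
>>>     if (PDT)
>>>       PDT->applyUpdates(Pending);
>>>     Pending.clear();
>>>   }
>>> };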
>>>
>>> On Fri, Oct 13, 2017, 4:28 PM Brian M. Rzycki <brzycki at gmail.com> wrote:
>>>
>>>> Daniel, thank you for the explanation. I completely agree that if we
>>>> can batch everything, it's a net win. Where I have concerns is in the
>>>> gritty details of the function calls and how they interface with each
>>>> other. While waiting on your response I started to look at just one
>>>> place where I would need to do batch updates:
>>>>
>>>> JumpThreadingPass::runImpl()
>>>>   llvm::removeUnreachableBlocks()
>>>>     static markAliveBlocks()
>>>>       llvm::changeToUnreachable()
>>>>
>>>> In this example I start batching DT requests in
>>>> removeUnreachableBlocks(). Because this function lives in Local.cpp,
>>>> is a public interface, and removes blocks, I decided this routine
>>>> should isolate its own batch updates. This works well until I get to
>>>> the call to changeToUnreachable(). CTU alters the state of the CFG,
>>>> and I need to know the changes for DT updates. I now have the
>>>> following options:
>>>>
>>>> 1. Change the public API and all its call sites.
>>>> 2. Attempt to use optional parameters and conditionally batch DT
>>>> updates.
>>>> 3. Treat this function call as a synchronization event and force
>>>> applyUpdates() before calling.
>>>>
>>>> For (1), this is error-prone and opens up lots of fun debug issues.
>>>> I'd prefer not to go down this path.
>>>> For (3), we lose the whole point of doing batch updates in the first
>>>> place.
>>>> For (2), I have to add three new optional parameters: DT, a vector
>>>> reference, and a flag indicating batch mode. This will work, but it's
>>>> clunky; see the sketch below.
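>>>>
>>>> A sketch of what (2) would look like for changeToUnreachable(); the
>>>> trailing parameters are the hypothetical additions, defaulted so that
>>>> existing callers keep compiling (this is not the current signature):
>>>>
>>>> void changeToUnreachable(Instruction *I, bool UseLLVMTrap,
>>>>                          DominatorTree *DT = nullptr,
>>>>                          SmallVectorImpl<DominatorTree::UpdateType>
>>>>                              *Updates = nullptr,
>>>>                          bool BatchMode = false) {
>>>>   BasicBlock *BB = I->getParent();
>>>>   // Record the successor edges we are about to sever, either into
>>>>   // the caller's batch list or directly into DT.
>>>>   for (BasicBlock *Succ : successors(BB)) {
>>>>     if (BatchMode && Updates)
>>>>       Updates->push_back({DominatorTree::Delete, BB, Succ});
>>>>     else if (DT)
>>>>       DT->deleteEdge(BB, Succ);
>>>>   }
>>>>   // ... existing logic that strips instructions after I and
>>>>   // replaces the terminator with an unreachable ...
>>>> }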
>>>>
>>>> Note that this isn't the only place I'll have to make this
>>>> compromise. Here's another that makes me a lot more nervous:
>>>> JumpThreadingPass::SplitBlockPredecessors()
>>>>   SplitBlockPredecessors()
>>>>
>>>> In this case SplitBlockPredecessors() does have DT support, using
>>>> the older API. It's defined in BasicBlockUtils.cpp. Again, I'd like
>>>> to avoid option (1), and doing option (2) means I have to port the DT
>>>> handling inside the routine to the new API as well as pass two
>>>> optional parameters. In this case (3) makes the most sense, but I
>>>> have no idea at what performance cost.
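>>>>
>>>> In code, option (3) amounts to something like this at the call site
>>>> (a sketch; the forced flush before the call is the part that matters,
>>>> and the suffix string is just illustrative):
>>>>
>>>>   // Synchronization point: drain the batch so DT is consistent
>>>>   // before a utility updates it eagerly through the older API.
>>>>   DT->applyUpdates(Updates);
>>>>   Updates.clear();
>>>>   SplitBlockPredecessors(BB, Preds, ".split", DT);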
>>>>
>>>> I am willing to do the work, but it's going to take time (and a bit
>>>> of guidance from others on how they want this interface to evolve).
>>>>
>>>> There's another issue: every edge that is updated via batch has to
>>>> be checked to make sure it's not already in the list of updates. This
>>>> means that as I insert into Updates I have to do the find() != end()
>>>> idiom for every edge, which is O(n) per insertion. As the size of the
>>>> batch list grows, that time becomes less trivial.
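>>>>
>>>> Concretely, every enqueue currently has to look something like this
>>>> (a sketch of the idiom in question):
>>>>
>>>>   DominatorTree::UpdateType U = {DominatorTree::Insert, From, To};
>>>>   // Linear scan of the batch on every enqueue: O(n) per edge, and
>>>>   // O(n^2) over the whole batch.
>>>>   if (std::find(Updates.begin(), Updates.end(), U) == Updates.end())
>>>>     Updates.push_back(U);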
>>>>
>>>> Reasons like these are why I was suggesting an alternative: have the
>>>> pending queue of requests live inside the DominatorTree class and use
>>>> methods on DT to enqueue or flush. It also has the benefit of being
>>>> useful to others outside of JumpThreading and Local that will
>>>> eventually want to use the new DT interface. Any call to methods
>>>> other than the following would automatically cause a flush of the
>>>> pending queue:
>>>> deleteEdgePend()
>>>> insertEdgePend()
>>>> applyUpdatesPend()
>>>>
>>>> It would also be nice if the pending inserts didn't care about
>>>> duplicate edge update requests. That would make the call site much simpler
>>>> and move the work to a single scrub of the list just before processing it.
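>>>>
>>>> To make that concrete, here is the rough shape I'm imagining (all of
>>>> this is a proposal; none of these members exist today):
>>>>
>>>> class DominatorTree {
>>>>   // ... existing members elided ...
>>>>   // Queued edge updates that have not yet been applied to the tree.
>>>>   SmallVector<UpdateType, 16> Pending;
>>>>
>>>> public:
>>>>   void insertEdgePend(BasicBlock *From, BasicBlock *To) {
>>>>     Pending.push_back({Insert, From, To}); // duplicates are allowed
>>>>   }
>>>>   void deleteEdgePend(BasicBlock *From, BasicBlock *To) {
>>>>     Pending.push_back({Delete, From, To});
>>>>   }
>>>>   void applyUpdatesPend() {
>>>>     // Single scrub here: drop duplicate and cancelling requests,
>>>>     // then hand the cleaned batch to the incremental updater.
>>>>     applyUpdates(Pending);
>>>>     Pending.clear();
>>>>   }
>>>>   // Every other query or mutation method calls applyUpdatesPend()
>>>>   // first, so clients never observe a stale tree.
>>>> };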
>>>>
>>>> Guidance on how to proceed is greatly appreciated. Frankly, I've
>>>> been fighting dominators in JT for about four months now, and I
>>>> really just want to move on to the more productive work of improving
>>>> the code generation in JumpThreading. :)
>>>>
>>>> Thanks,
>>>> -Brian
>>>>
>>>>
>>>> On Fri, Oct 13, 2017 at 2:38 PM, Daniel Berlin <dberlin at dberlin.org>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Fri, Oct 13, 2017 at 10:48 AM, Brian M. Rzycki <brzycki at gmail.com>
>>>>> wrote:
>>>>>
>>>>>>> I *believe* the motivation is that they want DT up to date
>>>>>>> *in-pass* so they can use it to do better jump threading over
>>>>>>> loops.
>>>>>>
>>>>>> Daniel, you are correct in this assumption. Dominance is the first
>>>>>> step in enhancing the kinds of analysis we can do and extending the
>>>>>> number of JumpThreading opportunities. Analysis of loops is part of
>>>>>> this work.
>>>>>>
>>>>>>> *that* kind of change, IMHO, should be subject to our usual "how
>>>>>>> much faster code does it make vs how much does it cost" tradeoff.
>>>>>>
>>>>>> I have not started the work on actual code optimizations simply
>>>>>> because I could not intelligently make the analysis decisions
>>>>>> needed. Rather than create a huge change to JT (or another pass),
>>>>>> Sebastian and I decided to incrementally update JT with the
>>>>>> features we needed to get us to the end goal.
>>>>>>
>>>>>>> This was *our* plan to remove the recalculates, but it does not
>>>>>>> appear to be what Sebastian/etc. want for JT.
>>>>>>
>>>>>> I'm a bit confused by this statement: that's what the patch is doing.
>>>>>>
>>>>>
>>>>> You are confused because I was talking about the recalculations
>>>>> that occur elsewhere.
>>>>>
>>>>> But let me try to diagram it for you.
>>>>>
>>>>> Here is the equivalent of what JT did before your patch:
>>>>>
>>>>>
>>>>> iterate:
>>>>>   for each bb:
>>>>>     maybe change an edge and some code
>>>>>   call some utilities
>>>>>   for each dead block:
>>>>>     erase it
>>>>> DT->recalculate
>>>>>
>>>>> Here is what your patch would look like without the incremental API
>>>>> existing:
>>>>>
>>>>> iterate:
>>>>>   for each bb:
>>>>>     maybe change an edge and some code
>>>>>     for each edge we changed:
>>>>>       DT->recalculate
>>>>>   call some utilities, calling DT->recalculate in each one
>>>>>   for each dead block:
>>>>>     erase it
>>>>>     DT->recalculate
>>>>>
>>>>> You can see that this has added O(number of changes)
>>>>> DT->recalculate calls compared to what happened before.
>>>>>
>>>>> That would be ridiculously slow, probably 100x-1000x as slow as the
>>>>> first case.
>>>>>
>>>>> You then changed that O(number of changes) set of calls to use the
>>>>> incremental API, so that it looks like this:
>>>>>
>>>>> iterate:
>>>>>   for each bb:
>>>>>     maybe change an edge and some code
>>>>>     for each edge we changed:
>>>>>       incremental update DT
>>>>>   call some utilities, calling incremental update DT in each one
>>>>>   for each dead block:
>>>>>     erase it
>>>>>     incremental update DT
>>>>>
>>>>> These calls, unfortunately, are not O(1) and will never be O(1)
>>>>> (doing so is provably impossible). So it's not an O(N) cost you've
>>>>> added; it's O(N^2) or O(N^3) in the worst case.
>>>>>
>>>>>
>>>>>> There's only one corner case where a recalculate occurs. The rest
>>>>>> of the time it's incremental updates. There may still be
>>>>>> opportunities to batch updates, but I honestly have no idea if it
>>>>>> will be more efficient.
>>>>>>
>>>>>
>>>>> It will, because it will cause less updating in practice.
>>>>>
>>>>> If jump threading changes edge A->B to A->C (i.e., inserts the new
>>>>> edge and removes the old one) and then changes edge A->C to A->D,
>>>>> you will perform two updates.
>>>>> If those require full recalculation of the tree (internal to the
>>>>> updater), you will update the tree twice.
>>>>> Batching will perform one.
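>>>>>
>>>>> In the batched form, the whole A->B to A->C to A->D dance becomes a
>>>>> single call (a sketch; A, B, C, D are the blocks from the example
>>>>> above):
>>>>>
>>>>>   SmallVector<DominatorTree::UpdateType, 4> Updates;
>>>>>   Updates.push_back({DominatorTree::Insert, A, C});
>>>>>   Updates.push_back({DominatorTree::Delete, A, B});
>>>>>   Updates.push_back({DominatorTree::Insert, A, D});
>>>>>   Updates.push_back({DominatorTree::Delete, A, C});
>>>>>   // The A->C insert/delete pair cancels out when the batch is
>>>>>   // legalized, so the updater processes the affected region once
>>>>>   // instead of twice.
>>>>>   DT->applyUpdates(Updates);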
>>>>>