[llvm] r314435 - [JumpThreading] Preserve DT and LVI across the pass

Jakub (Kuba) Kuderski via llvm-commits llvm-commits at lists.llvm.org
Mon Oct 16 08:31:45 PDT 2017


Hi Brian,


>  I was thinking about this over the weekend and I was thinking of a new
> class named DeferredDeviceTree or DeviceTreeDeferred. It would have the
> following methods:
> deleteEdge()
> insertEdge()
> applyUpdates(Updates)
> sync()
>

s/Device/Dominator? :P

My first observation would be that we don't really need deleteEdge and
insertEdge, as applyUpdates can already perform a non-batching update when
there is only one update enqueued -- I think that applyUpdates alone is
enough, and with an initializer list passed in it is not that verbose
(Sth.applyUpdates({{DT::Insert, BB_A, BB_B}})). Instead, it would be nice
to have some automatic way of handling the current IR API -- functions
like replaceAllUsesWith are cumbersome with the current CFG-level update
logic.
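
For illustration, here is the single-update case wrapped in a tiny helper
(a minimal sketch; redirectEdge is a hypothetical name, applyUpdates is
the only existing API used):

  #include "llvm/IR/BasicBlock.h"
  #include "llvm/IR/Dominators.h"
  using namespace llvm;

  // Retarget the edge From->OldSucc to From->NewSucc. Assumes the branch
  // in the IR has already been rewritten to match. With an initializer
  // list, the non-batching case is barely more verbose than dedicated
  // insertEdge() and deleteEdge() calls would be.
  static void redirectEdge(DominatorTree &DT, BasicBlock *From,
                           BasicBlock *OldSucc, BasicBlock *NewSucc) {
    DT.applyUpdates({{DominatorTree::Insert, From, NewSucc},
                     {DominatorTree::Delete, From, OldSucc}});
  }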

Then, I think it might also be possible to defer updates until someone
calls DT.dominates(...) or a similar function that needs DT to be up to
date, so maybe it would make sense to make the interface lazy all the way
and not expose the synchronization at all? Or only provide synchronization
for debugging.
The problem I see with this approach is that it is not immediately obvious
how to avoid storing invalid pointers to already-deleted basic blocks.
With your explicit synchronization, the burden of avoiding that is shifted
to the user side.
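
To make this concrete, a fully lazy wrapper might look roughly like the
sketch below (the class and its methods are hypothetical; it also
deliberately ignores the deleted-block problem just mentioned):

  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/ADT/SmallVector.h"
  #include "llvm/IR/Dominators.h"
  using namespace llvm;

  class LazyDomTree {
    DominatorTree &DT;
    SmallVector<DominatorTree::UpdateType, 8> Pending;

    // Apply everything queued so far in one batch.
    void flush() {
      if (Pending.empty())
        return;
      DT.applyUpdates(Pending);
      Pending.clear();
    }

  public:
    explicit LazyDomTree(DominatorTree &DT) : DT(DT) {}

    // Enqueue only; the tree is not touched here.
    void applyUpdates(ArrayRef<DominatorTree::UpdateType> Updates) {
      Pending.append(Updates.begin(), Updates.end());
    }

    // Every query synchronizes first, so users never observe a stale
    // tree. Note that Pending may still reference blocks deleted since
    // they were enqueued -- exactly the hazard described above.
    bool dominates(const BasicBlock *A, const BasicBlock *B) {
      flush();
      return DT.dominates(A, B);
    }
  };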

On Mon, Oct 16, 2017 at 10:48 AM, Brian M. Rzycki <brzycki at gmail.com> wrote:

> Sounds good. I want to make sure you feel like you can make forward
>> progress as quickly as possible (I know starting to fix all this has been a
>> drag), so if you want to do it earlier, let me know and we'll figure
>> something out :)
>
> Thanks Daniel for all the help. If it needs more work to be done right
> then I'll get on it. I want to make sure we get it right. :)
>
>  I was thinking about this over the weekend and I was thinking of a new
> class named DeferredDeviceTree or DeviceTreeDeferred. It would have the
> following methods:
> deleteEdge()
> insertEdge()
> applyUpdates(Updates)
> sync()
>
> Does this model align with how you were envisioning this new class? Once
> this is defined I can pass this around and flush on demand.
>
> Thanks,
> -Brian
>
> On Fri, Oct 13, 2017 at 3:51 PM, Daniel Berlin <dberlin at dberlin.org>
> wrote:
>
>> Sounds good. I want to make sure you feel like you can make forward
>> progress as quickly as possible (I know starting to fix all this has been a
>> drag), so if you want to do it earlier, let me know and we'll figure
>> something out :)
>>
>>
>> On Fri, Oct 13, 2017 at 1:44 PM, Brian M. Rzycki <brzycki at gmail.com>
>> wrote:
>>
>>> Ah gotcha, that makes sense.  As for the basic updater I would
>>> appreciate seeing your prototype. We can discuss this in person at the LLVM
>>> conference too.
>>>
>>> Thanks Daniel.
>>> -Brian
>>>
>>> On Fri, Oct 13, 2017 at 3:37 PM, Daniel Berlin <dberlin at dberlin.org>
>>> wrote:
>>>
>>>> The problem with adding it directly to DT is that then you have to
>>>> update PDT and DT separately with different calls for no reason.
>>>>
>>>> I think you would be better off creating an updater class that contains
>>>> DT and the vector and passing it around.
>>>> Then we can just add PDT to the updater object and everything will
>>>> update PDT as well with no extra work.
>>>> I'll send you a basic updater object patch if you like
>>>>
>>>> On Fri, Oct 13, 2017, 4:28 PM Brian M. Rzycki <brzycki at gmail.com>
>>>> wrote:
>>>>
>>>>> Daniel, thank you for the explanation. I completely agree that if we can
>>>>> batch everything then it's a net win. Where I have concerns is in the
>>>>> gritty details of the function calls and how they interface with each
>>>>> other. While waiting on your response I started to look at just one place
>>>>> where I would need to do batch updates.
>>>>>
>>>>> JumpThreadingPass::runImpl()
>>>>>   llvm::removeUnreachableBlocks()
>>>>>     static markAliveBlocks()
>>>>>       llvm::changeToUnreachable()
>>>>>
>>>>> In this example I start batching DT requests for
>>>>> removeUnreachableBlocks. Because this function lives in Local.cpp, is a
>>>>> public interface, and removes blocks, I decided this routine would isolate
>>>>> its own batch updates. This works well until I get to the call to
>>>>> changeToUnreachable(). CTU alters the state of the CFG and I need to know
>>>>> the changes for DT updates. I now have the following options:
>>>>>
>>>>> 1. change the public API and all its call-sites
>>>>> 2. attempt to use optional parameters and conditionally batch DT
>>>>> updates
>>>>> 3. view this function call as a synchronization event and force
>>>>> applyUpdates before calling.
>>>>>
>>>>> For (1) this is error-prone and opens up lots of fun debug issues. I'd
>>>>> prefer to not go down this path.
>>>>> For (3) we lose the whole point of doing batch updates in the first
>>>>> place.
>>>>> For (2) I have to add three new optional variables: DT, a Vector
>>>>> reference, and a flag indicating batch mode.  This will work but it's
>>>>> clunky.
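
As a sketch, option (2) means growing the utility signature along these
lines (the extra parameters are hypothetical; the actual
changeToUnreachable interface differs):

  // With PendingUpdates non-null, the callee only appends the edges it
  // changed (batch mode); with only DT non-null, it updates the tree
  // immediately; with neither, behavior is unchanged.
  void changeToUnreachable(Instruction *I, bool UseLLVMTrap,
                           DominatorTree *DT = nullptr,
                           SmallVectorImpl<DominatorTree::UpdateType>
                               *PendingUpdates = nullptr);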
>>>>>
>>>>> Note that this isn't the only place I'll have to make this
>>>>> compromise.  Here's another that makes me a lot more nervous:
>>>>> JumpThreadingPass::SplitBlockPredecessors()
>>>>>   SplitBlockPredecessors()
>>>>>
>>>>> In this case SplitBlockPredecessors() does have DT support using the
>>>>> older API. It's defined in BasicBlockUtils.cpp. Again, I'd like to avoid
>>>>> option (1), and doing option (2) means I have to update the DT inside the
>>>>> routine to the new API as well as pass 2 optional parameters. In this case
>>>>> (3) makes the most sense, but I have no idea what the performance cost would be.
>>>>>
>>>>> I am willing to do the work but it's going to take time (and a bit of
>>>>> guidance from others on how they want this interface to evolve).
>>>>>
>>>>> There's another issue: every edge that is updated via batch has to be
>>>>> checked to make sure it's not already in the list of updates. This means as
>>>>> I insert into Updates I have to apply the find() != end() idiom for every
>>>>> edge. That check is O(n) per insertion, so as the batch list grows the
>>>>> cost becomes less trivial.
>>>>>
>>>>> Reasons like these are why I was suggesting an alternative: have the
>>>>> pending queue of requests live inside the DominatorTree class and use
>>>>> methods on DT to enqueue or flush. It also has the benefit of being useful
>>>>> to others outside of JumpThreading and Local that will eventually want to
>>>>> use the new DT interface. Any call to methods other than the following
>>>>> automatically causes a flush of the pending queue:
>>>>> deleteEdgePend()
>>>>> insertEdgePend()
>>>>> applyUpdatesPend()
>>>>>
>>>>> It would also be nice if the pending inserts didn't care about
>>>>> duplicate edge update requests. That would make the call site much simpler
>>>>> and move the work to a single scrub of the list just before processing it.
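
That single scrub could be as simple as the following sketch (applyDeduped
is a made-up helper, and it assumes UpdateType exposes
getFrom()/getTo()/getKind()):

  #include <utility>
  #include "llvm/ADT/ArrayRef.h"
  #include "llvm/ADT/DenseSet.h"
  #include "llvm/ADT/SmallVector.h"
  #include "llvm/IR/Dominators.h"
  using namespace llvm;

  // Enqueue duplicates freely while threading; drop them in one linear
  // pass just before handing the batch to the tree.
  static void applyDeduped(DominatorTree &DT,
                           ArrayRef<DominatorTree::UpdateType> Updates) {
    SmallVector<DominatorTree::UpdateType, 8> Unique;
    SmallDenseSet<std::pair<std::pair<BasicBlock *, BasicBlock *>, unsigned>,
                  8> Seen;
    for (const auto &U : Updates)
      if (Seen.insert({{U.getFrom(), U.getTo()}, unsigned(U.getKind())})
              .second)
        Unique.push_back(U);
    DT.applyUpdates(Unique); // one O(n) scrub instead of O(n) per insert
  }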
>>>>>
>>>>> Guidance on how to proceed is greatly appreciated. Frankly I've been
>>>>> fighting dominators in JT now for about 4 months and I really just want to
>>>>> move onto the more productive work of improving the code generation in
>>>>> JumpThreading. :)
>>>>>
>>>>> Thanks,
>>>>> -Brian
>>>>>
>>>>>
>>>>> On Fri, Oct 13, 2017 at 2:38 PM, Daniel Berlin <dberlin at dberlin.org>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 13, 2017 at 10:48 AM, Brian M. Rzycki <brzycki at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I *believe* the motivation is that they want DT up to date *in-pass*
>>>>>>>> so they can use it to do better jump threading over loops.
>>>>>>>
>>>>>>> Daniel, you are correct in this assumption. Dominance is the first
>>>>>>> step in enhancing the kinds of analysis we can do and extending the
>>>>>>> number of JumpThreading opportunities. Analysis of loops is part of
>>>>>>> this work.
>>>>>>>
>>>>>>> *that* kind of change, IMHO, should be subject to our usual "how
>>>>>>>> much faster code does it make vs how much does it cost" tradeoff.
>>>>>>>
>>>>>>> I have not started the work on actual code optimizations simply
>>>>>>> because I could not intelligently make the analysis decisions needed.
>>>>>>> Rather than create a huge change to JT (or another pass) Sebastian and I
>>>>>>> decided to incrementally update JT with the features we needed to get us to
>>>>>>> the end goal.
>>>>>>>
>>>>>>> This was *our* plan to remove the recalculates, but it does not
>>>>>>>> appear to have been what Sebastian/etc. want for JT.
>>>>>>>
>>>>>>> I'm a bit confused by this statement: that's what the patch is
>>>>>>> doing.
>>>>>>>
>>>>>>
>>>>>> You are confused because I was talking about the recalculations that
>>>>>> occur elsewhere.
>>>>>>
>>>>>> But let me try to diagram it for you.
>>>>>>
>>>>>> Here is the equivalent of what JT did before your patch
>>>>>>
>>>>>>
>>>>>> iterate
>>>>>>   for each bb
>>>>>>     maybe change an edge and some code
>>>>>>   call some utilities
>>>>>>   for each dead block
>>>>>>     erase it
>>>>>> DT->recalculate
>>>>>>
>>>>>> Here is what your patch would look like without the incremental API
>>>>>> existing:
>>>>>> iterate
>>>>>>   for each bb
>>>>>>     maybe change an edge and some code
>>>>>>     for each edge we changed:
>>>>>>       DT->recalculate
>>>>>>   call some utilities, calling DT->recalculate in each one
>>>>>>   for each dead block
>>>>>>     erase it
>>>>>>     DT->recalculate
>>>>>>
>>>>>> You can see that this has added O(number of changes)
>>>>>> DT->recalculate calls compared to what happened before.
>>>>>>
>>>>>> That would be ridiculously slow. Probably 100x-1000x as slow as the
>>>>>> first case.
>>>>>>
>>>>>> You then have changed that O(number of changes) set of calls to use
>>>>>> the incremental API, so that it looks like this:
>>>>>> iterate
>>>>>>   for each bb
>>>>>>     maybe change an edge and some code
>>>>>>     for each edge we changed:
>>>>>>       incremental update DT
>>>>>>   call some utilities, calling incremental update DT in each one
>>>>>>   for each dead block
>>>>>>     erase it
>>>>>>     incremental update DT
>>>>>>
>>>>>> These calls, unfortunately, are not O(1) and will never be O(1)
>>>>>> (making them O(1) is provably impossible).
>>>>>> So it's not an O(N) cost you've added; it's O(N^2) or O(N^3) in the
>>>>>> worst case.
>>>>>>
>>>>>>
>>>>>> There's only one corner-case where a recalculate occurs. The rest of
>>>>>>> the time it's incremental updates.  There may still be opportunities to
>>>>>>> batch updates but I honestly have no idea if it will be more efficient.
>>>>>>>
>>>>>>
>>>>>> It will, because it will cause less updating in practice.
>>>>>>
>>>>>> If jump threading changes edge A->B to A->C (i.e., insert the new edge,
>>>>>> remove the old edge) and then changes edge A->C to A->D, you will perform two
>>>>>> updates.
>>>>>> If those require full recalculation of the tree (internal to the
>>>>>> updater), you will update the tree twice.
>>>>>> Batching will perform one.
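
In code, the batched version of that sequence is a single call (a sketch;
it assumes the batch updater legalizes away the transient A->C pair):

  // Both rewrites queued, one tree update. By the time the batch is
  // applied the CFG only contains A->D, so the intermediate A->C
  // insert/delete pair cancels out during legalization.
  SmallVector<DominatorTree::UpdateType, 4> Updates;
  Updates.push_back({DominatorTree::Insert, A, C}); // A->B becomes A->C
  Updates.push_back({DominatorTree::Delete, A, B});
  Updates.push_back({DominatorTree::Insert, A, D}); // A->C becomes A->D
  Updates.push_back({DominatorTree::Delete, A, C});
  DT.applyUpdates(Updates);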
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>>
>


-- 
Jakub Kuderski

