[llvm] r314435 - [JumpThreading] Preserve DT and LVI across the pass

Brian M. Rzycki via llvm-commits llvm-commits at lists.llvm.org
Fri Oct 13 13:28:08 PDT 2017


Daniel, thank you for the explanation. I completely agree that if we can
batch everything, it's a net win. Where I have concerns is in the gritty
details of the function calls and how they interface with each other. While
waiting on your response I started to look at just one place where I would
need to do batch updates.

JumpThreadingPass::runImpl()
  llvm::removeUnreachableBlocks()
    static markAliveBlocks()
      llvm::changeToUnreachable()

In this example I start batching DT requests for removeUnreachableBlocks().
Because this function lives in Local.cpp, is a public interface, and
removes blocks, I decided this routine would isolate its own batch updates.
This works well until I get to the call to changeToUnreachable(). CTU
alters the state of the CFG, and I need to know about those changes for DT
updates. I now have the following options:

1. Change the public API and all of its call-sites.
2. Attempt to use optional parameters and conditionally batch DT updates.
3. Treat the function call as a synchronization event and force an
applyUpdates() before calling it.

For (1), this is error-prone and opens up lots of fun debug issues. I'd
prefer not to go down this path.
For (3), we lose the whole point of doing batch updates in the first place.
For (2), I have to add three new optional parameters: DT, a vector
reference, and a flag indicating batch mode. This will work, but it's
clunky.
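
To make option (2) concrete, here's a rough sketch of what the signature
change could look like. The extra parameters and the update type are my
guesses for illustration, not the existing Local.h interface:

  // Sketch of option (2): DT, an updates vector, and a batch-mode flag
  // are all optional so that existing call-sites keep compiling.
  void changeToUnreachable(Instruction *I, bool UseLLVMTrap,
                           DominatorTree *DT = nullptr,
                           SmallVectorImpl<DominatorTree::UpdateType>
                               *Updates = nullptr,
                           bool BatchUpdates = false) {
    BasicBlock *BB = I->getParent();
    for (BasicBlock *Succ : successors(BB)) {
      // ... the existing code removes BB's outgoing edges here ...
      if (DT && BatchUpdates && Updates)
        Updates->push_back({DominatorTree::Delete, BB, Succ}); // defer
      else if (DT)
        DT->deleteEdge(BB, Succ);                              // eager
    }
    // ... rest of the existing lowering to unreachable ...
  }

Every caller that wants batching then has to thread the extra arguments
through, which is the clunkiness I mean.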

Note that this isn't the only place I'll have to make this compromise.
Here's another that makes me a lot more nervous:
JumpThreadingPass::SplitBlockPredecessors()
  SplitBlockPredecessors()

In this case SplitBlockPredecessors() does have DT support, but through the
older API. It's defined in BasicBlockUtils.cpp. Again, I'd like to avoid
option (1), and doing option (2) means I have to port the DT handling
inside the routine to the new API as well as pass two optional parameters.
In this case (3) makes the most sense, but I have no idea at what
performance cost.
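
For illustration, option (3) at this call-site would look roughly like the
following. The suffix string and the shape of the Updates vector are
assumptions on my part:

  // Treat SplitBlockPredecessors() as a synchronization point: drain the
  // pending batch so DT is consistent before handing it to the utility.
  DT->applyUpdates(Updates);
  Updates.clear();
  BasicBlock *PredBB =
      SplitBlockPredecessors(BB, Preds, ".thread", DT); // older DT API

The worry is that every synchronization point like this shrinks the
batches and eats into whatever batching bought us.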

I am willing to do the work, but it's going to take time (and a bit of
guidance from others on how they want this interface to evolve).

There's another issue: every edge that is updated via batch has to be
checked to make sure it's not already in the list of updates. This means
that as I insert into Updates I have to do the find() != end() idiom for
every edge, which is O(n) per insertion. As the size of the batch list gets
larger, that time becomes less trivial.
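
In code, the per-edge check looks something like this sketch (the
UpdateBatch wrapper is mine, purely for illustration):

  // Every enqueue pays a linear scan over the pending updates, so
  // building a batch of n updates costs O(n^2) overall.
  struct UpdateBatch {
    SmallVector<DominatorTree::UpdateType, 16> Updates;
    void enqueue(const DominatorTree::UpdateType &U) {
      if (std::find(Updates.begin(), Updates.end(), U) == Updates.end())
        Updates.push_back(U); // only enqueue updates we haven't seen
    }
  };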

Reasons like these are why I was suggesting an alternative: have the
pending queue of requests live inside the DominatorTree class and use
methods on DT to enqueue or flush. It also has the benefit of being useful
to others outside of JumpThreading and Local that will eventually want to
use the new DT interface. Any call to a method other than the following
would automatically cause a flush of the pending queue:
deleteEdgePend()
insertEdgePend()
applyUpdatesPend()

It would also be nice if the pending inserts didn't care about duplicate
edge update requests. That would make the call sites much simpler and move
the work to a single scrub of the list just before processing it.
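
Something along these lines, all hypothetical: the Pend methods are the
ones proposed above, and scrubDuplicates() is a stand-in for whatever the
single cleanup pass would be:

  class DominatorTree {
    SmallVector<UpdateType, 16> Pending; // queued CFG edge updates

  public:
    void insertEdgePend(BasicBlock *From, BasicBlock *To) {
      Pending.push_back({Insert, From, To}); // duplicates are fine here
    }
    void deleteEdgePend(BasicBlock *From, BasicBlock *To) {
      Pending.push_back({Delete, From, To});
    }
    void applyUpdatesPend() {
      scrubDuplicates(Pending); // one scrub instead of a find() per edge
      applyUpdates(Pending);    // existing batch entry point
      Pending.clear();
    }
    // Every other query or mutation method would call applyUpdatesPend()
    // first, so callers can never observe a stale tree.
  };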

Guidance on how to proceed is greatly appreciated. Frankly, I've been
fighting dominators in JT for about 4 months now and I really just want to
move on to the more productive work of improving the code generation in
JumpThreading. :)

Thanks,
-Brian


On Fri, Oct 13, 2017 at 2:38 PM, Daniel Berlin <dberlin at dberlin.org> wrote:

>
>
> On Fri, Oct 13, 2017 at 10:48 AM, Brian M. Rzycki <brzycki at gmail.com>
> wrote:
>
>>> I *believe* the motivation is that they want DT up to date *in-pass* so
>>> they can use it to do better jump threading over loops.
>>
>> Daniel, you are correct in this assumption. Dominance is the first step
>> in enhancing the kinds of analysis and extend the number of Jumpthreading
>> opportunities. Analysis of loops is part of this work.
>>
>>> *that* kind of change, IMHO, should be subject to our usual "how much
>>> faster code does it make vs how much does it cost" tradeoff.
>>
>> I have not started the work on actual code optimizations simply because
>> I could not intelligently make the analysis decisions needed. Rather than
>> create a huge change to JT (or another pass), Sebastian and I decided to
>> incrementally update JT with the features we needed to get us to the end
>> goal.
>>
>>> This was *our* plan to remove the recalculates, but it does not appear
>>> to have been what Sebastian/etc want for JT.
>>
>> I'm a bit confused by this statement: that's what the patch is doing.
>>
>
> You are confused because I was talking about the recalculations that
> occur elsewhere.
>
> But let me try to diagram it for you.
>
> Here is the equivalent of what JT did before your patch:
>
>
> iterate
>  for each bb
>     maybe change an edge and some code
>  call some utilities
>  for each dead block
>     erase it
> DT->recalculate
>
> Here is what your patch would look like without the incremental API
> existing:
> iterate
>   for each bb
>    maybe change an edge and some code
>    for each edge we changed:
>      DT->recalculate
>   call some utilities, calling DT->recalculate in each one
>  for each dead block
>     erase it
>     DT->recalculate
>
> You can see that this has added O(number of changes) DT->recalculate
> calls compared to what happened before.
>
> That would be ridiculously slow. Probably 100x-1000x as slow as the first
> case.
>
> You then have changed that O(number of changes) set of calls to use the
> incremental API, so that it looks like this:
> iterate
>   for each bb
>    maybe change an edge and some code
>    for each edge we changed:
>      incremental update DT
>   call some utilities, calling incremental update DT in each one
>  for each dead block
>     erase it
>     incremental update DT
>
> These calls, unfortunately, are not O(1) and will never be O(1) (doing
> so is provably impossible).
> So it's not an O(N) cost you've added; it's N^2 or N^3 in the worst
> case.
>
>
>> There's only one corner-case where a recalculate occurs. The rest of the
>> time it's incremental updates.  There may still be opportunities to batch
>> updates but I honestly have no idea if it will be more efficient.
>>
>
> It will, because it will cause less updating in practice.
>
> If jump threading changes edge A->B to A->C (i.e., insert a new edge,
> remove the old edge) and then changes edge A->C to A->D, you will perform
> two updates.
> If those require full recalculation of the tree (internal to the updater),
> you will update the tree twice.
> Batching will perform one.
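
To make the batching win concrete with the A->B example: a sketch,
assuming the applyUpdates()-style interface discussed in this thread and
an updater that is free to cancel a matching insert/delete pair:

  // Unbatched: two separate incremental updates, each potentially costly.
  DT->deleteEdge(A, B);
  DT->insertEdge(A, C); // A->B becomes A->C
  DT->deleteEdge(A, C);
  DT->insertEdge(A, D); // A->C becomes A->D

  // Batched: the insert/delete of A->C cancels out, so the updater only
  // has to process {Delete A->B, Insert A->D} -- a single update.
  DT->applyUpdates({{DominatorTree::Delete, A, B},
                    {DominatorTree::Insert, A, C},
                    {DominatorTree::Delete, A, C},
                    {DominatorTree::Insert, A, D}});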

