[PATCH] D48967: [Dominators] Convert existing passes and utils to use the DomTreeUpdater class

Tue Jul 10 10:47:21 PDT 2018

brzycki added inline comments.

================
Comment at: lib/Transforms/Scalar/JumpThreading.cpp:420
   LoopHeaders.clear();
-  DDT->flush();
+  DTU->getDomTree();
   LVI->enableDT();
----------------
NutshellySima wrote:
> NutshellySima wrote:
> > brzycki wrote:
> > > brzycki wrote:
> > > > NutshellySima wrote:
> > > > > NutshellySima wrote:
> > > > > > brzycki wrote:
> > > > > > > NutshellySima wrote:
> > > > > > > > brzycki wrote:
> > > > > > > > > NutshellySima wrote:
> > > > > > > > > > brzycki wrote:
> > > > > > > > > > > It's not obvious here the reasoning behind this statement. I recommend either changing this to `DTU->flush()` or adding a comment stating the call will force all updates to the internal DT/PDT trees.
> > > > > > > > > > I think `DTU->flush()` should only be called when people need either
> > > > > > > > > > 1. Flush all deleteBB awaiting deletion.
> > > > > > > > > > 2. Do queries on both DT and PDT.
> > > > > > > > > > 
> > > > > > > > > > In the future, if DTU holds both DT and PDT in JumpThreading, call `DTU->getDomTree` which only flushes DT can pend updates on PDT as lazy as possible.
> > > > > > > > > I'm concerned the complexity of updates will become unmanageable if you decide to take this route.  Some questions I have:
> > > > > > > > > * How do you preserve the list of updates for one vs the other? Two separate queues?
> > > > > > > > > * How do you decide to delete unnecessary updates from one and not the other?
> > > > > > > > > * How do you manage the list of delted BBs? Are they only really deleted when both DT and PDT are flushed?
> > > > > > > > > * How do you make sure you don't actually delete a BB early?
> > > > > > > > > * How do you handle `Eager` vs `Lazy` updates?
> > > > > > > > > * How do you handle `*Relaxed()` updates?
> > > > > > > > > 
> > > > > > > > > The longer you avoid flushing the further your updates become from the real CFG's shape. Debugging an assert where DT and PDT update at different times means the developer has to also carefully keep the state of DTU in mind instead of just the IR and the trees. I'm not sure if the possible saved overhead of updates to one or the other tree is worth all that added complexity.
> > > > > > > > > How do you preserve the list of updates for one vs the other? Two separate queues?
> > > > > > > > There is only one queue and two index to record the updates that have been applied and not applied yet.
> > > > > > > > 
> > > > > > > > | Index | 0 |  1 |  2 | 3
> > > > > > > > | PendUpdates | o | o | o | o 
> > > > > > > > |DTUpdateIndex | | | ↑ | |
> > > > > > > > | PDTUpdateIndex |↑ | | | |
> > > > > > > > 
> > > > > > > > > How do you decide to delete unnecessary updates from one and not the other?
> > > > > > > > If an update is applied by all available trees, it will be purged.
> > > > > > > > 
> > > > > > > > > How do you manage the list of deleted BBs? Are they only really deleted when both DT and PDT are flushed?
> > > > > > > > Yes, more specifically `DTUpdateIndex == PDTUpdateIndex == PendUpdates.size()`. It's impossible to judge whether a specific DelBB is feasible to be deleted at a specific time point because some callers of delBB just
> > > > > > > > ```
> > > > > > > > BasicBlock *BB0 = ...;
> > > > > > > > DTU.delBB (BB0);
> > > > > > > > // Then do some actions on BB0
> > > > > > > > ... ...
> > > > > > > > DTU.flush(); // The caller expects BB0 be deleted now.
> > > > > > > > ```
> > > > > > > > > How do you make sure you don't actually delete a BB early?
> > > > > > > > Mentioned above.
> > > > > > > > > How do you handle `Eager` vs `Lazy` updates?
> > > > > > > > `applyUpdates`: 
> > > > > > > > `Lazy/Eager+ForceRemoveDuplicates` - Works the same as DDT, more specifically
> > > > > > > > 1. Deduplicate, which is the main functionality of DDT. 
> > > > > > > > 2. Inspect whether an edge is presented in CFG and decide whether to discard it. It is a must because the update sequence is unbalanced.
> > > > > > > > `Eager` - Works the same as raw DT/PDT.
> > > > > > > > `insertEdge/deleteEdge`:
> > > > > > > > `Eager`: Works the same as raw DT/PDT and adds inspection whether the update is valid.
> > > > > > > > `Lazy`: then submits into the list of updates rather than apply it.
> > > > > > > > `recalculate`:
> > > > > > > > `Lazy`: `flush()` + raw `recalculate()`
> > > > > > > > `Eager`: raw `recalculate()`
> > > > > > > > > How do you handle `*Relaxed()` updates?
> > > > > > > > It behaves the same as the ones in DDT. It discards invalid updates quietly.
> > > > > > > > 
> > > > > > > > > The longer you avoid flushing the further your updates become from the real CFG's shape. Debugging an assert where DT and PDT update at different times means the developer has to also carefully keep the state of DTU in mind instead of just the IR and the trees. I'm not sure if the possible saved overhead of updates to one or the other tree is worth all that added complexity.
> > > > > > > > 
> > > > > > > > I don't think so. With current DDT, let me show an example
> > > > > > > > ```
> > > > > > > > // Changing the state of CFG begins. 
> > > > > > > > ... // Modify CFG
> > > > > > > > MergeBasicBlockIntoOnlyPred (..., &DDT); // Some error happens !
> > > > > > > > // (You cannot do verification after MergeBasicBlockIntoOnlyPred because the CFG and DT is unsynced)
> > > > > > > > ... // Modify DomTree
> > > > > > > > ... // Modify CFG
> > > > > > > > // Changing the state of CFG ends.
> > > > > > > > ...
> > > > > > > > DDT.flush().verify(); // Asserts !
> > > > > > > > ```
> > > > > > > > Now with DTU,
> > > > > > > > ```
> > > > > > > > // Changing the state of CFG begins.
> > > > > > > > ... // Modify CFG
> > > > > > > > MergeBasicBlockIntoOnlyPred (..., &DTU); // Some error happens !
> > > > > > > > //  (You cannot do verification after MergeBasicBlockIntoOnlyPred because the CFG and DT is unsynced)
> > > > > > > > ... // Modify DomTree
> > > > > > > > ... // Modify CFG
> > > > > > > > // Changing the state of CFG ends.
> > > > > > > > ...
> > > > > > > > DTU.getDomTree().verify(); // Asserts !
> > > > > > > > ```
> > > > > > > > I can't see the difference. They'll assert at the same location and all far away from where the bug happens.
> > > > > > > Let's say you have a `DTU` using the `Lazy` strategy. The current set of internally queued pending updates are the following:
> > > > > > > ```
> > > > > > > [0] {Insert, A, B}
> > > > > > > [1] {Delete, B, C}
> > > > > > > [2] {Insert, B, D}
> > > > > > > ```
> > > > > > > 
> > > > > > > And you have both a DT and PDT. Their Indicies are the following:
> > > > > > > ```
> > > > > > > DTU->IndexDT = 0;
> > > > > > > DTU->IndexPDT = 0;
> > > > > > > ```
> > > > > > > 
> > > > > > > And you flush just the DT:
> > > > > > > ```
> > > > > > > DTU->getDomTree(); // flush DT only
> > > > > > > ```
> > > > > > > 
> > > > > > > Now your indicies are the following:
> > > > > > > ```
> > > > > > > DTU->IndexDT = 3;
> > > > > > > DTU0>IndexPDT = 0;
> > > > > > > ```
> > > > > > > 
> > > > > > > And now the following update comes along:
> > > > > > > ```
> > > > > > > DTU->deleteEdgeRelaxed(A,B);
> > > > > > > ```
> > > > > > > 
> > > > > > > In the case of PDT you want nothing to happen: this update will remove the inverted update [0] from the queue and result in a no-op to PDT.  But in the case of the DT you need the update to happen: it's a change past the flush.  How do you reconcile this internally with only one queue?
> > > > > > > In the case of PDT you want nothing to happen: this update will remove the inverted update [0] from the queue and result in a no-op to PDT. But in the case of the DT you need the update to happen: it's a change past the flush. How do you reconcile this internally with only one queue?
> > > > > > In this case, DTU will only inspect updates after `std::max(IndexDT, IndexPDT)` and try to remove no-ops in that range.
> > > > > > 
> > > > > > Q: Why I choose this way, but not flush both trees when calling `getDT/getPDT/flush`?
> > > > > > A: In `include/llvm/Support/GenericDomTree.h`, comments of `applyUpdates`:
> > > > > >   /// Batch updates should be generally faster when performing longer sequences
> > > > > >   /// of updates than calling insertEdge/deleteEdge manually multiple times, as
> > > > > >   /// it can reorder the updates and remove redundant ones internally.
> > > > > > It seems that there is performance gain. So I prefer to preserve a longer sequence of updates.
> > > > > More info added to the above comment:
> > > > > I think the only thing users of both `DDT` and `DTU` need to be careful is that they mustn't submit any updates that is already in the CFG + already known by the DT/PDT before submitting updates.
> > > > > An example:
> > > > > ```
> > > > > // Lazy DTU/DDT
> > > > > // Edge A->B already in the CFG + DT knows that
> > > > > DTU.insertEdge(A, B); // Get queued
> > > > > ... // delete A->B in the CFG
> > > > > DTU.deleteEdge(A, B); // Cancel with the above one
> > > > > ```
> > > > > This behavior is not supposed to be a functionality at least in `DTU` now. I'm not sure whether there are people using the `DDT` APIs like this.
> > > > Understood, you are pessimizing early removal of paired updates. I strongly recommend you testing your assumptions about performance of long updates using the `CTMark` option of `test-suite` to reduce the chance of reverts due to performance degradation:
> > > > ```
> > > > cd path/to/cloned/test-suite
> > > > cmake ... \
> > > >     ... \
> > > >     -D TEST_SUITE_SUBDIRS=CTMark \
> > > >     -D TEST_SUITE_RUN_BENCHMARKS=0" \
> > > >     ... \
> > > >     $PWD
> > > > ```
> > > > Note I recommend this in addition to regular `test-suite` runs designed to detect crashes and `assert()`s.
> > > > This behavior is not supposed to be a functionality at least in DTU now. I'm not sure whether there are people using the DDT APIs like this.
> > > I can guarantee you this happens all over JumpThreading. The pass contains no information about prior state or previous failed attempts. I have seen cases where the same edge is inserted and removed 3 or more times across one complete run of the pass.  There is a reason why I was forced to implement `DDT` with the default as `Relaxed`. :)
> > > 
> > > 
> > > 
> > > 
> > I understand the case when trying to inspect the CFG to remove the `not really happened` ones in `jump-threading` because it won't be queued and affect following updates.
> > And I'm thinking whether there is anyone submitting `already happened and DT/PDT also knew` ones.
> > I'm confused with your reply. In the example I mentioned (DT already got notified before that A->B exists but we still submit one {Insert, A, B})
> > ```
> > // Lazy DTU/DDT
> > // Edge A->B already in the CFG + DT knows that
> > DTU.insertEdge(A, B); // Get queued {Insert, A, B}
> > ... // delete A->B in the CFG
> > DTU.deleteEdge(A, B); // Cancel with the above one {Delete, A, B} cancels {Insert, A, B}
> > assert (DTU.getDT().verify()); // Asserts! (DT doesn't know A->B is gone!)
> > ```
> > It can cause the state of DT not matching the state of the CFG.
> Thanks for your advice.
> But I doubt whether this test can show the performance degradation in the *current* LLVM codebase because
> first, this behavior only affects 
> 1. a DTU under Lazy UpdateStrategy with both DT and PDT 
> 2. trying to perform a long sequence of updates
> second, in the current LLVM codebase, only `adce` incrementally updates `PDT` with an extremely small update vector.
`CTMark` is a decent indicator of compile-time performance. It can help prevent patches being reverted due to compile-time benchmarks run by others.  I understand that this specific case is not exercised by anything other than your test cases, but when you (or someone else) actually starts using it this will be invaluable to verifying your hypothesis.

================
Comment at: lib/Transforms/Scalar/JumpThreading.cpp:420
   LoopHeaders.clear();
-  DDT->flush();
+  DTU->getDomTree();
   LVI->enableDT();
----------------
brzycki wrote:
> NutshellySima wrote:
> > NutshellySima wrote:
> > > brzycki wrote:
> > > > brzycki wrote:
> > > > > NutshellySima wrote:
> > > > > > NutshellySima wrote:
> > > > > > > brzycki wrote:
> > > > > > > > NutshellySima wrote:
> > > > > > > > > brzycki wrote:
> > > > > > > > > > NutshellySima wrote:
> > > > > > > > > > > brzycki wrote:
> > > > > > > > > > > > It's not obvious here the reasoning behind this statement. I recommend either changing this to `DTU->flush()` or adding a comment stating the call will force all updates to the internal DT/PDT trees.
> > > > > > > > > > > I think `DTU->flush()` should only be called when people need either
> > > > > > > > > > > 1. Flush all deleteBB awaiting deletion.
> > > > > > > > > > > 2. Do queries on both DT and PDT.
> > > > > > > > > > > 
> > > > > > > > > > > In the future, if DTU holds both DT and PDT in JumpThreading, call `DTU->getDomTree` which only flushes DT can pend updates on PDT as lazy as possible.
> > > > > > > > > > I'm concerned the complexity of updates will become unmanageable if you decide to take this route.  Some questions I have:
> > > > > > > > > > * How do you preserve the list of updates for one vs the other? Two separate queues?
> > > > > > > > > > * How do you decide to delete unnecessary updates from one and not the other?
> > > > > > > > > > * How do you manage the list of delted BBs? Are they only really deleted when both DT and PDT are flushed?
> > > > > > > > > > * How do you make sure you don't actually delete a BB early?
> > > > > > > > > > * How do you handle `Eager` vs `Lazy` updates?
> > > > > > > > > > * How do you handle `*Relaxed()` updates?
> > > > > > > > > > 
> > > > > > > > > > The longer you avoid flushing the further your updates become from the real CFG's shape. Debugging an assert where DT and PDT update at different times means the developer has to also carefully keep the state of DTU in mind instead of just the IR and the trees. I'm not sure if the possible saved overhead of updates to one or the other tree is worth all that added complexity.
> > > > > > > > > > How do you preserve the list of updates for one vs the other? Two separate queues?
> > > > > > > > > There is only one queue and two index to record the updates that have been applied and not applied yet.
> > > > > > > > > 
> > > > > > > > > | Index | 0 |  1 |  2 | 3
> > > > > > > > > | PendUpdates | o | o | o | o 
> > > > > > > > > |DTUpdateIndex | | | ↑ | |
> > > > > > > > > | PDTUpdateIndex |↑ | | | |
> > > > > > > > > 
> > > > > > > > > > How do you decide to delete unnecessary updates from one and not the other?
> > > > > > > > > If an update is applied by all available trees, it will be purged.
> > > > > > > > > 
> > > > > > > > > > How do you manage the list of deleted BBs? Are they only really deleted when both DT and PDT are flushed?
> > > > > > > > > Yes, more specifically `DTUpdateIndex == PDTUpdateIndex == PendUpdates.size()`. It's impossible to judge whether a specific DelBB is feasible to be deleted at a specific time point because some callers of delBB just
> > > > > > > > > ```
> > > > > > > > > BasicBlock *BB0 = ...;
> > > > > > > > > DTU.delBB (BB0);
> > > > > > > > > // Then do some actions on BB0
> > > > > > > > > ... ...
> > > > > > > > > DTU.flush(); // The caller expects BB0 be deleted now.
> > > > > > > > > ```
> > > > > > > > > > How do you make sure you don't actually delete a BB early?
> > > > > > > > > Mentioned above.
> > > > > > > > > > How do you handle `Eager` vs `Lazy` updates?
> > > > > > > > > `applyUpdates`: 
> > > > > > > > > `Lazy/Eager+ForceRemoveDuplicates` - Works the same as DDT, more specifically
> > > > > > > > > 1. Deduplicate, which is the main functionality of DDT. 
> > > > > > > > > 2. Inspect whether an edge is presented in CFG and decide whether to discard it. It is a must because the update sequence is unbalanced.
> > > > > > > > > `Eager` - Works the same as raw DT/PDT.
> > > > > > > > > `insertEdge/deleteEdge`:
> > > > > > > > > `Eager`: Works the same as raw DT/PDT and adds inspection whether the update is valid.
> > > > > > > > > `Lazy`: then submits into the list of updates rather than apply it.
> > > > > > > > > `recalculate`:
> > > > > > > > > `Lazy`: `flush()` + raw `recalculate()`
> > > > > > > > > `Eager`: raw `recalculate()`
> > > > > > > > > > How do you handle `*Relaxed()` updates?
> > > > > > > > > It behaves the same as the ones in DDT. It discards invalid updates quietly.
> > > > > > > > > 
> > > > > > > > > > The longer you avoid flushing the further your updates become from the real CFG's shape. Debugging an assert where DT and PDT update at different times means the developer has to also carefully keep the state of DTU in mind instead of just the IR and the trees. I'm not sure if the possible saved overhead of updates to one or the other tree is worth all that added complexity.
> > > > > > > > > 
> > > > > > > > > I don't think so. With current DDT, let me show an example
> > > > > > > > > ```
> > > > > > > > > // Changing the state of CFG begins. 
> > > > > > > > > ... // Modify CFG
> > > > > > > > > MergeBasicBlockIntoOnlyPred (..., &DDT); // Some error happens !
> > > > > > > > > // (You cannot do verification after MergeBasicBlockIntoOnlyPred because the CFG and DT is unsynced)
> > > > > > > > > ... // Modify DomTree
> > > > > > > > > ... // Modify CFG
> > > > > > > > > // Changing the state of CFG ends.
> > > > > > > > > ...
> > > > > > > > > DDT.flush().verify(); // Asserts !
> > > > > > > > > ```
> > > > > > > > > Now with DTU,
> > > > > > > > > ```
> > > > > > > > > // Changing the state of CFG begins.
> > > > > > > > > ... // Modify CFG
> > > > > > > > > MergeBasicBlockIntoOnlyPred (..., &DTU); // Some error happens !
> > > > > > > > > //  (You cannot do verification after MergeBasicBlockIntoOnlyPred because the CFG and DT is unsynced)
> > > > > > > > > ... // Modify DomTree
> > > > > > > > > ... // Modify CFG
> > > > > > > > > // Changing the state of CFG ends.
> > > > > > > > > ...
> > > > > > > > > DTU.getDomTree().verify(); // Asserts !
> > > > > > > > > ```
> > > > > > > > > I can't see the difference. They'll assert at the same location and all far away from where the bug happens.
> > > > > > > > Let's say you have a `DTU` using the `Lazy` strategy. The current set of internally queued pending updates are the following:
> > > > > > > > ```
> > > > > > > > [0] {Insert, A, B}
> > > > > > > > [1] {Delete, B, C}
> > > > > > > > [2] {Insert, B, D}
> > > > > > > > ```
> > > > > > > > 
> > > > > > > > And you have both a DT and PDT. Their Indicies are the following:
> > > > > > > > ```
> > > > > > > > DTU->IndexDT = 0;
> > > > > > > > DTU->IndexPDT = 0;
> > > > > > > > ```
> > > > > > > > 
> > > > > > > > And you flush just the DT:
> > > > > > > > ```
> > > > > > > > DTU->getDomTree(); // flush DT only
> > > > > > > > ```
> > > > > > > > 
> > > > > > > > Now your indicies are the following:
> > > > > > > > ```
> > > > > > > > DTU->IndexDT = 3;
> > > > > > > > DTU0>IndexPDT = 0;
> > > > > > > > ```
> > > > > > > > 
> > > > > > > > And now the following update comes along:
> > > > > > > > ```
> > > > > > > > DTU->deleteEdgeRelaxed(A,B);
> > > > > > > > ```
> > > > > > > > 
> > > > > > > > In the case of PDT you want nothing to happen: this update will remove the inverted update [0] from the queue and result in a no-op to PDT.  But in the case of the DT you need the update to happen: it's a change past the flush.  How do you reconcile this internally with only one queue?
> > > > > > > > In the case of PDT you want nothing to happen: this update will remove the inverted update [0] from the queue and result in a no-op to PDT. But in the case of the DT you need the update to happen: it's a change past the flush. How do you reconcile this internally with only one queue?
> > > > > > > In this case, DTU will only inspect updates after `std::max(IndexDT, IndexPDT)` and try to remove no-ops in that range.
> > > > > > > 
> > > > > > > Q: Why I choose this way, but not flush both trees when calling `getDT/getPDT/flush`?
> > > > > > > A: In `include/llvm/Support/GenericDomTree.h`, comments of `applyUpdates`:
> > > > > > >   /// Batch updates should be generally faster when performing longer sequences
> > > > > > >   /// of updates than calling insertEdge/deleteEdge manually multiple times, as
> > > > > > >   /// it can reorder the updates and remove redundant ones internally.
> > > > > > > It seems that there is performance gain. So I prefer to preserve a longer sequence of updates.
> > > > > > More info added to the above comment:
> > > > > > I think the only thing users of both `DDT` and `DTU` need to be careful is that they mustn't submit any updates that is already in the CFG + already known by the DT/PDT before submitting updates.
> > > > > > An example:
> > > > > > ```
> > > > > > // Lazy DTU/DDT
> > > > > > // Edge A->B already in the CFG + DT knows that
> > > > > > DTU.insertEdge(A, B); // Get queued
> > > > > > ... // delete A->B in the CFG
> > > > > > DTU.deleteEdge(A, B); // Cancel with the above one
> > > > > > ```
> > > > > > This behavior is not supposed to be a functionality at least in `DTU` now. I'm not sure whether there are people using the `DDT` APIs like this.
> > > > > Understood, you are pessimizing early removal of paired updates. I strongly recommend you testing your assumptions about performance of long updates using the `CTMark` option of `test-suite` to reduce the chance of reverts due to performance degradation:
> > > > > ```
> > > > > cd path/to/cloned/test-suite
> > > > > cmake ... \
> > > > >     ... \
> > > > >     -D TEST_SUITE_SUBDIRS=CTMark \
> > > > >     -D TEST_SUITE_RUN_BENCHMARKS=0" \
> > > > >     ... \
> > > > >     $PWD
> > > > > ```
> > > > > Note I recommend this in addition to regular `test-suite` runs designed to detect crashes and `assert()`s.
> > > > > This behavior is not supposed to be a functionality at least in DTU now. I'm not sure whether there are people using the DDT APIs like this.
> > > > I can guarantee you this happens all over JumpThreading. The pass contains no information about prior state or previous failed attempts. I have seen cases where the same edge is inserted and removed 3 or more times across one complete run of the pass.  There is a reason why I was forced to implement `DDT` with the default as `Relaxed`. :)
> > > > 
> > > > 
> > > > 
> > > > 
> > > I understand the case when trying to inspect the CFG to remove the `not really happened` ones in `jump-threading` because it won't be queued and affect following updates.
> > > And I'm thinking whether there is anyone submitting `already happened and DT/PDT also knew` ones.
> > > I'm confused with your reply. In the example I mentioned (DT already got notified before that A->B exists but we still submit one {Insert, A, B})
> > > ```
> > > // Lazy DTU/DDT
> > > // Edge A->B already in the CFG + DT knows that
> > > DTU.insertEdge(A, B); // Get queued {Insert, A, B}
> > > ... // delete A->B in the CFG
> > > DTU.deleteEdge(A, B); // Cancel with the above one {Delete, A, B} cancels {Insert, A, B}
> > > assert (DTU.getDT().verify()); // Asserts! (DT doesn't know A->B is gone!)
> > > ```
> > > It can cause the state of DT not matching the state of the CFG.
> > Thanks for your advice.
> > But I doubt whether this test can show the performance degradation in the *current* LLVM codebase because
> > first, this behavior only affects 
> > 1. a DTU under Lazy UpdateStrategy with both DT and PDT 
> > 2. trying to perform a long sequence of updates
> > second, in the current LLVM codebase, only `adce` incrementally updates `PDT` with an extremely small update vector.
> `CTMark` is a decent indicator of compile-time performance. It can help prevent patches being reverted due to compile-time benchmarks run by others.  I understand that this specific case is not exercised by anything other than your test cases, but when you (or someone else) actually starts using it this will be invaluable to verifying your hypothesis.
The longer you defer a flush the higher the chance of seeing very strange updates and changes between the CFG and the state of `DT`. I cannot give you a specific example at this time but I do recall state changes just as you described. Here is one code path I am concerned about:
* `JumpThreading::ProcessBlock()` decides to split `A` into `A -> A'` with a single unconditional branch between them.
* The rest of `ProcessBlock()` is unable to thread the jump and exits with failure.
* Another, unrelated block `Q` is evaluated and `ProcessBlock()` successfully calls `ThreadEdge()`. With the SSA updates they added a `DDT->flush()` inside `ThreadEdge()` and would flush to `DT`. This patch is currently reverted, but can come back. It's also possible anyone can add a patch to add `DDT->flush()` on another call path under `ProcessBlock()`.
* `runImpl()` comes back to processing `A` again and goes to the end of `JumpThreading::runImpl()`. It sees `A` has a single successor `A'` and calls ` TryToSimplifyUncondBranchFromEmptyBlock()` which re-fuses `A`.

If you defer your updates across this entire set of changes you will see:
```
// CreateBB(A');
DTU.insertEdge(A, A');
// Give up on A
...
// Work on Q, call a flush() in ThreadEdge() or somehwere else under ProcessBlock()
...
// Back on A
DTU.deleteEdge(A, A');
DTU.deleteBB(A');
```
The interactions between `JumpThreading`, the state of the IR, the current state of the `DT`, and the state of pending `DDT` updates can lead to very complex states between them. You cannot safely assume no one will do updates that don't make sense or seem to be extra work. It's also why I added the Inversion check early on. For large functions detecting those early helped save a few percent of performance time.  The more real world code you run through your algorithm will verify if your assumptions are valid.

================
Comment at: lib/Transforms/Scalar/JumpThreading.cpp:420
   LoopHeaders.clear();
-  DDT->flush();
+  DTU->getDomTree();
   LVI->enableDT();
----------------
brzycki wrote:
> brzycki wrote:
> > NutshellySima wrote:
> > > NutshellySima wrote:
> > > > brzycki wrote:
> > > > > brzycki wrote:
> > > > > > NutshellySima wrote:
> > > > > > > NutshellySima wrote:
> > > > > > > > brzycki wrote:
> > > > > > > > > NutshellySima wrote:
> > > > > > > > > > brzycki wrote:
> > > > > > > > > > > NutshellySima wrote:
> > > > > > > > > > > > brzycki wrote:
> > > > > > > > > > > > > It's not obvious here the reasoning behind this statement. I recommend either changing this to `DTU->flush()` or adding a comment stating the call will force all updates to the internal DT/PDT trees.
> > > > > > > > > > > > I think `DTU->flush()` should only be called when people need either
> > > > > > > > > > > > 1. Flush all deleteBB awaiting deletion.
> > > > > > > > > > > > 2. Do queries on both DT and PDT.
> > > > > > > > > > > > 
> > > > > > > > > > > > In the future, if DTU holds both DT and PDT in JumpThreading, call `DTU->getDomTree` which only flushes DT can pend updates on PDT as lazy as possible.
> > > > > > > > > > > I'm concerned the complexity of updates will become unmanageable if you decide to take this route.  Some questions I have:
> > > > > > > > > > > * How do you preserve the list of updates for one vs the other? Two separate queues?
> > > > > > > > > > > * How do you decide to delete unnecessary updates from one and not the other?
> > > > > > > > > > > * How do you manage the list of delted BBs? Are they only really deleted when both DT and PDT are flushed?
> > > > > > > > > > > * How do you make sure you don't actually delete a BB early?
> > > > > > > > > > > * How do you handle `Eager` vs `Lazy` updates?
> > > > > > > > > > > * How do you handle `*Relaxed()` updates?
> > > > > > > > > > > 
> > > > > > > > > > > The longer you avoid flushing the further your updates become from the real CFG's shape. Debugging an assert where DT and PDT update at different times means the developer has to also carefully keep the state of DTU in mind instead of just the IR and the trees. I'm not sure if the possible saved overhead of updates to one or the other tree is worth all that added complexity.
> > > > > > > > > > > How do you preserve the list of updates for one vs the other? Two separate queues?
> > > > > > > > > > There is only one queue and two index to record the updates that have been applied and not applied yet.
> > > > > > > > > > 
> > > > > > > > > > | Index | 0 |  1 |  2 | 3
> > > > > > > > > > | PendUpdates | o | o | o | o 
> > > > > > > > > > |DTUpdateIndex | | | ↑ | |
> > > > > > > > > > | PDTUpdateIndex |↑ | | | |
> > > > > > > > > > 
> > > > > > > > > > > How do you decide to delete unnecessary updates from one and not the other?
> > > > > > > > > > If an update is applied by all available trees, it will be purged.
> > > > > > > > > > 
> > > > > > > > > > > How do you manage the list of deleted BBs? Are they only really deleted when both DT and PDT are flushed?
> > > > > > > > > > Yes, more specifically `DTUpdateIndex == PDTUpdateIndex == PendUpdates.size()`. It's impossible to judge whether a specific DelBB is feasible to be deleted at a specific time point because some callers of delBB just
> > > > > > > > > > ```
> > > > > > > > > > BasicBlock *BB0 = ...;
> > > > > > > > > > DTU.delBB (BB0);
> > > > > > > > > > // Then do some actions on BB0
> > > > > > > > > > ... ...
> > > > > > > > > > DTU.flush(); // The caller expects BB0 be deleted now.
> > > > > > > > > > ```
> > > > > > > > > > > How do you make sure you don't actually delete a BB early?
> > > > > > > > > > Mentioned above.
> > > > > > > > > > > How do you handle `Eager` vs `Lazy` updates?
> > > > > > > > > > `applyUpdates`: 
> > > > > > > > > > `Lazy/Eager+ForceRemoveDuplicates` - Works the same as DDT, more specifically
> > > > > > > > > > 1. Deduplicate, which is the main functionality of DDT. 
> > > > > > > > > > 2. Inspect whether an edge is presented in CFG and decide whether to discard it. It is a must because the update sequence is unbalanced.
> > > > > > > > > > `Eager` - Works the same as raw DT/PDT.
> > > > > > > > > > `insertEdge/deleteEdge`:
> > > > > > > > > > `Eager`: Works the same as raw DT/PDT and adds inspection whether the update is valid.
> > > > > > > > > > `Lazy`: then submits into the list of updates rather than apply it.
> > > > > > > > > > `recalculate`:
> > > > > > > > > > `Lazy`: `flush()` + raw `recalculate()`
> > > > > > > > > > `Eager`: raw `recalculate()`
> > > > > > > > > > > How do you handle `*Relaxed()` updates?
> > > > > > > > > > It behaves the same as the ones in DDT. It discards invalid updates quietly.
> > > > > > > > > > 
> > > > > > > > > > > The longer you avoid flushing the further your updates become from the real CFG's shape. Debugging an assert where DT and PDT update at different times means the developer has to also carefully keep the state of DTU in mind instead of just the IR and the trees. I'm not sure if the possible saved overhead of updates to one or the other tree is worth all that added complexity.
> > > > > > > > > > 
> > > > > > > > > > I don't think so. With current DDT, let me show an example
> > > > > > > > > > ```
> > > > > > > > > > // Changing the state of CFG begins. 
> > > > > > > > > > ... // Modify CFG
> > > > > > > > > > MergeBasicBlockIntoOnlyPred (..., &DDT); // Some error happens !
> > > > > > > > > > // (You cannot do verification after MergeBasicBlockIntoOnlyPred because the CFG and DT is unsynced)
> > > > > > > > > > ... // Modify DomTree
> > > > > > > > > > ... // Modify CFG
> > > > > > > > > > // Changing the state of CFG ends.
> > > > > > > > > > ...
> > > > > > > > > > DDT.flush().verify(); // Asserts !
> > > > > > > > > > ```
> > > > > > > > > > Now with DTU,
> > > > > > > > > > ```
> > > > > > > > > > // Changing the state of CFG begins.
> > > > > > > > > > ... // Modify CFG
> > > > > > > > > > MergeBasicBlockIntoOnlyPred (..., &DTU); // Some error happens !
> > > > > > > > > > //  (You cannot do verification after MergeBasicBlockIntoOnlyPred because the CFG and DT is unsynced)
> > > > > > > > > > ... // Modify DomTree
> > > > > > > > > > ... // Modify CFG
> > > > > > > > > > // Changing the state of CFG ends.
> > > > > > > > > > ...
> > > > > > > > > > DTU.getDomTree().verify(); // Asserts !
> > > > > > > > > > ```
> > > > > > > > > > I can't see the difference. They'll assert at the same location and all far away from where the bug happens.
> > > > > > > > > Let's say you have a `DTU` using the `Lazy` strategy. The current set of internally queued pending updates are the following:
> > > > > > > > > ```
> > > > > > > > > [0] {Insert, A, B}
> > > > > > > > > [1] {Delete, B, C}
> > > > > > > > > [2] {Insert, B, D}
> > > > > > > > > ```
> > > > > > > > > 
> > > > > > > > > And you have both a DT and PDT. Their Indicies are the following:
> > > > > > > > > ```
> > > > > > > > > DTU->IndexDT = 0;
> > > > > > > > > DTU->IndexPDT = 0;
> > > > > > > > > ```
> > > > > > > > > 
> > > > > > > > > And you flush just the DT:
> > > > > > > > > ```
> > > > > > > > > DTU->getDomTree(); // flush DT only
> > > > > > > > > ```
> > > > > > > > > 
> > > > > > > > > Now your indicies are the following:
> > > > > > > > > ```
> > > > > > > > > DTU->IndexDT = 3;
> > > > > > > > > DTU0>IndexPDT = 0;
> > > > > > > > > ```
> > > > > > > > > 
> > > > > > > > > And now the following update comes along:
> > > > > > > > > ```
> > > > > > > > > DTU->deleteEdgeRelaxed(A,B);
> > > > > > > > > ```
> > > > > > > > > 
> > > > > > > > > In the case of PDT you want nothing to happen: this update will remove the inverted update [0] from the queue and result in a no-op to PDT.  But in the case of the DT you need the update to happen: it's a change past the flush.  How do you reconcile this internally with only one queue?
> > > > > > > > > In the case of PDT you want nothing to happen: this update will remove the inverted update [0] from the queue and result in a no-op to PDT. But in the case of the DT you need the update to happen: it's a change past the flush. How do you reconcile this internally with only one queue?
> > > > > > > > In this case, DTU will only inspect updates after `std::max(IndexDT, IndexPDT)` and try to remove no-ops in that range.
> > > > > > > > 
> > > > > > > > Q: Why I choose this way, but not flush both trees when calling `getDT/getPDT/flush`?
> > > > > > > > A: In `include/llvm/Support/GenericDomTree.h`, comments of `applyUpdates`:
> > > > > > > >   /// Batch updates should be generally faster when performing longer sequences
> > > > > > > >   /// of updates than calling insertEdge/deleteEdge manually multiple times, as
> > > > > > > >   /// it can reorder the updates and remove redundant ones internally.
> > > > > > > > It seems that there is performance gain. So I prefer to preserve a longer sequence of updates.
> > > > > > > More info added to the above comment:
> > > > > > > I think the only thing users of both `DDT` and `DTU` need to be careful is that they mustn't submit any updates that is already in the CFG + already known by the DT/PDT before submitting updates.
> > > > > > > An example:
> > > > > > > ```
> > > > > > > // Lazy DTU/DDT
> > > > > > > // Edge A->B already in the CFG + DT knows that
> > > > > > > DTU.insertEdge(A, B); // Get queued
> > > > > > > ... // delete A->B in the CFG
> > > > > > > DTU.deleteEdge(A, B); // Cancel with the above one
> > > > > > > ```
> > > > > > > This behavior is not supposed to be a functionality at least in `DTU` now. I'm not sure whether there are people using the `DDT` APIs like this.
> > > > > > Understood, you are pessimizing early removal of paired updates. I strongly recommend you testing your assumptions about performance of long updates using the `CTMark` option of `test-suite` to reduce the chance of reverts due to performance degradation:
> > > > > > ```
> > > > > > cd path/to/cloned/test-suite
> > > > > > cmake ... \
> > > > > >     ... \
> > > > > >     -D TEST_SUITE_SUBDIRS=CTMark \
> > > > > >     -D TEST_SUITE_RUN_BENCHMARKS=0" \
> > > > > >     ... \
> > > > > >     $PWD
> > > > > > ```
> > > > > > Note I recommend this in addition to regular `test-suite` runs designed to detect crashes and `assert()`s.
> > > > > > This behavior is not supposed to be a functionality at least in DTU now. I'm not sure whether there are people using the DDT APIs like this.
> > > > > I can guarantee you this happens all over JumpThreading. The pass contains no information about prior state or previous failed attempts. I have seen cases where the same edge is inserted and removed 3 or more times across one complete run of the pass.  There is a reason why I was forced to implement `DDT` with the default as `Relaxed`. :)
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > I understand the case when trying to inspect the CFG to remove the `not really happened` ones in `jump-threading` because it won't be queued and affect following updates.
> > > > And I'm thinking whether there is anyone submitting `already happened and DT/PDT also knew` ones.
> > > > I'm confused with your reply. In the example I mentioned (DT already got notified before that A->B exists but we still submit one {Insert, A, B})
> > > > ```
> > > > // Lazy DTU/DDT
> > > > // Edge A->B already in the CFG + DT knows that
> > > > DTU.insertEdge(A, B); // Get queued {Insert, A, B}
> > > > ... // delete A->B in the CFG
> > > > DTU.deleteEdge(A, B); // Cancel with the above one {Delete, A, B} cancels {Insert, A, B}
> > > > assert (DTU.getDT().verify()); // Asserts! (DT doesn't know A->B is gone!)
> > > > ```
> > > > It can cause the state of DT not matching the state of the CFG.
> > > Thanks for your advice.
> > > But I doubt whether this test can show the performance degradation in the *current* LLVM codebase because
> > > first, this behavior only affects 
> > > 1. a DTU under Lazy UpdateStrategy with both DT and PDT 
> > > 2. trying to perform a long sequence of updates
> > > second, in the current LLVM codebase, only `adce` incrementally updates `PDT` with an extremely small update vector.
> > `CTMark` is a decent indicator of compile-time performance. It can help prevent patches being reverted due to compile-time benchmarks run by others.  I understand that this specific case is not exercised by anything other than your test cases, but when you (or someone else) actually starts using it this will be invaluable to verifying your hypothesis.
> The longer you defer a flush the higher the chance of seeing very strange updates and changes between the CFG and the state of `DT`. I cannot give you a specific example at this time but I do recall state changes just as you described. Here is one code path I am concerned about:
> * `JumpThreading::ProcessBlock()` decides to split `A` into `A -> A'` with a single unconditional branch between them.
> * The rest of `ProcessBlock()` is unable to thread the jump and exits with failure.
> * Another, unrelated block `Q` is evaluated and `ProcessBlock()` successfully calls `ThreadEdge()`. With the SSA updates they added a `DDT->flush()` inside `ThreadEdge()` and would flush to `DT`. This patch is currently reverted, but can come back. It's also possible anyone can add a patch to add `DDT->flush()` on another call path under `ProcessBlock()`.
> * `runImpl()` comes back to processing `A` again and goes to the end of `JumpThreading::runImpl()`. It sees `A` has a single successor `A'` and calls ` TryToSimplifyUncondBranchFromEmptyBlock()` which re-fuses `A`.
> 
> If you defer your updates across this entire set of changes you will see:
> ```
> // CreateBB(A');
> DTU.insertEdge(A, A');
> // Give up on A
> ...
> // Work on Q, call a flush() in ThreadEdge() or somehwere else under ProcessBlock()
> ...
> // Back on A
> DTU.deleteEdge(A, A');
> DTU.deleteBB(A');
> ```
> The interactions between `JumpThreading`, the state of the IR, the current state of the `DT`, and the state of pending `DDT` updates can lead to very complex states between them. You cannot safely assume no one will do updates that don't make sense or seem to be extra work. It's also why I added the Inversion check early on. For large functions detecting those early helped save a few percent of performance time.  The more real world code you run through your algorithm will verify if your assumptions are valid.
Further comments on your hypothetical code:
```
// Lazy DTU/DDT
// Edge A->B already in the CFG + DT knows that
DTU.insertEdge(A, B); // Get queued {Insert, A, B}
... // delete A->B in the CFG
DTU.deleteEdge(A, B); // Cancel with the above one {Delete, A, B} cancels {Insert, A, B}
assert (DTU.getDT().verify()); // Asserts! (DT doesn't know A->B is gone!)
```
The reason I added the silent checks against the CFG were exactly because of cases like this. I don't remember if they were in `ProcessBlock()` or somewhere under `Local.cpp` (which is called less in the current implementation).

This is why `applyUpdates` checked things in a specific order:
# do not enqueue duplicates within the `ArrayRef`. The call to `applyUpdate()` is expensive and if we can avoid 1 or 2 calls it's worth checking first.
# Check edge state in the cfg and silent discard updates that don't reflect the current IR state
# Check for duplicated update: discard. The internal queue might have the update that the batch call did not.
# Check for the inverted update: discard both.

Every one of these checks I hit with code through `JumpThreading` and some of them are there to prevent `assert()` failures.

I hope this answers your concern and I apologize if I did not answer your question. I'm struggling a bit to understand why we may not be understanding each other.

https://reviews.llvm.org/D48967