[PATCH] D103289: A post-processing for BFI inference

Sergey Pupyrev via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Jun 9 10:17:33 PDT 2021


spupyrev marked 3 inline comments as done.
spupyrev added inline comments.


================
Comment at: llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h:1495
+    if (Change > Threshold) {
+      ActiveSet.push(I);
+      IsActive[I] = true;
----------------
hoy wrote:
> Wondering if it makes sense to not set `I` active. When `I` gets a noticeable update on its counts, its successors should be reprocessed and thus set active. But I'm not sure `I` itself should be reprocessed.
This is a very good question, thanks! 

I had exactly the same feeling and tried to modify this part as suggested. Unfortunately, it does result in (significantly) slower convergence in some instances, while not providing noticeable benefits. I don't have a rigorous explanation (the algorithm is a heuristic anyway), but here is my intuition:

We update the frequencies of blocks in some order, which is dictated by `ActiveSet` (currently that's simply a queue). This order does affect the speed of convergence: for example, we want to prioritize updating the frequencies of blocks that are part of a hot loop. If at some iteration we modify the frequency of `I`, then there is a higher chance that block `I` will need to be updated again later. Thus, we explicitly add it to the queue so that it's updated again as soon as possible.

There are likely alternative strategies here, e.g., a priority-based queue and/or a smarter strategy for deciding when `I` needs to be updated. I played quite a bit with various versions but couldn't get significant wins over the default (simplest) strategy, so let's keep this question as future work.
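For reference, here is a minimal self-contained sketch of this queue-based strategy, using plain doubles instead of `Scaled64`; all names and signatures are illustrative, not the exact ones from the patch:

```
#include <cmath>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Sketch of the queue-based inference loop. Preds[I] holds (J, Prob[J][I])
// pairs for jumps J -> I; Succs[I] holds the successors of I.
void iterateInference(
    std::vector<double> &Freq,
    const std::vector<std::vector<size_t>> &Succs,
    const std::vector<std::vector<std::pair<size_t, double>>> &Preds,
    double Threshold) {
  std::queue<size_t> ActiveSet;
  std::vector<bool> IsActive(Freq.size(), true);
  for (size_t I = 0; I < Freq.size(); ++I)
    ActiveSet.push(I);

  while (!ActiveSet.empty()) {
    size_t I = ActiveSet.front();
    ActiveSet.pop();
    IsActive[I] = false;

    // New frequency is the inflow from non-self predecessors, corrected
    // for self-edges (the adjustment discussed further down this review).
    double NewFreq = 0, SelfProb = 0;
    for (const auto &[J, Prob] : Preds[I]) {
      if (J == I)
        SelfProb += Prob;
      else
        NewFreq += Freq[J] * Prob;
    }
    if (SelfProb < 1.0)
      NewFreq /= 1.0 - SelfProb;

    if (std::abs(NewFreq - Freq[I]) > Threshold) {
      // Re-enqueue I itself: a block whose frequency just changed
      // noticeably is likely to change again, so update it soon.
      ActiveSet.push(I);
      IsActive[I] = true;
      // Successors' inflow depends on Freq[I], so reprocess them too.
      for (size_t S : Succs[I])
        if (!IsActive[S]) {
          ActiveSet.push(S);
          IsActive[S] = true;
        }
    }
    Freq[I] = NewFreq;
  }
}
```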


================
Comment at: llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h:1595
+      // Ignore parallel edges between BB and SI blocks
+      if (UniqueSuccs.find(SI) != UniqueSuccs.end())
+        continue;
----------------
hoy wrote:
> Should the probability of parallel edges be accumulated?
In my tests, parallel edges always come with exactly the same probability, and their sum might exceed 1.0. I guess that's an assumption/invariant used in BPI.
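For illustration, here is a hedged sketch of that deduplication, under the same-probability invariant mentioned above; the types and names are made up for the example, not taken from the patch:

```
#include <set>
#include <vector>

struct Edge {
  int Succ;    // successor block id
  double Prob; // branch probability of this edge
};

// Keep each successor once; since parallel edges carry the same
// probability, skipping rather than accumulating them is intended.
std::vector<Edge> dedupParallelEdges(const std::vector<Edge> &Edges) {
  std::set<int> UniqueSuccs;
  std::vector<Edge> Result;
  for (const Edge &E : Edges)
    if (UniqueSuccs.insert(E.Succ).second)
      Result.push_back(E);
  return Result;
}
```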


================
Comment at: llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h:1489
+    }
+    if (OneMinusSelfProb != Scaled64::getOne())
+      NewFreq /= OneMinusSelfProb;
----------------
wenlei wrote:
> davidxl wrote:
> > spupyrev wrote:
> > > davidxl wrote:
> > > > spupyrev wrote:
> > > > > davidxl wrote:
> > > > > > spupyrev wrote:
> > > > > > > davidxl wrote:
> > > > > > > > spupyrev wrote:
> > > > > > > > > davidxl wrote:
> > > > > > > > > > spupyrev wrote:
> > > > > > > > > > > davidxl wrote:
> > > > > > > > > > > > Does it apply to other backedges too?
> > > > > > > > > > > not sure I fully understand the question, but we need an adjustment only for self-edges; blocks without self-edges don't need any post-processing
> > > > > > > > > > > 
> > > > > > > > > > > I added a short comment before the loop
> > > > > > > > > > `NewFreq /= OneMinusSelfProb` looks like multiplying the block freq (one-iteration loop) by the average trip count -- that is why I asked if this applies to other backedges.
> > > > > > > > > Here is the relevant math:
> > > > > > > > > 
> > > > > > > > > we want to find a new frequency for block I, `Freq[I]`, such that it is equal to `\sum Freq[J] * Prob[J][I]`, where the sum is taken over all (incoming) jumps (J -> I). These are "ideal" frequencies that BFI is trying to compute.
> > > > > > > > > 
> > > > > > > > > Clearly, if the I-th block has no self-edges, then we simply assign `Freq[I] := \sum Freq[J] * Prob[J][I]` (that is, no adjustment). However, if there are self-edges, we need to assign `Freq[I] := (\sum_{J != I} Freq[J] * Prob[J][I]) / (1 - Prob[I][I])`, where the sum excludes the self-edge (this is the adjustment in the code).
> > > > > > > > I wonder why the special treatment is needed in the first place.
> > > > > > > > 
> > > > > > > > Suppose we have 
> > > > > > > > 
> > > > > > > > ```
> > > > > > > >  BB1 (init freq = 50)
> > > > > > > >   |
> > > > > > > >   v
> > > > > > > >  BB2 (init freq = 0) <---+
> > > > > > > >   |      \ 90%           |
> > > > > > > >   | 10%   \______________|
> > > > > > > >   v
> > > > > > > > ```
> > > > > > > > 
> > > > > > > > With iterative fixup, BB2's frequency will converge to 500, which is the right value without any special handling.
> > > > > > > Excellent example!
> > > > > > > 
> > > > > > > The correct inference here is `Freq[BB1] = 50, Freq[BB2] = 500`, which is found after 5 iterations with the diff. If we remove the self-edge adjustment, we don't get the right result: it converges to `Freq[BB1] = 50, Freq[BB2] = 50` after ~100 iterations. (Observe that we do modify the frequency of the entry block; it is not fixed.)
> > > > > > > 
> > > > > > > In general, I do not have a proof that the Markov chain always converges to the desired stationary point, if we incorrectly update frequencies (e.g., w/o the self-edge adjustment) -- I suspect it does not.
> > > > > > By entry frequency, do you mean BB1's frequency? BB1 won't be active after the first iteration, right?
> > > > > Yes I meant BB1's frequency.
> > > > > 
> > > > > Notice that in order to create a valid Markov chain, we need to add jumps from all exits back to the entry; in this case, from BB2 to BB1. So BB1 will be active on later iterations.
> > > > Can you verify if it still works without the adjustment: in the small example, split BB2 into two BBs.
> > > I've commented above:
> > > 
> > > > If we remove the self-edge adjustment, we don't get the right result: it converges to Freq[BB1] = 50, Freq[BB2] = 50 after ~100 iterations. 
> > > > In general, I do not have a proof that the Markov chain always converges to the desired stationary point, if we incorrectly update frequencies (e.g., w/o the self-edge adjustment) -- I suspect it does not.
> > > 
> > > What is the concern/question here? In my mind, this is not a "fix/hack" but the correct way of applying iterative inference.
> > There are no major concerns and the patch is almost good to go in. Just want to make sure the algorithm works for all cases.
> Does the self-probability map to the damping factor in the original PageRank?
No, I don't think the damping factor is the same.

Self-edges are regular jumps in the CFG where the source and destination blocks coincide (they are not super frequent, e.g., in SPEC, but they do appear sometimes). You can always replace a self-edge B1->B1 with two jumps B1->BX and BX->B1, where BX is a "dummy" block containing exactly one incoming and one outgoing jump. Then the inference problem on the modified CFG (which contains no self-edges) is equivalent to the original problem. This transformation also shows that we cannot simply ignore self-edges, as the inference result might change.
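To make the adjustment concrete, here is the computation on davidxl's example above, as a sketch with plain doubles rather than the patch's `Scaled64` arithmetic:

```
// Fixed point of Freq = Inflow + SelfProb * Freq, solved directly:
double Inflow = 50.0;                    // \sum_{J != I} Freq[J] * Prob[J][I]
double SelfProb = 0.9;                   // Prob[I][I]
double Freq = Inflow / (1.0 - SelfProb); // = 500, matching the example
```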


================
Comment at: llvm/include/llvm/Analysis/BlockFrequencyInfoImpl.h:1388
+  // Extract initial frequencies for the blocks
+  DenseMap<const BlockT *, size_t> BlockIndex;
+  auto Freq = std::vector<Scaled64>(HotBlocks.size());
----------------
wenlei wrote:
> spupyrev wrote:
> > davidxl wrote:
> > > why is this map needed (which adds a layer of indirection)?
> > The map is used to index successors/predecessors of "hot" blocks, see line 1603.
> > 
> > As an optimization, we don't process all the blocks in a function but only those that can be reached from the entry via branches with a positive probability. These are `HotBlocks` in the code. Typically, the number of HotBlocks is 2x-5x smaller than the total number of blocks in the function. In order to find an index of a block within the list, we either need to do a linear scan over HotBlocks, or have such an extra map.
> I think we could avoid having the index and extra map if the ProbMatrix (and other data structures) used block pointers instead of indices as block identifiers, and still remove cold blocks in the processing - i.e., replacing various vectors with `map<BasicBlock*, ..>`. That may be slightly more readable, but using indices as identifiers is closer to the underlying math. Either way is fine to me.
Sure, that's an option, but map access is costlier than using raw indices. Since the hottest loop of the implementation needs such an access, (I bet) the change would yield a perf loss.
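As a rough illustration of the trade-off, a sketch of the indexing scheme, using `std::unordered_map` and a placeholder `BlockT` instead of the patch's `DenseMap` and template block type:

```
#include <cstddef>
#include <unordered_map>
#include <vector>

struct BlockT; // placeholder for the template block type

// Pay the hash lookups once, when building the index of hot blocks...
std::unordered_map<const BlockT *, size_t>
buildBlockIndex(const std::vector<const BlockT *> &HotBlocks) {
  std::unordered_map<const BlockT *, size_t> BlockIndex;
  for (size_t I = 0; I < HotBlocks.size(); ++I)
    BlockIndex[HotBlocks[I]] = I;
  return BlockIndex;
}

// ...so the hot inference loop can address the frequency vector and the
// probability matrix by raw size_t indices, with no per-access lookups.
```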


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103289/new/

https://reviews.llvm.org/D103289


