[PATCH] D113424: ext-tsp basic block layout

Fri Nov 19 15:36:11 PST 2021

spupyrev added inline comments.

================
Comment at: llvm/lib/CodeGen/MachineBlockPlacement.cpp:3408
+    // Find a new placement and modify the layout of the blocks in the function.
+    applyExtTsp();
+
----------------
spupyrev wrote:
> davidxl wrote:
> > spupyrev wrote:
> > > davidxl wrote:
> > > > Why do the extTSP layout after the existing layout? It seems like a waste of compile time.
> > > > 
> > > > Or the intention is to keep the taildup functionality there? However the tail dup decisions depend on layout decisions, so it is probably better integrated with extTSP.  Also taildup's profile update may not be ideal affecting later layout score computation.
> > > I don't have a strong opinion here but here is my intuition. Ext-tsp is designed to "improve" the existing layout with the help of profile data. When there are ties or when the profile data is absent, the algorithm fallbacks to the original ordering of basic blocks. Thus, we want the original order to be reasonable. So we first apply the existing machine block placement, which applies several sane tricks like loop rotation, and then improve it with ext-tsp.
> > > 
> > > Keeping the existing tail duplication also sounds reasonable to me; though this optimization could be applied separately. Btw, we are working on improving and replacing the existing taildup heuristic; i'll share diffs/prototypes if/when it produces good results.
> > > 
> > When you say extTSP relies on existing layout to be reasonable, do you have performance data to back it up? If the profile is not available, it is likely they are cold.    
> > 
> > Also extTSP should be able to do 'loop rotation' by splitting during chain merge, are there cases it is not handled?
> Running ext-tsp as a postprocessing is a relatively minor "optimization" (that's why I said "I don't have strong opinion here..."). If you think we should run it instead of the existing algorithm, i am happy to do so.
> 
> It is mostly to avoid some corner cases, e.g.: A user supplies an incomplete (or even empty) sampling-based profile, which only covers a part of hot functions. In that case ext-tsp doesn't modify the layout but the existing MachineBlockPlacement can apply some tricks. For instr-PGO, that's not an issue.
> Another scenario is when there are "ties", e.g., block orders with equal objective. For example, a function with 3 blocks and 2 equal-weight jumps, A->B, A->C. In that case, ext-tsp cannot decide between [A, B, C] and [A, C, B], and we probably want to use some other heuristics.
> 
> I am not aware of "simple" cases when ext-tsp produces sub-optimal result; it will rotate loops, if that's dictated by profile counts. However, it is a heuristic for an np-hard problem so we should not expect it to work optimally in 100% of cases.
I just tried to to measure the impact of changing the algorithm from being a post-processing step to replacing the existing approach (buildCFGChains). Here are the results:

```
benchmark1:   task-clock [delta: 24.27 ± 18.78, delta(%): 0.4010 ± 0.3103, p-value: 0.103827]
benchmark2:   task-clock [delta: 54.77 ± 20.91, delta(%): 0.9070 ± 0.3463, p-value: 0.000044]
benchmark3:   task-clock [delta: 2.43 ± 20.20, delta(%): 0.0279 ± 0.2314, p-value: 0.680156]
```
So for two benchmarks, the difference is non-statsig, for one, there is a ~0.9% regression. I feel like we should keep the implementation as is.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D113424/new/

https://reviews.llvm.org/D113424