[PATCH] D113424: ext-tsp basic block layout

Tue Nov 16 08:21:57 PST 2021

spupyrev marked 18 inline comments as done.
spupyrev added a comment.

> Re memory utilization:

I am not sure what exactly the concern is. We keep all the allocated objects as the fields of ExtTSPImpl; no new objects should be created. When we merge two chains, all the jumps from one chain are moved to another one; and one of the chains is cleared. So the overall space is proportional to (|blocks| + |jumps|), unless there is a bug.

To validate correctness, I ran the algorithm on large synthetic instances (with up to 100K blocks); the memory didn't go beyond a few tens of MBs. In fact, it is hard to exactly measure the memory consumption, as the runtime of the implementation is super-linear and doesn't scale to huge instances.

> Re perf tests on IR/CSIR:

Here are my measurements for clang-10 binary with different PGO modes. Of course we should keep in mind that the numbers won't fully generalize to other binaries/benchmarks and only provide one data point.

Comparing clang-10 using 200 iterations on input1.cpp

  [test] -mllvm -enable-ext-tsp-block-placement=1
  [control] -mllvm -enable-ext-tsp-block-placement=0
  -DLLVM_BUILD_INSTRUMENTED=FE
    task-clock      :         5942 (+- 0.19%) vs         6027 (+- 0.19%) [diff = -1.4170%]
    binary_size     :         3316628472      vs         3325095000      [diff = -0.2546%]
  -DLLVM_BUILD_INSTRUMENTED=IR
    task-clock      :         5793 (+- 0.20%) vs         5876 (+- 0.20%) [diff = -1.4050%]
    binary_size     :         3305880848      vs         3325234728      [diff = -0.5820%]
  -DLLVM_BUILD_INSTRUMENTED=CSIR
    task-clock      :         5705 (+- 0.26%) vs         5757 (+- 0.26%) [diff = -0.8985%]
    binary_size     :         3312445120      vs         3329071760      [diff = -0.4994%]
  CSSPGO
    task-clock      :         6110 (+- 0.37%) vs         6226 (+- 0.37%) [diff = -1.8617%]
    binary_size     :         3995536616      vs         3980411480      [diff = +0.3799%]
  AutoFDO
    task-clock      :         5711 (+- 0.19%) vs         5797 (+- 0.19%) [diff = -1.4882%]
    binary_size     :         3707959712      vs         3694537688      [diff = +0.3632%]

================
Comment at: llvm/lib/CodeGen/MachineBlockPlacement.cpp:3408
+    // Find a new placement and modify the layout of the blocks in the function.
+    applyExtTsp();
+
----------------
davidxl wrote:
> Why do the extTSP layout after the existing layout? It seems like a waste of compile time.
> 
> Or the intention is to keep the taildup functionality there? However the tail dup decisions depend on layout decisions, so it is probably better integrated with extTSP.  Also taildup's profile update may not be ideal affecting later layout score computation.
I don't have a strong opinion here but here is my intuition. Ext-tsp is designed to "improve" the existing layout with the help of profile data. When there are ties or when the profile data is absent, the algorithm fallbacks to the original ordering of basic blocks. Thus, we want the original order to be reasonable. So we first apply the existing machine block placement, which applies several sane tricks like loop rotation, and then improve it with ext-tsp.

Keeping the existing tail duplication also sounds reasonable to me; though this optimization could be applied separately. Btw, we are working on improving and replacing the existing taildup heuristic; i'll share diffs/prototypes if/when it produces good results.

================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:234
+  // Adjacent chains and corresponding edges (lists of jumps).
+  std::vector<std::pair<Chain *, Edge *>> Edges;
+};
----------------
rahmanl wrote:
> How do we ensure the obsolete chains are removed from this structure? Would it make life easier if we define it as a map from the Chain pointers (or Chain ids) ?
I think this was done for optimizing the performance. The tradeoff here is finding an adjacent edge (via getEdge(Chain *Other)) vs iterating over and appending the edges. (The former is done only once, when the two chains are merged). Empirically, std::vector+linear search was faster than std::unordered_map on my benchmarks by 10%-15%. Probably an alternative map implementation may work better; not sure if that's worth it though.

================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:634
+
+    // Attach (a part of) ChainPred before the first block of ChainSucc
+    for (auto &Jump : ChainSucc->blocks().front()->InJumps) {
----------------
rahmanl wrote:
> I see the following two loops try to form a new fallthrough between ChainPred and ChainSucc and this is independent from the size of the ChainPred. This could be an independent optimization. Do you think you can do this as a later patch? That way, we can evaluate its benefit /overhead in a better way.
I don't see a clear advantage of moving this optimization to a separate diff. It is a relatively minor aspect of the algorithm and responsible for just a few lines of code. Not sure if the separation would noticeable simplify reviewing.

I do agree with another point here: The algorithm may benefit from some fine-tuning. I don't think the currently utilized parameters and options (e.g., how exactly we split/merge the chains) are optimal; they have been tuned a few years ago on a couple of workloads running on a particular hardware. I would really appreciate the community help.

For now, I'd however keep the params and optimizations as is, so that the existing customers do not see sudden regressions.

================
Comment at: llvm/test/CodeGen/X86/code_placement_ext_tsp.ll:1
+; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux -enable-ext-tsp-block-placement=1 < %s | FileCheck %s
+
----------------
davidxl wrote:
> rahmanl wrote:
> > The interesting code path (splitting the chain) is not exercised by these tests. Please add a test for that.
> The extTSP algorithm should be able to handle common loop rotation case. Can you add a test case about it?
Added some (possibly naive) test. Let me know if you have something more complex in mind.

================
Comment at: llvm/test/CodeGen/X86/code_placement_ext_tsp.ll:1
+; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux -enable-ext-tsp-block-placement=1 < %s | FileCheck %s
+
----------------
spupyrev wrote:
> davidxl wrote:
> > rahmanl wrote:
> > > The interesting code path (splitting the chain) is not exercised by these tests. Please add a test for that.
> > The extTSP algorithm should be able to handle common loop rotation case. Can you add a test case about it?
> Added some (possibly naive) test. Let me know if you have something more complex in mind.
I added some toy test, let me know if you have something more complex in mind.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D113424/new/

https://reviews.llvm.org/D113424