[PATCH] D113424: ext-tsp basic block layout

Thu Nov 18 19:06:14 PST 2021

rahmanl added a comment.

In D113424#3135015 <https://reviews.llvm.org/D113424#3135015>, @spupyrev wrote:

>> Re memory utilization:
>
> I am not sure what exactly the concern is. We keep all the allocated objects as the fields of ExtTSPImpl; no new objects should be created. When we merge two chains, all the jumps from one chain are moved to another one; and one of the chains is cleared. So the overall space is proportional to (|blocks| + |jumps|), unless there is a bug.
>
> To validate correctness, I ran the algorithm on large synthetic instances (with up to 100K blocks); the memory didn't go beyond a few tens of MBs. In fact, it is hard to exactly measure the memory consumption, as the runtime of the implementation is super-linear and doesn't scale to huge instances.

Please see my inline comments. Since we keep all the Chains in the vector (they're not destructed), space usage is currently super-linear. We have a choice of deallocating the space (using shrink_to_fit, resize, or swap) and that's a time-space tradeoff. I am not sure which one is better, since this is already super-linear time, it doesn't hurt to make those calls and keep the space linear. Also, you can use `llc` on a large function to compare memory usage with and without `-enable-ext-tsp-block-placement`. It might very well be the case that the memory consumption remains the same. In that case, maybe you don't want to free up the storage either.

>> Re perf tests on IR/CSIR:
>
> Here are my measurements for clang-10 binary with different PGO modes. Of course we should keep in mind that the numbers won't fully generalize to other binaries/benchmarks and only provide one data point.
>
> Comparing clang-10 using 200 iterations on input1.cpp
>
>   [test] -mllvm -enable-ext-tsp-block-placement=1
>   [control] -mllvm -enable-ext-tsp-block-placement=0
>   -DLLVM_BUILD_INSTRUMENTED=FE
>     task-clock      :         5942 (+- 0.19%) vs         6027 (+- 0.19%) [diff = -1.4170%]
>     binary_size     :         3316628472      vs         3325095000      [diff = -0.2546%]
>   -DLLVM_BUILD_INSTRUMENTED=IR
>     task-clock      :         5793 (+- 0.20%) vs         5876 (+- 0.20%) [diff = -1.4050%]
>     binary_size     :         3305880848      vs         3325234728      [diff = -0.5820%]
>   -DLLVM_BUILD_INSTRUMENTED=CSIR
>     task-clock      :         5705 (+- 0.26%) vs         5757 (+- 0.26%) [diff = -0.8985%]
>     binary_size     :         3312445120      vs         3329071760      [diff = -0.4994%]
>   CSSPGO
>     task-clock      :         6110 (+- 0.37%) vs         6226 (+- 0.37%) [diff = -1.8617%]
>     binary_size     :         3995536616      vs         3980411480      [diff = +0.3799%]
>   AutoFDO
>     task-clock      :         5711 (+- 0.19%) vs         5797 (+- 0.19%) [diff = -1.4882%]
>     binary_size     :         3707959712      vs         3694537688      [diff = +0.3632%]

Thanks for sharing the detailed results.

================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:262-263
+  void clear() {
+    Blocks.clear();
+    Edges.clear();
+  }
----------------
Although this clears the vector, the underlying storage is not released. To free up the storage, we would need additional effort. (shrink_to_fit, resize, or swap).

================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:304
+    Jumps.insert(Jumps.end(), Other->Jumps.begin(), Other->Jumps.end());
+    Other->Jumps.clear();
+  }
----------------
Same comment. Storage won't be released.

================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:659
+  /// element being the corresponding merging type.
+  MergeGainTy mergeGain(Chain *ChainPred, Chain *ChainSucc,
+                        ChainEdge *Edge) const {
----------------
Please rename this to GetBestMergeGain for consistency with the function comment and to further separate it from computeMergeGain.

================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:127
+  bool operator<(const MergeGainTy &Other) const {
+    return (Other.Score > EPS && Other.Score > Score + EPS);
+  }
----------------
This can be removed if we initialize with zero.

================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:137
+private:
+  double Score{-1.0};
+  size_t MergeOffset{0};
----------------
Does this initialization serve a special purpose? We can make it zero if we always reject zero score for merging.

================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:445
+private:
+  /// Initialize algorithm's data structures.
+  void initialize(const NodeSizeMap &NodeSizes, const NodeCountMap &NodeCounts,
----------------

================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:838
+                     [&](const Chain *C1, const Chain *C2) {
+                       // Original entry point to the front
+                       if (C1->isEntry() != C2->isEntry()) {
----------------
Change comment to clearly mention this is for putting the entry block at the beginning.

================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:840-843
+                         if (C1->isEntry())
+                           return true;
+                         if (C2->isEntry())
+                           return false;
----------------
We can simply use `return C1->isEntry()`?

================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:847-852
+                       const double D2 = ChainDensity[C2];
+                       if (D1 != D2)
+                         return D1 > D2;
+
+                       // Making the order deterministic
+                       return C1->id() < C2->id();
----------------
How about shortening this to `return (D1 != D2) ? (D1 > D2) : (C1->id() < C2->id())`?

================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:634
+
+    // Attach (a part of) ChainPred before the first block of ChainSucc
+    for (auto &Jump : ChainSucc->blocks().front()->InJumps) {
----------------
spupyrev wrote:
> rahmanl wrote:
> > I see the following two loops try to form a new fallthrough between ChainPred and ChainSucc and this is independent from the size of the ChainPred. This could be an independent optimization. Do you think you can do this as a later patch? That way, we can evaluate its benefit /overhead in a better way.
> I don't see a clear advantage of moving this optimization to a separate diff. It is a relatively minor aspect of the algorithm and responsible for just a few lines of code. Not sure if the separation would noticeable simplify reviewing.
> 
> I do agree with another point here: The algorithm may benefit from some fine-tuning. I don't think the currently utilized parameters and options (e.g., how exactly we split/merge the chains) are optimal; they have been tuned a few years ago on a couple of workloads running on a particular hardware. I would really appreciate the community help.
> 
> For now, I'd however keep the params and optimizations as is, so that the existing customers do not see sudden regressions.
Agreed. I am mostly interested in separately analyzing the impact of this. I think the user can do so by setting the split parameter to be zero.

One suggestion to make this function more compact: Please consider adding a helper lambda function that given an offset and a list of merge types, does the necessary checks for the offset (like `PredChain->blocks[Offset]->ForcedSucc != nullptr`) and calls `computeMergeGain` for each merge type. I think we can make this function more readable and concise as well.

================
Comment at: llvm/lib/Transforms/Utils/CodeLayout.cpp:683
+        // Try to split the chain in different ways
+        Gain = computeMergeGain(Gain, ChainPred, ChainSucc, Jumps, Offset,
+                                MergeTypeTy::X1_Y_X2);
----------------
rahmanl wrote:
> Why not doing X2_Y_X1 here?
I am still puzzled by why we don't do X2_Y_X1. Is it not beneficial?

================
Comment at: llvm/test/CodeGen/X86/code_placement_ext_tsp.ll:1
+; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux -enable-ext-tsp-block-placement=1 < %s | FileCheck %s
+
----------------
spupyrev wrote:
> spupyrev wrote:
> > davidxl wrote:
> > > rahmanl wrote:
> > > > The interesting code path (splitting the chain) is not exercised by these tests. Please add a test for that.
> > > The extTSP algorithm should be able to handle common loop rotation case. Can you add a test case about it?
> > Added some (possibly naive) test. Let me know if you have something more complex in mind.
> I added some toy test, let me know if you have something more complex in mind.
Thanks for adding the test case. I was thinking of something less complex which exercises the splitting to improve fallthroughs. (Might be worthwhile to check if the old algorithm gets the optimal layout in this case).

         foo
           |   \
           |     \10
           |       \
           |         v
           |17       foo.bb1
           |          / 
           |        /  10
           v      v 
         foo.bb2

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D113424/new/

https://reviews.llvm.org/D113424